[c-nsp] Cisco 3020 blade switches hung, HLFM errors, network meltdown?

Ramcharan, Vijay A vijay.ramcharan at verizonbusiness.com
Wed Mar 19 14:22:31 EDT 2008


Interesting... 
I ran into a similar network meltdown last year during a 6509 to
6509-E migration. It was a fairly small switching network (servers only,
no connected end users), complicated by blade server ESMs and by the
interim 4948 switches used during the migration.

The network ran fine for a day, then went down. The 6509-E switches
logged duplicate HSRP address messages which, after some research,
indicated a loop. Breaking a couple of redundant links by unplugging
them brought things back to operational. It was a good thing the site
diagram had those redundant links clearly marked.
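
For anyone chasing the same symptom, the checks that point at a loop
are roughly along these lines (purely a sketch; the VLAN and interface
numbers are placeholders, not from the actual site):

  show standby brief                          (HSRP groups going active/active)
  show spanning-tree vlan 10 detail           (topology change counts)
  show spanning-tree inconsistentports
  show interfaces Gi1/1 | include broadcasts

Nothing fancy, but between the duplicate HSRP messages and the
topology-change counters you can usually tell whether you're looping.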

As far as I can tell, either a blade server chassis did something to
cause the outage (one of them was behaving erratically before it), or
the network diameter grew a bit larger than the recommended maximum
because the interim 4948 switches were daisy-chained together.
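
If the diameter theory holds, about the only knobs are shrinking the
topology or relaxing the STP timers on the root bridge so BPDUs survive
the extra hops; a rough sketch of the latter (VLAN number and values
are placeholders, not what I actually ran):

  spanning-tree vlan 10 max-age 30
  spanning-tree vlan 10 forward-time 21

The cleaner fix, of course, is pulling the daisy-chained interim
switches back out once the migration is done.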

I haven't had much experience with meltdowns, but I believe a meltdown
usually happens at a specific point in time, triggered by one or more
immediate factors, rather than building up gradually over a long period
(24 hours or more, say). That leads me to believe the blade server ESM
had a part in it, but that's just an unproven hypothesis.

I did come across that famous large-scale meltdown (at a hospital, I
believe) while reading up on STP loops, and that made for good reading.

Vijay Ramcharan 
  
-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of matthew zeier
Sent: March 19, 2008 11:49
To: cisco-nsp at puck.nether.net
Subject: [c-nsp] Cisco 3020 blade switches hung, HLFM errors, network
meltdown?


I have an HP blade system with two WS-CBS3020-HPQ switches. The console
logged the following errors, during which the entire network was
unreachable:

(6444)msecs, more than (2000)msecs (719/326),process = HLFM address 
learning process.
-Traceback= 4794B0 479A4C 4799B0 2E9E64 4F788C 32D6C4 11B980 11BEF0 
11D684 326BC4 322F90
323240 A86D34 A7D2FC
18w0d: %SYS-3-CPUHOG: Task is running for (2152)msecs, more than
(2000)msecs
(143/1),process = HLFM address learning process.
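
If it happens again, what I'd want to grab off the console before
rebooting is roughly this (just a sketch; I don't know yet which of
these actually matters for HLFM):

  show processes cpu sorted
  show processes cpu history
  show mac address-table count
  show logging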

With their uplinks to the network disabled, the switches were still
unreachable/unusable, even through Fa0/0. I had to reboot each before I
could telnet back in.

Disconnecting them from the network brought the network back, 
reconnecting melted the network.

It felt like a broadcast storm or even a spanning-tree loop, but I'd be
surprised if it was the latter and the upstream switches, two 6500s,
didn't know how to deal with that (heck, they deal with HP 2510s that
default to not running spanning-tree).
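
If it does turn out to be a storm, per-port storm control on the 6500
uplinks toward the blade enclosure would at least contain the blast
radius; a sketch (interface and threshold are placeholders):

  interface GigabitEthernet1/1
   storm-control broadcast level 1.00
   storm-control action trap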

From some of the log entries I could glean from the console buffer, it
looks like the native VLAN on one of the port-channel members was
inadvertently changed and the port was marked as incompatible with the
other bundle member. Still, I'm somewhat surprised that that hung the
blade switches to the extent that everything else became unusable.
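
If the native VLAN really is the culprit, confirming and fixing it
should only take something like this (interface and VLAN numbers here
are made up, not from my config):

  show interfaces trunk
  show etherchannel summary
  !
  interface range GigabitEthernet0/23 - 24
   switchport trunk native vlan 100

but that still shouldn't have hung the whole switch.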

Any insights?

[I'm stuck in this place where the WS-CBS3020-HPQ's aren't registered 
with CCO and my reseller says I have to talk to HP for support...]
_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/

