[c-nsp] Switches/nodes drop off the network

Wed Jan 10 05:53:10 EST 2007

We currently operate 2 separate stacks of Cisco 3750 switches at our
distribution layer.

Stack 1 - Distribution 3

Switch   Ports  Model              SW Version     SW Image
------   -----  -----              ----------     --------
     1   28     WS-C3750G-24TS     12.2(25)SED    C3750-ADVIPSERVICESK
*    2   28     WS-C3750G-24TS     12.2(25)SED    C3750-ADVIPSERVICESK

Stack 2 - Distribution 4

Switch   Ports  Model              SW Version     SW Image
------   -----  -----              ----------     --------
     1   28     WS-C3750G-24TS-1U  12.2(25)SED    C3750-ADVIPSERVICESK
*    2   28     WS-C3750G-24TS-1U  12.2(25)SED    C3750-ADVIPSERVICESK

All our customer edge switches are WS-C2950T-24s, of which there are between
25 and 30. Each uses a port channel (2 x 1000Mbps) to connect to the
distribution switches. The distribution switches carry the VLANs for our
customers: some are private VLANs, some are standard VLANs, and all are
similarly configured (nothing special).

Prior to utilising the 3750G series, we were utilising the 3550 series at
our distribution layer (these are still in service for certain customers).
We migrated all these switches recently, but have had to migrate them back
due to the problems described below.

During the migration period the distribution stack was initially configured
with the "ipv4/ipv6 default" SDM template. Once we got to the 13th switch,
we instantly saw between 35ms and 60ms of latency when tracing through the
distribution to any node (apart from nodes on the switch we had just
migrated, which were still 0.x ms as expected). Initially I thought the
stack was running out of resources because of the SDM template chosen: we
have approximately 180 VLANs active and about 18 port channels, storing
800 - 900 MAC addresses, and the CPU was constantly high along with memory
usage. We therefore changed the template to "ipv4/ipv6 vlan" and experienced
similar issues. We then changed the template to "desktop default" and all
seemed to work fine: all VLANs were active, all port channels were active,
there was no latency, routing was fine, and CPU load and memory usage were
low.

Then a day later (after all had been working fine) some very strange
behaviour started: random server nodes seemed to be falling off the
network, and on some occasions whole VLANs disappeared. During these
periods the gateway of the VLAN is reachable globally (including from the
customer edge switch), the VLAN is up, and the VLAN trunk is up and
functioning on the port channel. The nodes remain down for significant
periods (e.g. 3 to 4 hours), and on some occasions they come back online on
their own (very random). However, if we remove the server from the equation,
put a laptop on the port and configure the IP on the laptop, it often works
fine and can reach the rest of the world (once the old machine is put back
it still does not work, though). I have reconfigured VLANs (i.e. changed
VLAN numbers), which did not help. All our switches/routers send all logs to
a syslog server, which shows nothing out of the ordinary during these
periods. I enabled debugging for various sections; this did at one point
crash one of the switches, and the output showed nothing out of the ordinary
when searched (although I only analysed a fraction of the data).

Due to the continued issues we decided to move all our switches back to the
3550 series until we have figured out the problems. We have another 3750
stack:

Stack 3 - Distribution 5

Switch   Ports  Model              SW Version     SW Image
------   -----  -----              ----------     --------
     1   26     WS-C3750-24TS      12.2(25)SED    C3750-ADVIPSERVICESK
*    2   26     WS-C3750-24TS      12.2(25)SED    C3750-ADVIPSERVICESK

This runs the IPv4/IPv6 default SDM template and works fine (although it
does not have many VLANs or customers on it at this time). However, it does
occasionally show randomly high CPU load (95% - 100%), which is not due to
routing updates, topology changes or anything of that kind, since this
switch sees very little action in terms of changes. Again I have enabled
logging, and during these periods I cannot see anything out of the ordinary.
"show processes cpu history" shows the spikes (and at the time access
becomes sluggish), yet "show processes cpu sorted" never shows anything out
of the ordinary, nor any process that is using a lot of CPU, even though the
overview figures at the head of the table show the same high figures as the
"cpu history" graph.

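One way to narrow down the mismatch described above (high figures in the
header, no busy process in the list) is to split the header line of "show
processes cpu" into its parts. On IOS the five-second figure is printed as
two numbers, total/interrupt, so a high second number would point at
interrupt-level load rather than at any listed process. The following is a
minimal Python sketch only: the function name and the sample line are
assumptions made for illustration, and the sample is not output captured
from these switches.

```python
import re

# Header line printed at the top of "show processes cpu" on IOS, e.g.
# "CPU utilization for five seconds: 97%/82%; one minute: 95%; five minutes: 90%"
# The first five-second number is total CPU, the second is interrupt level.
HEADER_RE = re.compile(
    r"CPU utilization for five seconds:\s*(\d+)%/(\d+)%;"
    r"\s*one minute:\s*(\d+)%;\s*five minutes:\s*(\d+)%"
)


def parse_cpu_header(line):
    """Split the CPU header into total, interrupt and process-level load."""
    match = HEADER_RE.search(line)
    if match is None:
        raise ValueError("not a CPU utilization header line")
    total, interrupt, one_min, five_min = (int(g) for g in match.groups())
    return {
        "total_5s": total,
        "interrupt_5s": interrupt,
        # Whatever is not interrupt-level is attributable to processes.
        "process_5s": total - interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


# Hypothetical sample line, invented for illustration only.
sample = ("CPU utilization for five seconds: 97%/82%; "
          "one minute: 95%; five minutes: 90%")
print(parse_cpu_header(sample))
```

If the interrupt figure accounts for most of the total during a spike, the
per-process list would be expected to look idle, which would match what is
seen here; if not, the cause is more likely elsewhere.
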
I believe some of these issues relate to a bug in IOS somehow. Can anyone
confirm whether they have had any similar issues?

Please advise if anyone has any information on this!

Regards,

Paul Davies
