[c-nsp] Switches/nodes drop off the network

Wed Jan 10 10:38:49 EST 2007

Hi Paul,

On Wed, Jan 10, 2007 at 02:45:31PM -0000, Paul Davies wrote:
> Trent
> 
> > One issue that comes to mind is using 
> >	"switchport block unicast"
> 
> > This essentially stops the switch from flooding packets when the MAC's
> > port is unknown; each device has to send a packet *out* before it is
> > contactable, which can cause things to "disappear" and "come back".
> 
> Thank you for your response. This is interesting, but things have a habit of
> "not coming back", as it were: it takes the hosts hours, as opposed to
> seconds/minutes, to come back, and sometimes they don't come back at all.
> The second time this happened I migrated all switches back to our 3550
> platform because, no matter what I did, the hosts would not come back (yet
> putting a laptop online, on the same port, with any of the IPs in question
> worked fine).
> 
> I found that when it did occur, the distribution's (3750 stack) ARP table
> showed all the IP entries as incomplete for the affected servers (naturally,
> with the laptop connected the ARP entry was complete).
> 
> I had thought that this was some form of bug (it is as if the stack has run
> out of resources and cannot store any more entries, yet we are nowhere
> near the limit of the SDM template in question).
> 
> > Furthermore, a 3750 stack will share the MAC table between switches in
> > the same stack, but multiple stacks will not share it between the
> > stacks, so you may end up in situations with trunks (especially
> > when the trunks aren't as busy) where one stack can get to it but
> > the other cannot.
> 
> Yes this I understand, but these distributions are completely separate and
> the stacks do not require one another for the network to operate (i.e.
> different customer base for different stacks). Each customer edge switch is
> connected to only one stack (there are no edge switches where one port goes
> to one stack and another port goes to another stack), which I believe is
> what you are suggesting could cause problems (which is understandable).

If the two stacks are not connected to each other in a way that requires
layer-2 traffic to pass between them, then yes, this "sub-part" wouldn't
affect you, but the general issue still stands.

The above case just serves to "amplify" the issue: on a single stack in a
busy environment your ARPs may never time out, but over a trunk that
isn't very busy they are more likely to time out.

Cheers,
Trent

> 
> Regards,
>  
> Paul Davies
> -----Original Message-----
> From: Trent Lloyd [mailto:lathiat at bur.st] 
> Sent: 10 January 2007 14:26
> To: Paul Davies
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] Switches/nodes drop off the network
> 
> One issue that comes to mind is using 
> 	"switchport block unicast"
> 
> This essentially stops the switch from flooding packets when the MAC's
> port is unknown; each device has to send a packet *out* before it is
> contactable, which can cause things to "disappear" and "come back".
> 
> Furthermore, a 3750 stack will share the MAC table between switches in
> the same stack, but multiple stacks will not share it between the
> stacks, so you may end up in situations with trunks (especially
> when the trunks aren't as busy) where one stack can get to it but
> the other cannot.
> 
> Cheers, 
> Trent
> 
> On Wed, Jan 10, 2007 at 10:53:10AM -0000, Paul Davies wrote:
> > We currently operate 2 separate stacks of Cisco 3750 switches at our
> > distribution layer.
> > 
> > Stack 1 - Distribution 3
> > 
> > Switch   Ports  Model              SW Version     SW Image
> > ------   -----  -----              ----------     ----------
> >      1   28     WS-C3750G-24TS     12.2(25)SED    C3750-ADVIPSERVICESK
> > *    2   28     WS-C3750G-24TS     12.2(25)SED    C3750-ADVIPSERVICESK
> > 
> > Stack 2 - Distribution 4
> > 
> > Switch   Ports  Model              SW Version     SW Image
> > ------   -----  -----              ----------     ----------
> >      1   28     WS-C3750G-24TS-1U  12.2(25)SED    C3750-ADVIPSERVICESK
> > *    2   28     WS-C3750G-24TS-1U  12.2(25)SED    C3750-ADVIPSERVICESK
> > 
> > All our customer edge switches are WS-C2950T-24, of which there are between
> > 25 and 30, and they use port channel configurations (2 x 1000Mbps) to connect
> > to the distribution switches. The distribution switches contain VLANs for our
> > customers: some private VLANs, some just standard VLANs, all similarly
> > configured (nothing special).
> > 
> > Prior to utilising the 3750G series, we were utilising the 3550 series at
> > our distribution layer (these are still in service for certain customers).
> > We migrated all these switches recently, but have had to migrate them back
> > due to these problems. During the migration period the distribution stack
> > was initially configured with the SDM template "ipv4/ipv6 default".
> > However, once the migration reached the 13th switch, we instantly saw
> > between 35ms and 60ms of latency when tracing through the distribution to
> > any node (apart from nodes on the switch we had just migrated, which were
> > still 0.x ms as expected). Initially I thought the stack was running out of
> > resources due to the SDM template chosen (we have approximately 180 VLANs
> > active and about 18 port channels, storing 800 - 900 MAC addresses; the CPU
> > was constantly high, along with memory usage), so we changed it to
> > "ipv4/ipv6 vlan", but we experienced similar issues. We then changed the
> > template to "desktop default" and all seemed to work fine: all VLANs were
> > active, all port channels were active, there was no latency, routing was
> > fine, there were no problems in general, and CPU load and memory usage
> > were low.
> > 
> > Then a day later (after all had been working fine) some very strange
> > behaviour started: random server nodes seemed to be falling off the
> > network, and on some occasions whole VLANs disappeared. During this period
> > the gateway of the VLAN is reachable globally (including from the customer
> > edge switch); the VLAN is up, and the VLAN trunk is up and functioning on
> > the port channel. The nodes remain down for significant periods (i.e. 3 to
> > 4 hours), and on some occasions they come back online on their own (very
> > randomly). However, if we remove the server from the equation, put a laptop
> > on the port and configure the IP on the laptop, it often works fine and can
> > gain access to the rest of the world (once the old machine is put back it
> > still does not work, though). I have reconfigured VLANs (i.e. changed VLAN
> > numbers), but this does not help. All our switches/routers send all logs to
> > a Syslog server, which during this period shows nothing out of the
> > ordinary. I enabled debugging for various sections; however, this did at
> > one point crash one of the switches, and did not show anything out of the
> > ordinary upon search (however, I did only analyse a fraction of the data).
> > 
> > Due to the continued issues we decided to move back all our switches to
> > the 3550 series until we figured out the problems - we have another 3750
> > stack:
> > 
> > Stack 3 - Distribution 5
> > 
> > Switch   Ports  Model              SW Version     SW Image
> > ------   -----  -----              ----------     ----------
> >      1   26     WS-C3750-24TS      12.2(25)SED    C3750-ADVIPSERVICESK
> > *    2   26     WS-C3750-24TS      12.2(25)SED    C3750-ADVIPSERVICESK
> > 
> > This runs the IPv4/IPv6 default SDM template and works fine (however, it
> > does not have that many VLANs or customers on it at this time). It does,
> > however, occasionally show randomly high CPU load (95% - 100%), which is
> > not due to routing updates, topology changes or anything of that kind,
> > since this switch sees very little action in terms of changes. Again I have
> > enabled logging, and during these periods I cannot see anything out of the
> > ordinary. "show processes cpu history" shows the issue (and at the time
> > access becomes sluggish), yet "show processes cpu sorted" never shows
> > anything out of the ordinary, or any process that is using a lot of CPU;
> > however, the overview figures at the head of the table show the same high
> > figures as the "cpu history" graph.
> > 
> > I believe some of these issues relate to a bug in the IOS somehow - can
> > anyone confirm whether they have had any similar issues?
> > 
> > Please advise with any information anyone has!
> > 
> > Regards,
> >  
> > Paul Davies
> > 
> >  
> > 
> 
> 
> 
> > _______________________________________________
> > cisco-nsp mailing list  cisco-nsp at puck.nether.net
> > https://puck.nether.net/mailman/listinfo/cisco-nsp
> > archive at http://puck.nether.net/pipermail/cisco-nsp/