[f-nsp] How to protect a foundry device from layer2-loops

Fri Feb 23 17:58:34 EST 2007

Hi Gunter,

Great question/problem you describe below... I think it is somewhat common 
for many people.

The tricky part about BigIrons is that they have no effective way to
deal with floods of broadcasts/multicasts/unknown unicasts other than
sending them to the mgmt CPU.  Our JetCore M4 modules can handle about
100,000 pps of this type of traffic before the CPU reaches 100% and routing
protocols start to fail.

As you have learned, running spanning tree does absolutely nothing to
protect your equipment from floods of packets, so some kind of hardware
filtering is needed before packets get punted to the mgmt CPU for
processing.  Why do you believe that some sort of broadcast/multicast
limiting feature would not have helped in this situation?  If you tried to
configure such limiting on a BigIron I could understand why you would feel
that way, but on the MLX it should be no problem.  What kind of switch does
the leased line come in on?
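
As a rough illustration of what such a limiter does, here is a minimal
Python sketch of a per-port cap on flooded frames per one-second window.
The class name, the 3-frame limit, and the timestamps are made up for the
example; real line cards do this in hardware, and this is not Foundry's
implementation.

```python
class BroadcastLimiter:
    """Allow at most `limit_pps` flooded frames per one-second window."""

    def __init__(self, limit_pps):
        self.limit_pps = limit_pps
        self.window = None   # index of the one-second window being counted
        self.count = 0       # frames already allowed in that window

    def allow(self, now_s):
        """Return True to forward the frame, False to drop it."""
        window = int(now_s)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window = window
            self.count = 0
        if self.count < self.limit_pps:
            self.count += 1
            return True
        return False


# Hypothetical usage: five frames arrive within the same second, limit is 3.
limiter = BroadcastLimiter(limit_pps=3)
decisions = [limiter.allow(0.1 * i) for i in range(5)]
print(decisions)
```

Frames beyond the limit are dropped at the port, so they never reach the
mgmt CPU no matter how fast the loop feeds them in.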

Loop problems are caused by broadcast/multicast/unk-unicast frames being
generated somewhere on a network and then being kept on the wire
indefinitely due to infinite forwarding.  If you had a controlled lab
environment with a loop, the switches would be fine.  Even if you injected
a single ARP broadcast, everything would still be fine.  It's not until
enough ARP broadcasts or VRRP heartbeats compound themselves in this type
of environment that traffic levels in the loop start becoming a problem.
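
To put rough numbers on that compounding, here is a back-of-the-envelope
sketch in Python.  It assumes a simple loop where trapped frames are never
dropped or multiplied, ignores link bandwidth limits, and uses a
hypothetical 1 ms loop transit time and 10 new broadcasts per second; only
the 100,000 pps figure comes from the M4 numbers above.

```python
def loop_pps(trapped_frames, loop_transit_us):
    """Frames per second one switch sees from frames circling the loop.

    Each trapped frame passes the switch once per loop transit time.
    """
    return trapped_frames * 1_000_000 / loop_transit_us


def seconds_until_overload(inject_per_s, loop_transit_us, cpu_limit_pps):
    """Seconds of steady injection until circulating frames reach cpu_limit_pps."""
    frames_needed = cpu_limit_pps * loop_transit_us / 1_000_000
    return frames_needed / inject_per_s


# Hypothetical values: 1 ms loop transit, 10 new broadcasts trapped per second.
print(loop_pps(1, 1000))                          # one trapped frame
print(seconds_until_overload(10, 1000, 100_000))  # time to reach 100,000 pps
```

With these assumed values a single trapped frame costs about 1,000 pps,
which is harmless, while ten new broadcasts per second reach the
100,000 pps mark after roughly 10 seconds.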

What is most disheartening is the apparent vulnerability of the XMR.  The
linecard should have been handling any bogus traffic and acting as a filter
for the control plane that talks to the main management process.  Did you
get any snapshots of the LC CPU on the affected XMR interface, or was it just
the main CPU that was showing signs of stress?

Did you happen to get a dm raw from the BigIron to see what type of packets
it was receiving?  That information would be very helpful for putting
preventative measures in place against future problems that might flood
your way.

Thanks,
-Brent

At 11:00 AM 2/18/2007, Gunther Stammwitz wrote:
>Hello colleagues,
>
>
>We're using spanning tree and vlans in our internal network and everything
>is working fine so far since layer2-loops are being resolved by spanning
>tree and we can achieve redundancy this way.
>
>A few days ago a disturbing event happened: one of our leased line providers
>who's providing us an untagged vlan between our site and a remote location
>had a failed switch in his network which caused spanning tree to stop
>working and therefore created a layer2-loop.
>What we saw then was frightening: our network got "flooded" although we
>have only ONE port to the leased line provider and the loop was somewhere
>in his network. The link from the ll-provider was coming in on a switch that
>connects to our Bigiron 4000 core-switch with two links in the same untagged
>vlan and uses spanning tree.
>
>Our Bigiron 4000 (SW: Version 07.8.01dT53) started melting down: the cli got
>really slow and traffic wasn't switched anymore or at least there was a huge
>packet loss.
>The log file showed something like this:
>W:System: Slot 1   Free Queue decreases less than the desirable values 3
>consecutive times.
>I:System: Slot 1 Write Sequence Drop 14177005 within 5 minutes.
>I:System: Slot 1 Write Sequence Drop 14170290 within 5 minutes.
>And so on..
>
>
>Another thing we saw was that a Netiron MLX (software 3.2.x) that was
>connected to the very same vlan got slow on the cli too. The cpu load seemed
>to be very high and the device started losing bgp sessions because the bgp
>timers expired since it obviously didn't answer them in time.
>N:BGP: Peer x.x.x.x DOWN (Rcv Notification:Hold Timer Expired)
>
>
>Any idea how one can protect the network in such a situation?
>Mac-Limits and Multicast-Limits wouldn't help. I guess broadcast storm
>protection/broadcast limits wouldn't help either :-(
>Would Limiting Unknown Unicasts help in such a situation? Is there some sort
>of intelligence we can use on the switch in order to detect such situations
>and use appropriate counter measures?
>
>How can it be that a loop in the ll providers network affects our switches
>in such a bad way?  I mean not only the vlan the ll-port was connected to
>was down but all other vlans on the switch too because the switch started
>failing.
>
>
>
>
>And what exactly happens on a network when there is a layer2-loop? As
>far as I understand, a packet sent into the network is copied and
>copied again forever and floods everything.
>
>
>
>_______________________________________________
>foundry-nsp mailing list
>foundry-nsp at puck.nether.net
>http://puck.nether.net/mailman/listinfo/foundry-nsp