[f-nsp] How to protect a foundry device from layer2-loops
Joseph Jackson
JJackson at aninetworks.com
Fri Feb 23 23:04:40 EST 2007
Wouldn't one way to get around this be to use layer 3 routing on those
links from the carrier?
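
For example, something along these lines on the BigIron, assuming it runs
layer-3 code (the port number and addressing are made up, and the exact
syntax should be checked against your IronWare release):

  interface ethernet 3/1
   route-only
   ip address 192.0.2.1/30

With the port set to route-only it stops layer-2 switching on that
interface, so broadcast storms arriving from the carrier are not re-flooded
into your internal VLANs; the router can still see them, but they stay off
the rest of the layer-2 domain.
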
> -----Original Message-----
> From: foundry-nsp-bounces at puck.nether.net
> [mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of
> Brent Van Dussen
> Sent: Friday, February 23, 2007 2:59 PM
> To: Gunther Stammwitz; foundry-nsp at puck.nether.net
> Subject: Re: [f-nsp] How to protect a foundry device from layer2-loops
>
> Hi Gunther,
>
> Great question/problem you describe below... I think it is a fairly
> common one.
>
> The tricky part about BigIrons is that they don't have a way to
> effectively deal with floods of broadcasts/multicasts/unknown unicasts
> other than sending them to the mgmt CPU. Our JetCore M4 modules can
> handle about 100,000 pps of this type of traffic before the CPU reaches
> 100% and routing protocols start to fail.
>
> As you have learned, running spanning tree does absolutely nothing to
> protect your equipment from floods of packets, so some kind of hardware
> filtering is needed before packets get punted to the mgmt CPU for
> processing. Why do you believe that some sort of broadcast/multicast
> limiting feature would not have helped in this situation? If you tried
> to run that command on a BigIron I could understand why you would feel
> that way, but on the MLX it should be no problem. What kind of switch
> does the leased line come in on?
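>
> For reference, the sort of thing I have in mind (the port number is just
> an example, and the exact keywords and allowed values differ between
> IronCore and JetCore modules and software releases, so please check the
> config guide for your code before relying on this):
>
>   interface ethernet 1/1
>    broadcast limit 65536
>    multicast limit
>
> That caps the rate of broadcast (and, with the second command, multicast)
> frames accepted on the port before they can bury the management CPU.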
>
> Loops are caused by broadcast/multicast/unknown-unicast frames being
> generated somewhere on a network and then being kept on the wire
> indefinitely due to infinite forwarding. In a controlled lab environment
> with a loop, the switches would be fine at first; even injecting a
> single ARP broadcast wouldn't hurt. It's not until enough ARP broadcasts
> or VRRP heartbeats compound themselves in this type of environment that
> traffic levels in the loop start becoming a problem.
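>
> Rough numbers to illustrate (purely hypothetical): if background traffic
> injects, say, 10 ARP requests and VRRP hellos per second into the looped
> segment and nothing ever removes them, after a minute there are on the
> order of 600 frames circulating, each re-flooded out every port in the
> VLAN on every pass around the loop. It doesn't take long before the
> aggregate is past the ~100,000 pps a JetCore management CPU can absorb,
> and from there the meltdown you describe follows.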
>
> What is most disheartening is the apparent vulnerability the XMR has.
> The linecard should have been handling any bogus traffic and acting as
> a filter to the control plane that talks to the main management process.
> Did you get any snapshots of the linecard CPU on the affected XMR
> interface, or was it just the main CPU that was showing signs of stress?
>
> Did you happen to get a "dm raw" from the BigIron to see what type of
> packets it was receiving? That information would be greatly beneficial
> to putting in place preventative measures for future problems that might
> flood your way.
>
> Thanks,
> -Brent
>
>
>
> At 11:00 AM 2/18/2007, Gunther Stammwitz wrote:
> >Hello colleagues,
> >
> >
> >We're using spanning tree and VLANs in our internal network, and
> >everything is working fine so far since layer-2 loops are resolved by
> >spanning tree and we can achieve redundancy this way.
> >
> >A few days ago a disturbing event happened: one of our leased-line
> >providers, who provides us an untagged VLAN between our site and a
> >remote location, had a failed switch in his network, which caused
> >spanning tree to stop working and therefore created a layer-2 loop.
> >What we saw then was frightening: our network got "flooded" although
> >we have only ONE port to the leased-line provider and the loop was
> >somewhere in his network. The link from the LL provider came in on a
> >switch that connects to our BigIron 4000 core switch with two links in
> >the same untagged VLAN and runs spanning tree.
> >
> >Our BigIron 4000 (SW: Version 07.8.01dT53) started melting down: the
> >CLI got really slow and traffic wasn't switched anymore, or at least
> >there was huge packet loss.
> >The log file showed something like this:
> >W:System: Slot 1 Free Queue decreases less than the desirable values 3 consecutive times.
> >I:System: Slot 1 Write Sequence Drop 14177005 within 5 minutes.
> >I:System: Slot 1 Write Sequence Drop 14170290 within 5 minutes.
> >And so on...
> >
> >
> >Another thing we saw was that a NetIron MLX (software 3.2.x) that was
> >connected to the very same VLAN got slow on the CLI too. The CPU load
> >seemed to be very high and the device started losing BGP sessions
> >because the BGP timers expired, since it obviously didn't answer them
> >in time.
> >N:BGP: Peer x.x.x.x DOWN (Rcv Notification:Hold Timer Expired)
> >
> >
> >Any idea how one can protect the network in such a situation?
> >MAC limits and multicast limits wouldn't help. I guess broadcast storm
> >protection/broadcast limits wouldn't help either :-(
> >Would limiting unknown unicasts help in such a situation? Is there some
> >sort of intelligence we can use on the switch in order to detect such
> >situations and take appropriate countermeasures?
> >
> >How can it be that a loop in the LL provider's network affects our
> >switches in such a bad way? I mean not only was the VLAN the LL port
> >was connected to down, but all other VLANs on the switch too, because
> >the switch started failing.
> >
> >And what exactly is happening on a network when there is a layer-2
> >loop? As far as I understand it, a packet sent onto the network is
> >copied again and again forever and floods everything.
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp at puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
>