[f-nsp] How to protect a foundry device from layer2-loops

Sat Feb 24 05:19:27 EST 2007

Hi,
A port security MAC limit or a packet storm limit would fix this. Looped ports always produce broadcast storms and loops of data, and receiving many more MACs than you are supposed to is a sure sign there's something wrong.
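
To make the idea concrete, here is a rough Python sketch of the check I
mean. It is not Foundry configuration or any vendor's API: the function
name, both thresholds and the (port, src_mac, is_broadcast) tuple format
are made up for illustration. It simply flags any port that shows too
many source MACs or too many broadcasts per second within one window.

```python
from collections import defaultdict

# Hypothetical thresholds; tune them for the real network.
MAX_MACS_PER_PORT = 64
MAX_BCAST_PPS = 1000


def suspicious_ports(frames, window_seconds):
    """Return the sorted list of ports that exceed either threshold.

    frames is an iterable of (port, src_mac, is_broadcast) tuples
    observed during a window of window_seconds seconds.
    """
    macs = defaultdict(set)    # port -> distinct source MACs seen
    bcasts = defaultdict(int)  # port -> number of broadcast frames seen
    for port, src_mac, is_broadcast in frames:
        macs[port].add(src_mac)
        if is_broadcast:
            bcasts[port] += 1

    flagged = set()
    for port in macs:
        if len(macs[port]) > MAX_MACS_PER_PORT:
            flagged.add(port)
        if bcasts[port] / window_seconds > MAX_BCAST_PPS:
            flagged.add(port)
    return sorted(flagged)
```

A real switch does this in hardware per port; the sketch only shows why
either counter alone is enough to spot a looped port.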

Steve

On Fri, Feb 23, 2007 at 02:58:34PM -0800, Brent Van Dussen wrote:
> Hi Gunther,
> 
> Great question/problem you describe below... I think it is somewhat common 
> for many people.
> 
> The tricky part about BigIrons is that they don't have a way to effectively 
> deal with floods of broadcasts/multicasts/unknown unicasts other than 
> sending them to the mgmt CPU.  Our JetCore M4 modules can handle about 
> 100,000 pps of this type of traffic before the CPU reaches 100% and routing 
> protocols start to fail.
> 
> As you have learned, running spanning tree does absolutely nothing to 
> protect your equipment from floods of packets so some kind of hardware 
> filtering is needed before packets get punted to the mgmt cpu for 
> processing.  Why do you believe that some sort of broadcast/multicast 
> limiting feature would not have helped in this situation?  If you tried to 
> run that command on a bigiron I could understand why you would feel that 
> but from the MLX it should be no problem.  What kind of switch does the 
> leased line come in on?
> 
> Floods in a loop are caused by broadcast/multicast/unk-unicast frames being 
> generated somewhere on a network and then being kept on the wire 
> indefinitely due to infinite forwarding.  If you had a controlled lab 
> environment with a loop, the switches would be fine; even after you inject 
> a single ARP broadcast, everything would be fine. It's not until enough ARP 
> broadcasts or VRRP heartbeats compound themselves in this type of 
> environment that traffic levels in the loop start becoming a problem.
> 
> What is most disheartening is the apparent vulnerability the XMR has.  The 
> linecard should have been handling any bogus traffic and acting as a filter 
> to the control plane that talks to the main management process.  Did you 
> get any snapshots of LC CPU on the affected XMR interface or was it just 
> the main CPU that was showing signs of stress?
> 
> Did you happen to get a dm raw from the bigiron to see what type of packets 
> it was receiving?  That information would be greatly beneficial to putting 
> in place preventative measures for future problems that might flood your way.
> 
> Thanks,
> -Brent
> 
> 
> 
> At 11:00 AM 2/18/2007, Gunther Stammwitz wrote:
> >Hello colleagues,
> >
> >
> >We're using spanning tree and vlans in our internal network and everything
> >is working fine so far since layer2-loops are being resolved by spanning
> >tree and we can achieve redundancy this way.
> >
> >A few days ago a disturbing event happened: one of our leased line providers
> >who's providing us an untagged vlan between our site and a remote location
> >had a failed switch in his network which caused spanning tree to stop
> >working and therefore created a layer2-loop.
> >What we saw then was frightening: our network got "flooded" although we
> >have only ONE port to the leased line provider and the loop was somewhere
> >in his network. The link from the ll-provider was coming in on a switch that
> >connects to our Bigiron 4000 core-switch with two links in the same untagged
> >vlan and uses spanning tree.
> >
> >Our Bigiron 4000 (SW: Version 07.8.01dT53) started melting down: the cli got
> >really slow and traffic wasn't switched anymore or at least there was a huge
> >packet loss.
> >The log file showed something like this:
> >W:System: Slot 1   Free Queue decreases less than the desirable values 3 consecutive times.
> >I:System: Slot 1 Write Sequence Drop 14177005 within 5 minutes.
> >I:System: Slot 1 Write Sequence Drop 14170290 within 5 minutes.
> >And so on..
> >
> >
> >Another thing we saw was that a Netiron MLX (software 3.2.x) that was
> >connected to the very same vlan got slow on the cli too. The cpu load seemed
> >to be very high and the device started losing bgp sessions because the bgp
> >timers expired since it obviously didn't answer them in time.
> >N:BGP: Peer x.x.x.x DOWN (Rcv Notification:Hold Timer Expired)
> >
> >
> >Any idea how one can protect the network in such a situation?
> >Mac-Limits and Multicast-Limits wouldn't help. I guess broadcast storm
> >protection/broadcast limits wouldn't help either :-(
> >Would Limiting Unknown Unicasts help in such a situation? Is there some sort
> >of intelligence we can use on the switch in order to detect such situations
> >and use appropriate counter measures?
> >
> >How can it be that a loop in the ll providers network affects our switches
> >in such a bad way?  I mean not only the vlan the ll-port was connected to
> >was down but all other vlans on the switch too because the switch started
> >failing.
> >
> >
> >
> >
> >And what exactly is happening on a network when there is a layer2-loop? As
> >far as I understand, a packet sent to the network is copied again and
> >again forever and floods everything.
> >
> >
> >
> >_______________________________________________
> >foundry-nsp mailing list
> >foundry-nsp at puck.nether.net
> >http://puck.nether.net/mailman/listinfo/foundry-nsp
> 