[f-nsp] How to protect a foundry device from layer2-loops
Joseph Jackson
JJackson at aninetworks.com
Fri Feb 23 23:04:40 EST 2007
Wouldn't one way to get around this be to use layer 3 routing on those
links from the carrier?
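
For example, something along these lines on the BigIron, assuming it runs
layer-3 code (the port number and addressing are made up, and the exact
syntax should be checked against your IronWare release):

  interface ethernet 3/1
   route-only
   ip address 192.0.2.1/30

With the port set to route-only it stops layer-2 switching on that
interface, so broadcast storms arriving from the carrier are not re-flooded
into your internal VLANs; the router can still see them, but they stay off
the rest of the layer-2 domain.
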
> -----Original Message-----
> From: foundry-nsp-bounces at puck.nether.net
> [mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of
> Brent Van Dussen
> Sent: Friday, February 23, 2007 2:59 PM
> To: Gunther Stammwitz; foundry-nsp at puck.nether.net
> Subject: Re: [f-nsp] How to protect a foundry device from layer2-loops
>
> Hi Gunther,
>
> Great question/problem you describe below... I think it is a fairly
> common one.
>
> The tricky part about BigIrons is that they don't have a way to
> effectively deal with floods of broadcasts/multicasts/unknown unicasts
> other than sending them to the mgmt CPU. Our JetCore M4 modules can
> handle about 100,000 pps of this type of traffic before the CPU reaches
> 100% and routing protocols start to fail.
>
> As you have learned, running spanning tree does absolutely nothing to
> protect your equipment from floods of packets, so some kind of hardware
> filtering is needed before packets get punted to the mgmt CPU for
> processing. Why do you believe that some sort of broadcast/multicast
> limiting feature would not have helped in this situation? If you tried
> to run that command on a BigIron I could understand why you would feel
> that way, but on the MLX it should be no problem. What kind of switch
> does the leased line come in on?
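>
> For reference, the sort of thing I have in mind (the port number is just
> an example, and the exact keywords and allowed values differ between
> IronCore and JetCore modules and software releases, so please check the
> config guide for your code before relying on this):
>
>   interface ethernet 1/1
>    broadcast limit 65536
>    multicast limit
>
> That caps the rate of broadcast (and, with the second command, multicast)
> frames accepted on the port before they can bury the management CPU.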
>
> Loops are caused by broadcast/multicast/unknown-unicast frames being
> generated somewhere on a network and then being kept on the wire
> indefinitely due to infinite forwarding. In a controlled lab environment
> with a loop, the switches would be fine at first; even injecting a
> single ARP broadcast wouldn't hurt. It's not until enough ARP broadcasts
> or VRRP heartbeats compound themselves in this type of environment that
> traffic levels in the loop start becoming a problem.
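>
> Rough numbers to illustrate (purely hypothetical): if background traffic
> injects, say, 10 ARP requests and VRRP hellos per second into the looped
> segment and nothing ever removes them, after a minute there are on the
> order of 600 frames circulating, each re-flooded out every port in the
> VLAN on every pass around the loop. It doesn't take long before the
> aggregate is past the ~100,000 pps a JetCore management CPU can absorb,
> and from there the meltdown you describe follows.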
>
> What is most disheartening is the apparent vulnerability the XMR has.
> The linecard should have been handling any bogus traffic and acting as
> a filter to the control plane that talks to the main management process.
> Did you get any snapshots of the linecard CPU on the affected XMR
> interface, or was it just the main CPU that was showing signs of stress?
>
> Did you happen to get a "dm raw" from the BigIron to see what type of
> packets it was receiving? That information would be greatly beneficial
> to putting in place preventative measures for future problems that might
> flood your way.
>
> Thanks,
> -Brent
>
>
>
> At 11:00 AM 2/18/2007, Gunther Stammwitz wrote:
> >Hello colleagues,
> >
> >
> >We're using spanning tree and VLANs in our internal network, and
> >everything is working fine so far since layer-2 loops are resolved by
> >spanning tree and we can achieve redundancy this way.
> >
> >A few days ago a disturbing event happened: one of our leased-line
> >providers, who provides us an untagged VLAN between our site and a
> >remote location, had a failed switch in his network, which caused
> >spanning tree to stop working and therefore created a layer-2 loop.
> >What we saw then was frightening: our network got "flooded" although
> >we have only ONE port to the leased-line provider and the loop was
> >somewhere in his network. The link from the LL provider came in on a
> >switch that connects to our BigIron 4000 core switch with two links in
> >the same untagged VLAN and runs spanning tree.
> >
> >Our BigIron 4000 (SW: Version 07.8.01dT53) started melting down: the
> >CLI got really slow and traffic wasn't switched anymore, or at least
> >there was huge packet loss.
> >The log file showed something like this:
> >W:System: Slot 1 Free Queue decreases less than the desirable values 3 consecutive times.
> >I:System: Slot 1 Write Sequence Drop 14177005 within 5 minutes.
> >I:System: Slot 1 Write Sequence Drop 14170290 within 5 minutes.
> >And so on...
> >
> >
> >Another thing we saw was that a NetIron MLX (software 3.2.x) that was
> >connected to the very same VLAN got slow on the CLI too. The CPU load
> >seemed to be very high and the device started losing BGP sessions
> >because the BGP timers expired, since it obviously didn't answer them
> >in time.
> >N:BGP: Peer x.x.x.x DOWN (Rcv Notification:Hold Timer Expired)
> >
> >
> >Any idea how one can protect the network in such a situation?
> >MAC limits and multicast limits wouldn't help. I guess broadcast storm
> >protection/broadcast limits wouldn't help either :-(
> >Would limiting unknown unicasts help in such a situation? Is there some
> >sort of intelligence we can use on the switch in order to detect such
> >situations and take appropriate countermeasures?
> >
> >How can it be that a loop in the LL provider's network affects our
> >switches in such a bad way? I mean not only was the VLAN the LL port
> >was connected to down, but all other VLANs on the switch too, because
> >the switch started failing.
> >
> >And what exactly is happening on a network when there is a layer-2
> >loop? As far as I understand it, a packet sent onto the network is
> >copied again and again forever and floods everything.
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp at puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
>