[f-nsp] How to protect a foundry device from layer2-loops

Sun Feb 18 14:00:50 EST 2007

Hello colleagues,

We're using spanning tree and vlans in our internal network, and so far
everything has been working fine: spanning tree resolves layer2-loops,
which lets us build redundancy into the network.

A few days ago a disturbing event happened: one of our leased line providers,
who provides us with an untagged vlan between our site and a remote location,
had a failed switch in his network. This caused spanning tree to stop
working and therefore created a layer2-loop.
What we saw then was frightening: our network got "flooded" although we
have only ONE port to the leased line provider and the loop was somewhere
in his network. The link from the ll-provider comes in on a switch that
connects to our Bigiron 4000 core-switch with two links in the same untagged
vlan and uses spanning tree.

Our Bigiron 4000 (SW: Version 07.8.01dT53) started melting down: the cli got
really slow and traffic was no longer switched, or at least there was huge
packet loss.
The log file showed something like this:
W:System: Slot 1   Free Queue decreases less than the desirable values 3 consecutive times.
I:System: Slot 1 Write Sequence Drop 14177005 within 5 minutes. 
I:System: Slot 1 Write Sequence Drop 14170290 within 5 minutes. 
And so on..

Another thing we saw was that a Netiron MLX (software 3.2.x) connected to
the very same vlan got slow on the cli too. The cpu load seemed to be very
high and the device started losing bgp sessions: the peers' hold timers
expired, apparently because it could not send keepalives in time.
N:BGP: Peer x.x.x.x DOWN (Rcv Notification:Hold Timer Expired)

Any idea how one can protect the network in such a situation?
Mac-Limits and Multicast-Limits wouldn't help. I guess broadcast storm
protection/broadcast limits wouldn't help either :-(
Would limiting unknown unicasts help in such a situation? Is there some sort
of intelligence we can use on the switch to detect such situations and apply
appropriate countermeasures?

How can it be that a loop in the ll-provider's network affects our switches
so badly? I mean, not only was the vlan the ll-port is connected to down,
but all other vlans on the switch too, because the switch itself started
failing.

And what exactly happens on a network when there is a layer2-loop? As far
as I understand, a frame sent into the network is copied again and again,
forever, and floods everything.
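
To make that understanding concrete, here is a toy model in Python. It is
only a sketch under simplifying assumptions, not how a Foundry device
actually forwards: every switch floods a broadcast out of all ports except
the one it arrived on, there is no spanning tree, and, as on real Ethernet,
the frame has no TTL that would ever remove a copy. The function name
`flood` and the three example topologies are made up for illustration.

```python
def flood(links, start_switch, steps):
    """Flood one broadcast frame through a network without spanning tree.

    links maps a switch name to the list of its neighbour switches.
    Returns the number of frame copies in flight after each forwarding step.
    """
    # Each in-flight copy is (switch it arrived at, switch it came from).
    in_flight = [(start_switch, None)]
    history = []
    for _ in range(steps):
        next_round = []
        for switch, came_from in in_flight:
            # Flood out of every port except the ingress port.
            for neighbour in links[switch]:
                if neighbour != came_from:
                    next_round.append((neighbour, switch))
        in_flight = next_round
        history.append(len(in_flight))
    return history


# Loop-free chain A - B - C: the frame reaches the edge and is gone.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

# Single loop A - B - C - A: two copies circulate forever.
ring = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

# Four fully meshed switches (several loops): copies double at every step.
mesh = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

print(flood(line, "A", 5))  # [1, 1, 0, 0, 0]
print(flood(ring, "A", 5))  # [2, 2, 2, 2, 2]
print(flood(mesh, "A", 4))  # [3, 6, 12, 24]
```

In this model a single loop keeps the same copies circulating at line rate
without ever dying out, and as soon as there is more than one loop the
number of copies grows with every step. If that matches reality, it would
explain why one broadcast entering through a single port is enough to
saturate everything behind it.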