[j-nsp] Network, trouble after customer created a loop inside a VM host

Fri Nov 7 13:31:10 EST 2014

> From: Jeff Meyers <Jeff.Meyers at gmx.net>
> Subject: [j-nsp] Network, trouble after customer created a loop *inside* a VM host
>
> Hello everybody,
>
> I'm writing to this list because I can't find the reason for
> something we have now seen twice. Here is the setup:
>
>
>
>    Juniper MX480       no RSTP
>          ||
>          ae0
>          ||
>   Juniper EX4550 VC    RSTP bridge id 0
>          ||
>          ae0
>          ||
>   Juniper EX4200 VC    RSTP bridge id 16k
>           |
>     ProCurve 2824      RSTP bridge id 32k
>           |
>       Windows Host
>
>
> So the router itself is not part of the Spanning-Tree, everything below
> is. On the Windows host, the customer is running ESXi with just one
> uplink towards the HP ProCurve switch so there is not even a real danger
> for a physical loop. Now: on the host are two VMs running. Each of them
> has a virtual NIC which is bridged to the physical one of the host.
> Because of a mistake, the customer accidentally bridged his two VMs
> together as well which caused a loop inside the Host. So far, so good.
>
> The trouble begins at this point because immediately we saw partial
> network outages resulting in router messages like this:
>
> Nov  7 14:30:47  cr0 l2ald[2545]: L2ALD_MAC_MOVE_NOTIFICATION: MAC Moves
> detected in the system
>
>
> This message repeated over and over and the ARP counter decreased
> continuously. Hosts flapped and vanished for seconds or minutes and
> internal smokeping measured a lot of loss.
>
> The HP ProCurve logged only excessive broadcast for the customer port
> and that's it. Spanning-Tree didn't recognize anything. The same applies
> to the EX4200 VC and the EX4550 VC: nothing was detected by the
> loop-prevention protocol, and it was only a lucky shot that we knew where
> to look, because the customer called by phone and told us what he did.
>
> The question is: how can that be and what can I do?
>
> On the EX-series switches, each downlink port is configured with
>
> set protocols rstp interface ge-0/0/0 no-root-port
>
> storm-control is enabled on all ports with 85% (but none was detected).
> There is no special configuration on the ProCurve besides the general
> RSTP activation (which is set to RSTP and not STP).
>
>
> So can anybody help with that? I am really stuck here.. :(
>
>
> Thanks in advance,
> Jeff

Jeff,
Once you draw your diagram correctly you'll see what you're up against
(and it ain't pretty).

     Juniper MX480      no RSTP
           ||
           ae0
           ||
    Juniper EX4550 VC   RSTP bridge id 0
           ||
           ae0
           ||
    Juniper EX4200 VC   RSTP bridge id 16k
            |
      ProCurve 2824     RSTP bridge id 32k
            |
        Windows Host    no RSTP
     (virtual switch)
         /   \
        /     \
    VM-host1  VM-host2  virtual hosts with bridging potential (no RSTP)
          \  /
           \/           loop via clients bridging causing ARP 'move'
                        broadcast storm

So your problem is that the final two virtual-switch layers don't
participate in your RSTP, yet they can be looped, which causes your
ARP storm.
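
To make that failure mode concrete, here is a minimal Python sketch (not
from the original exchange). It assumes a toy model in which every node
floods a broadcast out of all links except the one it arrived on, with
nothing blocking any port. The node names and the max_hops cap are
hypothetical; real Ethernet frames carry no hop count, so in practice
nothing stops the circulation.

```python
from collections import deque

# Loop-free case: the two VMs hang off the virtual switch as plain leaves.
TREE = {
    "procurve": ["vswitch"],
    "vswitch": ["procurve", "vm1", "vm2"],
    "vm1": ["vswitch"],
    "vm2": ["vswitch"],
}

# Same topology, but the customer has bridged vm1 and vm2 together.
LOOPED = {
    "procurve": ["vswitch"],
    "vswitch": ["procurve", "vm1", "vm2"],
    "vm1": ["vswitch", "vm2"],
    "vm2": ["vswitch", "vm1"],
}

def flood(links, source, max_hops=20):
    """Count frame copies forwarded after `source` sends one broadcast.

    Each node forwards a received copy on every link except the ingress
    link. The max_hops cap exists only so the simulation terminates.
    """
    copies = 0
    queue = deque((source, nbr, 1) for nbr in links[source])
    while queue:
        sender, node, hops = queue.popleft()
        copies += 1
        if hops >= max_hops:
            continue
        out = list(links[node])
        out.remove(sender)  # never send back on the ingress link
        for nbr in out:
            queue.append((node, nbr, hops + 1))
    return copies

print(flood(TREE, "procurve"))
print(flood(LOOPED, "procurve", max_hops=10))
print(flood(LOOPED, "procurve", max_hops=20))
```

In the tree the single broadcast dies out after reaching the leaves. In
the looped case the copy count keeps growing with the hop cap, and each
pass through the virtual switch pushes another copy back up toward the
ProCurve.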

You can prevent it by fiat: limit that ProCurve port to only 1 or 2
MAC addresses, or force the user to run in VM NAT mode or to run only
one VM at a time.
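
The idea behind a per-port MAC limit can be sketched in Python as below.
This is a simplified model, not any vendor's implementation: the class
name, the MAC addresses, and the drop-only behaviour are assumptions for
illustration. It also assumes the two VM addresses are learned before any
looped-back frame arrives.

```python
class LimitedPort:
    """Edge port that learns at most `limit` source MACs and drops the rest."""

    def __init__(self, limit):
        self.limit = limit
        self.learned = set()

    def accept(self, src_mac):
        """Return True if a frame from src_mac is allowed in on this port."""
        if src_mac in self.learned:
            return True
        if len(self.learned) < self.limit:
            self.learned.add(src_mac)
            return True
        # Over the limit: drop. A real switch might also disable the port.
        return False

port = LimitedPort(limit=2)
print(port.accept("00:50:56:00:00:01"))  # first VM (hypothetical MAC)
print(port.accept("00:50:56:00:00:02"))  # second VM (hypothetical MAC)
print(port.accept("00:11:22:33:44:55"))  # foreign MAC reflected by the loop
```

With the limit in place, a frame that the loop reflects back with someone
else's source address is refused at the customer port instead of being
learned there.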

You may be able to control the damage by limiting the broadcast/storm
thresholds on the leaf ports.
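
For a feel of what a percentage threshold means in frames, here is a rough
Python calculation. It assumes a 1 Gbit/s customer port, minimum-size
64-byte frames (typical for ARP), and a level expressed as a percentage of
link bandwidth. How a given platform actually measures its storm-control
level is an assumption here, so check the documentation for your switch.

```python
def max_frames_per_second(link_bps, frame_bytes=64):
    """Line-rate frames per second on Ethernet.

    Each frame costs an extra 20 bytes on the wire:
    8 bytes preamble/SFD plus 12 bytes inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits

def storm_threshold_pps(link_bps, level_percent, frame_bytes=64):
    """Frames per second needed to reach level_percent of the link."""
    return max_frames_per_second(link_bps, frame_bytes) * level_percent / 100

GIGABIT = 1_000_000_000
for level in (85, 10, 1):
    print(level, round(storm_threshold_pps(GIGABIT, level)))
```

Under these assumptions an 85% level corresponds to more than a million
small broadcast frames per second, which a looped VM host may never
reach even though a far smaller rate already hurts the network. A much
lower level on leaf ports would trip sooner.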

I don't think that the "mac-move-limit" feature will help you, as the
MAC changes are all coming in on the same physical port. The switches
don't care about ARP MAC<->IP flapping; only the router cares about it.
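
The reasoning here is that a MAC "move" is only counted when an address
already known on one port shows up on a different port. A minimal Python
sketch of that bookkeeping follows. It is a simplified model, not the
Junos mac-move-limit implementation, and the port names and addresses are
hypothetical.

```python
class MacTable:
    """Learning table that counts a move when a known MAC changes port."""

    def __init__(self):
        self.port_of = {}
        self.moves = 0

    def learn(self, mac, port):
        previous = self.port_of.get(mac)
        if previous is not None and previous != port:
            self.moves += 1
        self.port_of[mac] = port

table = MacTable()

# Many addresses, all arriving on the same customer port: no moves.
for mac in ("vm1-mac", "vm2-mac", "vm1-mac", "vm2-mac"):
    table.learn(mac, "ge-0/0/5")
print(table.moves)

# One address seen first on the uplink, then on the customer port: a move.
table.learn("router-mac", "ae0")
table.learn("router-mac", "ge-0/0/5")
print(table.moves)
```

In this model, traffic that stays on one physical port never increments
the counter, no matter how many addresses appear behind it.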

Good luck

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
