[j-nsp] Network, trouble after customer created a loop *inside* a VM host
David B Funk
dbfunk at engineering.uiowa.edu
Fri Nov 7 13:31:10 EST 2014
> From: Jeff Meyers <Jeff.Meyers at gmx.net>
> Subject: [j-nsp] Network, trouble after customer created a loop
> *inside* a VM host
>
> Hello everybody,
>
> I'm writing to this list because I can't seem to find the reason for
> what we have now seen twice. Here is the setup:
>
>
>
>   Juniper MX480          no RSTP
>        ||
>        ae0
>        ||
>   Juniper EX4550 VC      RSTP bridge id 0
>        ||
>        ae0
>        ||
>   Juniper EX4200 VC      RSTP bridge id 16k
>        |
>   ProCurve 2824          RSTP bridge id 32k
>        |
>   Windows Host
>
>
> So the router itself is not part of the Spanning-Tree, everything below
> is. On the Windows host, the customer is running ESXi with just one
> uplink towards the HP ProCurve switch so there is not even a real danger
> for a physical loop. Now: on the host are two VMs running. Each of them
> has a virtual NIC which is bridged to the physical one of the host.
> Because of a mistake, the customer accidentally bridged his two VMs
> together as well which caused a loop inside the Host. So far, so good.
>
> The trouble begins at this point because immediately we saw partial
> network outages resulting in router messages like this:
>
> Nov 7 14:30:47 cr0 l2ald[2545]: L2ALD_MAC_MOVE_NOTIFICATION: MAC Moves
> detected in the system
>
>
> This message repeated over and over and the ARP counter decreased
> continuously. Hosts flapped and vanished for seconds or minutes and
> internal smokeping measured a lot of loss.
>
> The HP ProCurve logged only excessive broadcast for the customer port
> and that's it. Spanning-Tree didn't recognize anything. The same applies
> to the EX4200 VC and the EX4550 VC: nothing was detected by the loop
> prevention protocol, and it was only luck that we knew where to
> look, because the customer called by phone and told us what he did.
>
> The question is: how can that be and what can I do?
>
> On the EX-series switches, each downlink port is configured with
>
> set protocols rstp interface ge-0/0/0 no-root-port
>
> storm-control is enabled on all ports at 85% (but no storm was detected).
> There is no special configuration on the ProCurve besides the general
> RSTP activation (which is set to RSTP and not STP).
>
>
> So can anybody help with that? I am really stuck here.. :(
>
>
> Thanks in advance,
> Jeff
Jeff,
Once you draw your diagram correctly you'll see what you're up against
(and it ain't pretty).

  Juniper MX480          no RSTP
       ||
       ae0
       ||
  Juniper EX4550 VC      RSTP bridge id 0
       ||
       ae0
       ||
  Juniper EX4200 VC      RSTP bridge id 16k
       |
  ProCurve 2824          RSTP bridge id 32k
       |
  Windows Host           no RSTP
  (virtual switch)
      /  \
     /    \
VM-host1  VM-host2       virtual hosts with bridging potential (no RSTP)
     \    /
      \  /               loop via clients bridging causing ARP 'move'
       \/                broadcast storm

So your problem is that the final two virtual-switch layers don't
participate in your RSTP but can be looped, causing your ARP storm.

You can prevent it by fiat: limit that ProCurve port to only 1 or 2
MAC addresses, or force the user to run in VM-NAT mode or to run only
one VM at a time.

You may be able to control the damage by tightening the broadcast/storm
thresholds on the leaf ports.
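
For a rough sense of scale, here is a small Python sketch of what a
percentage storm-control threshold works out to in frames per second.
It assumes the threshold is a percentage of link bandwidth, that the
leaf port runs at 1 Gb/s, and that the storm consists of minimum-size
64-byte frames (plus 20 bytes of preamble and inter-frame gap on the
wire). The function name and the 1% comparison value are illustrative
only and are not taken from any switch documentation.

```python
# Rough arithmetic only: how many minimum-size broadcast frames per second
# fit under a percentage-of-bandwidth storm-control threshold.

WIRE_OVERHEAD_BYTES = 20   # 8 bytes preamble/SFD + 12 bytes inter-frame gap
MIN_FRAME_BYTES = 64       # minimum Ethernet frame size, e.g. an ARP request


def frames_per_second(link_bps, threshold_percent, frame_bytes=MIN_FRAME_BYTES):
    """Frames/s that fit under a percentage-of-bandwidth threshold."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    # Integer math throughout so the result is exact and rounded down.
    return link_bps * threshold_percent // (100 * bits_per_frame)


GIGABIT = 1_000_000_000  # assumed leaf-port speed in bits per second

for percent in (85, 1):
    fps = frames_per_second(GIGABIT, percent)
    print(f"{percent:>3}% of 1 Gb/s = {fps:,} frames/s")
```

Under these assumptions an 85% threshold still lets through well over a
million small broadcast frames per second, so a loop could do a lot of
damage long before storm control would react.
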
I don't think that the "mac-move-limit" feature will help you, as the
MAC changes are all coming in on the same physical port. The switches
don't care about ARP MAC<->IP flapping; only the router cares about it.
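
To illustrate the point about MAC moves, here is a toy Python model of
a learning bridge that counts a "move" only when a known MAC is
re-learned on a different port. It is a sketch of the general idea, not
of how Junos implements mac-move-limit; the class name, port names and
MAC addresses are all made up for the example, and it assumes (as
above) that everything behind the VM host arrives on one leaf port.

```python
class LearningBridge:
    """Toy MAC table that counts moves (a MAC re-learned on another port)."""

    def __init__(self):
        self.table = {}   # MAC address -> port it was last seen on
        self.moves = 0

    def learn(self, mac, port):
        """Record a source MAC seen on a port; return True if it moved."""
        previous = self.table.get(mac)
        self.table[mac] = port
        moved = previous is not None and previous != port
        if moved:
            self.moves += 1
        return moved


# Hypothetical port name and MACs: both VMs sit behind one leaf port.
leaf = LearningBridge()
for _ in range(1000):
    leaf.learn("00:50:56:aa:aa:01", "ge-0/0/0")
    leaf.learn("00:50:56:aa:aa:02", "ge-0/0/0")
print("moves with everything on one port:", leaf.moves)

# A move is only counted once the same MAC shows up on another port.
leaf.learn("00:50:56:aa:aa:01", "ae0")
print("moves after one re-learn on a different port:", leaf.moves)
```

In this model, traffic that keeps arriving on the same port never
increments the move counter, no matter how much of it there is.
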
Good luck
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{