[j-nsp] LACP to NetApp

Crist Clark cjc+j-nsp at pumpky.net
Tue Feb 19 08:43:07 EST 2013


We have a mixed virtual chassis of two EX4500s and two EX4200s. They
are connected to two NetApp filers. Each filer has an LACP aggregate to
the VC consisting of two 10-Gig links to each of the 4500s (so four xe
interfaces per bundle). Once things are up and running, it works fine,
but things do not always come up cleanly after one of the filers does a
"hand back" or reboots.

The problem happens most times, but not every time. It happens with
both controllers. It does not happen to the same physical link in a
bundle each time, and it does not happen only with links associated
with one of the 4500 chassis. That seems to imply a software problem,
not a physical one.

The trouble is that one of the links in a bundle ends up stuck in the
"Defaulted" state, as seen in "show lacp interfaces" output. The
symptom seen by network users is that connectivity to specific machines
on a network is lost: for example, the host at 192.168.2.100 is
reachable, but 192.168.2.99 is not. I think this has to do with the
hashing used to choose a link in the LACP bundle. The address
combinations that hash to the "Defaulted" link are being lost, while
others work.
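
To illustrate why only some hosts would drop out, here is a minimal
sketch of per-flow link selection. Everything in it is an assumption
for illustration: the member names are hypothetical, and CRC32 over the
address pair is only a stand-in, since the real hash inputs and function
used by the switch and the filer are not known here.

```python
import zlib
from ipaddress import ip_address

# Hypothetical member links of one bundle (illustration only).
MEMBERS = ["xe-0/0/6", "xe-0/0/7", "xe-1/0/6", "xe-1/0/7"]


def pick_member(src, dst, members=MEMBERS):
    """Map a (src, dst) address pair to exactly one member link.

    CRC32 is a stand-in hash, not the algorithm any real device uses.
    """
    key = ip_address(src).packed + ip_address(dst).packed
    return members[zlib.crc32(key) % len(members)]


def affected_hosts(src, hosts, dead_link, members=MEMBERS):
    """Return the hosts whose traffic from src lands on dead_link."""
    return [h for h in hosts if pick_member(src, h, members) == dead_link]


if __name__ == "__main__":
    hosts = ["192.168.2.%d" % n for n in range(96, 104)]
    for link in MEMBERS:
        print(link, affected_hosts("192.168.1.10", hosts, link))
```

The point is only that the mapping is deterministic per address pair,
so a single bad member link blackholes a fixed subset of hosts while
their neighbors keep working.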

From the Juniper EX side, the problem looks like the system is not
receiving any LACPDUs on the affected link. The "show lacp statistics
interfaces" counters are not incrementing for "Rx" PDUs. However, we
have not been able to determine whether the problem is that the NetApp
is not sending PDUs, or the Juniper is not processing them.
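
A capture on the affected link would separate those two cases. As a
rough sketch of what to look for: LACPDUs are sent to the Slow Protocols
multicast address 01:80:c2:00:00:02 with EtherType 0x8809 and subtype 1,
and the actor's state byte carries the Defaulted and Expired flags. The
function below is a hypothetical helper, not part of any tool; it
assumes an untagged raw Ethernet frame and decodes only the actor TLV.

```python
import struct

SLOW_PROTOCOLS_MAC = bytes.fromhex("0180c2000002")
SLOW_PROTOCOLS_ETHERTYPE = 0x8809
LACP_SUBTYPE = 0x01
ACTOR_TLV_TYPE = 0x01
ACTOR_TLV_LEN = 20

# Actor/partner state bits, least significant bit first.
STATE_BITS = ["Activity", "Timeout", "Aggregation", "Synchronization",
              "Collecting", "Distributing", "Defaulted", "Expired"]


def _mac(raw):
    return ":".join("%02x" % b for b in raw)


def parse_lacpdu(frame):
    """Return actor details if frame is an untagged LACPDU, else None."""
    if len(frame) < 33:  # Ethernet header plus actor TLV up to state byte
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if frame[0:6] != SLOW_PROTOCOLS_MAC or ethertype != SLOW_PROTOCOLS_ETHERTYPE:
        return None
    subtype, _version, tlv_type, tlv_len = frame[14:18]
    if subtype != LACP_SUBTYPE:
        return None  # some other slow protocol
    if tlv_type != ACTOR_TLV_TYPE or tlv_len != ACTOR_TLV_LEN:
        return None  # malformed or unexpected layout
    (port,) = struct.unpack("!H", frame[30:32])
    state = frame[32]
    return {
        "source_mac": _mac(frame[6:12]),
        "actor_system": _mac(frame[20:26]),
        "actor_port": port,
        "actor_flags": [n for bit, n in enumerate(STATE_BITS)
                        if state & (1 << bit)],
    }
```

If frames like this from the filer's MAC show up on the wire while the
switch's Rx counter stays flat, the switch is the one not processing
them; if none appear, the filer is not sending.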

Recovery from the condition is easy. On the switch side, we manually
bring the interface that is in the Defaulted state down and back up:

  # ifconfig xe-0/0/6 down
  # ifconfig xe-0/0/6 up

After that, LACP happily completes its negotiation.

We have been trying to work with JTAC and NetApp support. The problem
has been finding downtime to reboot the filers.

Both Juniper and NetApp have said they have seen issues like this, but
that they were resolved by specifying the following settings for the
aggregate interface on the switch side:

    aggregated-ether-options {
        lacp {
            active;
            periodic slow;
        }
    }

This makes the EX switch match the NetApp's defaults (defaults that
cannot be changed on their side). But this did not solve the problem
for us.

Has anyone here seen LACP problems with NetApp or other vendors? The
plan, if we ever get the chance to do some troubleshooting, is to do
analyzer captures to see what's happening with the LACPDUs. In the
meantime, we are also trying to think of a reliable way to automate
the reset of interfaces in the bundles if they fall into the
"Defaulted" state.
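
One possible starting point for that automation, as a hedged sketch:
parse the text of "show lacp interfaces", pick out members whose receive
state is Defaulted, and build the same down/up commands shown earlier.
The sample output is hand-written to approximate the command's layout,
not captured from a real device, and the function names are
hypothetical. How the output is collected and how the commands get run
on the switch is left open.

```python
import re

# Physical interface names such as xe-0/0/6 (assumed naming pattern).
IFACE_RE = re.compile(r"^(?:xe|ge|et)-\d+/\d+/\d+$")


def defaulted_members(show_lacp_output):
    """Return member interfaces whose LACP receive state is Defaulted."""
    found = []
    for line in show_lacp_output.splitlines():
        fields = line.split()
        # Protocol rows look like: <ifname> <receive state> <transmit> <mux>
        if (len(fields) >= 2 and IFACE_RE.match(fields[0])
                and fields[1] == "Defaulted"):
            found.append(fields[0])
    return found


def reset_commands(interfaces):
    """Build the shell commands from the manual recovery procedure."""
    cmds = []
    for ifname in interfaces:
        cmds.append("ifconfig %s down" % ifname)
        cmds.append("ifconfig %s up" % ifname)
    return cmds


# Hand-written sample approximating the CLI layout (an assumption).
SAMPLE = """\
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      xe-0/0/6       Actor    No   Yes    No   No   No   Yes     Slow    Active
      xe-0/0/6     Partner    No   Yes    No   No   No   Yes     Fast   Passive
      xe-1/0/6       Actor    No    No   Yes  Yes  Yes   Yes     Slow    Active
      xe-1/0/6     Partner    No    No   Yes  Yes  Yes   Yes     Slow    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      xe-0/0/6                Defaulted   Slow periodic           Detached
      xe-1/0/6                  Current   Slow periodic Collecting distributing
"""

if __name__ == "__main__":
    for cmd in reset_commands(defaulted_members(SAMPLE)):
        print(cmd)
```

A real version would want some damping, for example only resetting a
link that has stayed Defaulted across several consecutive polls, so
that it does not flap a link that is still negotiating.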

