[c-nsp] Weird MPLS/VPLS issue

Tue Nov 8 13:24:22 EST 2016

On 4 November 2016 at 15:40, Simon Lockhart <simon at slimey.org> wrote:
> All,
>
> Having banged my head against a brick wall all day today trying to work out
> what's going on, and not having got anywhere, I thought I'd ask this list for
> some suggestions...
>
> I've got a Cisco MPLS core network, with Extreme boxes running as VPLS
> endpoints. Over the last couple of days I've tried turning up additional
> capacity between two core nodes, and each time I try, I end up with packet
> loss over VPLS links (either full or partial loss), but only on a subset of
> VPLS instances.
>
> Simplified network diagram:
>
>    +----------+
>    |  vpls-m  |
>    |          |
>    +---+--+---+
>        |  |
>        |  | 2 x 10G LAG
>        |  |
>    +---+--+---+       +----------+
>    |  core-m  |  10G  |   sw-m   |
>    |          +-------+          |
>    |          |       +-----+----+
>    +--+-+-+---+             |
>       | | |                 |
>       | | | 3x10G           | 100G VLAN Trunk
>       | | | ECMP            |
>       | | |                 |
>    +--+-+-+---+       +----------+
>    |  core-l  |  10G  |   sw-l   |
>    |          +-------+          |
>    |          |       +----------+
>    +---+--+---+
>        |  |
>        |  | 2 x 10G LAG
>        |  |
>    +---+--+---+
>    |  vpls-l  |
>    |          |
>    +----------+
>
> vpls-m and vpls-l are Extreme X670-G2's (running EXOS 16.1.3.6)
> core-m and core-l are Cisco 6500's with Sup2T (running IOS 15.2(1)SY2)
> sw-m and sw-l are Cisco Nexus 92160YC's (running NXOS 7.0(3)I4(3))
>
> The three existing 10G links directly between core-m and core-l are live now,
> over carrier 10G EoMPLS links.
>
> Typical config for the 10G link is:
>
> interface TenGigabitEthernet1/1
>  description to core-l:Te1/2
>  mtu 9000
>  ip address xx.yy.zz.234 255.255.255.252
>  ip pim sparse-mode
>  logging event link-status
>  load-interval 30
>  ipv6 enable
>  mpls traffic-eng tunnels
>  mpls ip
>  ipv6 ospf 1 area 0.0.0.0
>  hold-queue 4096 in
> end
>
> The new 10G link I'm trying to add is going via sw-m and sw-l, over a 100G
> wavelength from a carrier. All the ports on sw-m and sw-l have an MTU of 9216
> configured, with the port facing core-* as a "switchport access" port, and the
> 100G link configured as a "switchport trunk".
>
> Config on the core-* ports towards the sw-*'s is the same as above (except I'm
> using /31 for the IPv4 addresses). IPv4 and IPv6 reachability is fine. OSPF,
> OSPFv3 and PIM come up over the link. As soon as I configure "mpls ip", I start
> getting the packet loss over some VPLS links. Remove "mpls ip", and the packet
> loss goes away.
>
> To me, everything *looks* right, it's just that some VPLS traffic traversing
> the new link gets lost.
>
> Anyone got any suggestions on what I should look for whilst troubleshooting
> this? Unfortunately, due to the impact to traffic, I have to make any changes
> within a maintenance window, but I've run out of ideas of things to try or look
> for.
>
> Many thanks,
>
> Simon

Hi Simon,

When you enable 'mpls ip' on this new ECMP path (have I understood
your topology correctly?), have you been able to check which of your
VPLS instances and packets are hashed onto which member links? I'm
wondering whether VPLS instance A hashes onto links 3 & 4 (4 being the
new one), which would explain why that instance is affected and not
VPLS instance B, which hashes to links 1 & 3, since you stated:

> As soon as I configure "mpls ip", I start
> getting the packet loss over some VPLS links.

Also, have you done some load testing of your new ECMP path? Since you
are using LDP here, can you push the OSPF cost right up, enable
'mpls ip', static route some iPerf traffic over the link, and check
for errors on the interface/SFP/cables/patches/fibres/muxes etc.? If
you're not 100% certain it's not a physical issue, have you also tried
other ports, SFPs etc.?

Also, can you detect traffic loss if you pin up a static MPLS-TE tunnel
over that link only and iPerf some traffic over the tunnel? If the
OSPF adjacency is stable and the issue only occurs when you enable
"mpls ip", then is it just an issue with LDP transport labels, or with
all MPLS traffic?

Can you run an ELAM capture and try to capture the dropped packets?

If you are sure it's not a layer 1 issue: are you using control-words
on the VPLS pseudowires? If not, can you try turning them on?
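
For context on why the control word can matter with ECMP: some LSRs
guess the payload type from the first nibble after the label stack and
treat 4 or 6 as IPv4/IPv6 when choosing what to hash on. Whether the
Sup2T does this in your setup is an assumption on my part, not
something I have confirmed. A minimal Python sketch of that heuristic
(the function names and sample MAC addresses are made up for
illustration):

```python
def classify_payload(mpls_payload: bytes) -> str:
    """Guess the payload type from the first nibble after the label stack."""
    if not mpls_payload:
        return "empty"
    first_nibble = mpls_payload[0] >> 4
    if first_nibble == 4:
        return "ipv4"
    if first_nibble == 6:
        return "ipv6"
    return "non-ip"


def pseudowire_payload(ethernet_frame: bytes, control_word: bool) -> bytes:
    """Prepend an all-zero 4-byte control word (first nibble 0) if enabled."""
    return (b"\x00\x00\x00\x00" if control_word else b"") + ethernet_frame


# Hypothetical frames: only the first byte of the destination MAC matters.
frame_a = bytes.fromhex("001122334455") + bytes(58)  # DMAC 00:11:22:33:44:55
frame_b = bytes.fromhex("4c0011223344") + bytes(58)  # DMAC 4c:00:11:22:33:44

for name, frame in (("A", frame_a), ("B", frame_b)):
    for cw in (False, True):
        label = "control-word" if cw else "no control-word"
        print(name, label, classify_payload(pseudowire_payload(frame, cw)))
```

Without a control word, frame B is classified as IPv4 purely because
its destination MAC starts with 0x4; with the control word the first
nibble is 0 and both frames are classified as non-IP.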

I would really start by trying to bottom out which VPLS instances are
being affected; that might start you on the track as to why (are they
only the ones hashed to this new link? can you change your hashing? if
you are using the control-word, can you try disabling it and bringing
the new ECMP path online?).
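
Once you have noted which links each instance hashes onto, narrowing
down the suspect link is just set arithmetic. A rough Python sketch,
assuming you have collected that mapping by hand (the instance and
link names below are hypothetical, not taken from your network):

```python
def suspect_links(instance_links: dict[str, set[str]],
                  lossy: set[str]) -> set[str]:
    """Return links used by every lossy instance and by no healthy one."""
    affected = [links for name, links in instance_links.items()
                if name in lossy]
    healthy = [links for name, links in instance_links.items()
               if name not in lossy]
    if not affected:
        return set()
    # Links common to all lossy instances ...
    common = set.intersection(*affected)
    # ... minus any link that a healthy instance also uses.
    for links in healthy:
        common -= links
    return common


# Hypothetical observations: link4 is the newly added path.
observed = {
    "vpls-A": {"link3", "link4"},
    "vpls-B": {"link1", "link3"},
    "vpls-C": {"link2", "link4"},
}
print(suspect_links(observed, lossy={"vpls-A", "vpls-C"}))
```

If the result is only the new link, that points at something specific
to that path rather than at the VPLS instances themselves.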

Cheers,
James.