[c-nsp] endless routing loop in a L3 MPLS VPN

Thu Mar 27 16:47:25 EDT 2014

Hi list,

here is something I would like to hear your opinion about:

(tl;dr: a routing loop in a Layer 3 MPLS VPN isn't broken by the TTL
mechanism, which fails when a PE pops the aggregate label, looks up
the destination IP in CEF and pushes a new label)

A Layer 3 MPLS VPN as simple as:
CE1 <-> PE1 <-> P <-> PE2 <-> CE2 (unreachable)

and "no mpls ip propagate-ttl forwarded" configured on both PEs.

CE <--> PE routing is done with static routes.
A static default route points from PE1 towards CE1:
ip route vrf BLUE 0.0.0.0 0.0.0.0 172.16.0.2

A specific route points from PE2 to CE2:
ip route vrf BLUE 10.0.0.0 255.255.255.0 172.16.0.6

The PEs exchange routes via MP-BGP. Straightforward.

Now, the CE2-facing interface on PE2 goes down, and because PE2's
static route didn't name the outgoing interface (as it should have):
ip route vrf BLUE 10.0.0.0 255.255.255.0 GigabitEthernet0/1 172.16.0.6

... it does a recursive lookup of 172.16.0.6 via the default route,
creates a label for 10.0.0.0/24 and announces the prefix in MP-BGP:

PE2#show mpls forwarding-table labels 215 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
215        No Label   10.0.0.0/24[V]   \
                                       0             Tu101      point2point
        MAC/Encaps=14/22, MRU=1896, Label Stack{36 53}, via Gi0/2
        001F9D8E84180021D82053198847 0002400000035000
        VPN route: BLUE
        No output feature configured
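
To illustrate the recursive resolution described above, here is a small
Python sketch of a longest-prefix match over VRF BLUE on PE2. The /30 on
the CE2-facing link and the route descriptions are my assumptions for
illustration, not output from the router:

```python
import ipaddress


def longest_match(rib, addr):
    """Return the most specific prefix in rib that covers addr, or None."""
    ip = ipaddress.ip_address(addr)
    covering = [p for p in rib if ip in ipaddress.ip_network(p)]
    return max(covering,
               key=lambda p: ipaddress.ip_network(p).prefixlen,
               default=None)


# Hypothetical view of VRF BLUE on PE2 while the CE2 link is up.
rib_link_up = {
    "172.16.0.4/30": "connected, CE2-facing interface (assumed /30)",
    "0.0.0.0/0": "default route learned from PE1 via MP-BGP",
}

# Link down: the connected prefix disappears, the default route stays.
rib_link_down = {
    "0.0.0.0/0": "default route learned from PE1 via MP-BGP",
}

print(longest_match(rib_link_up, "172.16.0.6"))    # 172.16.0.4/30
print(longest_match(rib_link_down, "172.16.0.6"))  # 0.0.0.0/0
```

With the link down, the next hop of the static route still resolves, just
via the default route, so the static route stays installed and keeps
being announced.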

That leads to a routing loop, which should be fine, because either the IP
or the MPLS TTL should be decremented by the PEs, and when the TTL reaches
zero the packet would be dropped on one of the boxes.

Doesn't happen in my case.

The bottom label of the stack is 53, which is the aggregate VRF label
on PE1 (we don't use per-VRF label allocation; it's the aggregate label
a PE creates for its local IP addresses, I believe):

PE1#show mpls forwarding-table labels 53 detail
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
53         Pop Label  IPv4 VRF[V]      1640389355500 aggregate/BLUE
        MAC/Encaps=0/0, MRU=0, Label Stack{}
        VPN route: BLUE
        No output feature configured

When PE1 pops the bottom label 53 from the packet, it also drops the MPLS
TTL with it. PE1 then does an IP/CEF lookup, which points again to PE2's
215 label and then pushes a new MPLS label with TTL 255. It doesn't
decrement IP TTL.

So the MPLS TTL is reset to 255 every time the packet crosses PE1, because
all labels are popped and only the subsequent dst-IP/CEF lookup makes the
PE push another label (starting again at TTL 255).

IP TTL is never decremented.
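
To make the effect concrete, here is a toy model of the loop in Python.
It is only a sketch of the behaviour described above: the number of
label-switched hops per round and the function and constant names are
made up, and nothing in it was taken from the routers themselves.

```python
# Toy model only: the hop count per round is an assumption.
FRESH_MPLS_TTL = 255      # TTL written into a newly imposed label
LSR_HOPS_PER_ROUND = 4    # hypothetical label-switched hops per loop


def loop_packet(ip_ttl, rounds, pe1_decrements_ip_ttl=False):
    """Send one packet around the PE1 -> PE2 -> PE1 loop.

    Returns (final_ip_ttl, round_in_which_dropped or None).
    """
    for rnd in range(1, rounds + 1):
        # PE1: plain IP/CEF lookup in the VRF.
        if pe1_decrements_ip_ttl:
            ip_ttl -= 1
            if ip_ttl <= 0:
                return ip_ttl, rnd
        # PE1 imposes a new label stack; the MPLS TTL starts over.
        mpls_ttl = FRESH_MPLS_TTL
        # Hops on the way to PE2 and back decrement the MPLS TTL.
        mpls_ttl -= LSR_HOPS_PER_ROUND
        if mpls_ttl <= 0:
            return ip_ttl, rnd
        # Back at PE1 the aggregate label is popped; mpls_ttl is
        # discarded here and never copied into the IP header.
    return ip_ttl, None


print(loop_packet(64, 100000))                              # (64, None)
print(loop_packet(64, 100000, pe1_decrements_ip_ttl=True))  # (0, 64)
```

In the first call, which models what I see, the packet is still alive
after 100000 rounds with its IP TTL untouched. In the second call, where
PE1 decrements the IP TTL on every IP lookup, the packet is dropped in
round 64.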

This leads to an endless loop if I inject a single packet towards
10.0.0.0/24.

While I do understand that the route on PE2 is wrong and dangerous, and
that we can simply fix it by using an interface + next-hop route, I'm
concerned that the TTL mechanism fails with such a simple route
misconfiguration.

Is PE1's behavior really correct here?

Finding the prefix causing the routing loop is neither relaxing nor fun,
especially when the box is CPU-based (like our PE2) and spends 99% of its
CPU in interrupts forwarding the bogus traffic.

Appreciate any comments.

Thanks,

Lukas