[c-nsp] 6500 SXI9 broken MPLS L3VPN with per-prefix label allocation

Tue Mar 12 13:45:24 EDT 2013

Hello everyone,

I have quite a weird problem that I cannot wrap my head around. I think
it's an annoying bug, but I'm not sure.

We are currently experimenting with MPLS in our network. The first use
will be L3VPN, to get rid of some multi-step PBR: when our clients with
RFC1918 addresses want to reach the Internet, they are redirected through
the NAT cluster, which is not at the same location as the transit.

For this we have the following "test" setup in the live network:

                                  Router R1
                                   |
Client --- PE 1 ----- P....P ----- PE 2 ---- NAT-Cluster
        NX 6.1(2)    NX 6.1(2)   VSS1440
                     IOS SXJ*    IOS SXI9

The VRF only carries a default route pointing towards the NAT-Cluster,
via a global SVI and thus a global next-hop:

---
vrf definition SECOMAT
 rd 129.187.0.9:9000
 !
 address-family ipv4
  route-target export 12816:9000
  route-target import 12816:9000
 exit-address-family
!
!
router bgp 12816
 !
 address-family ipv4 vrf SECOMAT
  redistribute static
  no synchronization
  network 0.0.0.0
 exit-address-family
!
ip route vrf SECOMAT 0.0.0.0 0.0.0.0 Vlan1644 138.246.99.33
---

vss1-2wr#sh ip route vrf SECOMAT

Routing Table: SECOMAT
[...]
Gateway of last resort is 138.246.99.33 to network 0.0.0.0

S*   0.0.0.0/0 [1/0] via 138.246.99.33, Vlan1644

At PE 1, traffic from RFC1918 sources to destinations outside our own
address space is supposed to be PBRed into the VRF. At the moment this is
a very simple route-map:
route-map PRIVATE_TO_SECOMAT permit 10
  set vrf SECOMAT
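
To make the intended classification concrete, here is a rough Python
sketch of the decision the PBR is supposed to express. It is only an
illustration, not what the box does internally. OWN_PREFIXES, in_any and
select_table are made-up names, and the /16 prefix lengths are an
assumption, not taken from the config.

```python
import ipaddress

# The three RFC1918 private ranges.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# HYPOTHETICAL list of "our" destination prefixes; the lengths are assumed.
OWN_PREFIXES = [ipaddress.ip_network(n) for n in
                ("129.187.0.0/16", "138.246.0.0/16")]


def in_any(addr, networks):
    """True if addr falls inside at least one of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)


def select_table(src, dst):
    """Pick the routing table for a packet: VRF SECOMAT or the global table."""
    if in_any(src, RFC1918) and not in_any(dst, OWN_PREFIXES):
        return "SECOMAT"   # private source, external destination: go via NAT
    return "global"        # everything else stays in the global table
```

With these assumptions, select_table("10.155.0.1", "83.170.0.1") picks
SECOMAT, while the same client talking to 129.187.0.142 stays global.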

As far as I can tell this works quite well: a trace from the client
follows the normal path to PE 2 (not to the internet transit, which was
the whole point), but then it gets ugly:

traceroute to 83.170.0.1 (83.170.0.1), 30 hops max, 60 byte packets
 1  10.155.0.254 (10.155.0.254)  0.320 ms  0.446 ms  0.587 ms
 2  * * * <--- this is a NX-OS device which does not answer
 3  vl-3004.csr1-0gz.lrz.de (129.187.0.142)  1.189 ms  1.279 ms  1.318 ms
 4  * * * <--- this is a NX-OS device which does not answer
 5  * * * <--- this is the egress PE
 6  vl-3016.csr1-2wr.lrz.de (129.187.0.253)  0.790 ms  0.934 ms  0.937

Hop 6 is the upstream router of the PE 2, so at this point the traffic
is in the global routing table.

I was pretty sure this is a configuration error, but now I don't think
it is. Observe:

Ingress PE:

0.0.0.0/0, ubest/mbest: 1/0
    *via 129.187.0.9%default, [200/0], 00:16:54, bgp-12816, internal, tag 12816 (mpls-vpn)
         MPLS[0]: Label=875 E=0 TTL=0 S=0 (VPN)
         client-specific data: 4f59d   
         recursive next hop: 129.187.0.9/32%default
         extended route information: BGP origin AS 12816 BGP peer AS

Egress PE:

vss1-2wr#sh mpls forwarding-table labels 875
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop
Label      Label      or Tunnel Id     Switched      interface
875        No Label   0.0.0.0/0[V]     845000        Vl1644     138.246.99.33

So we have a per-prefix label, with the right egress interface and
next-hop. The hardware entry looks different, though:

vss1-2wr#sh mls cef mpls labels 875

Codes: + - Push label, - - Pop Label         * - Swap Label, E - exp1
Index  Local            Label                  Out i/f
       Label             Op
8009   875 (EOS)        (-)                    recirc

Okay, I think this is the problem. If label 875 (which thanks to PHP is
the only label) is popped, the packet is untagged. Recirculation means
lookup in the global routing table, so it gets sent out to the upstream
router.
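
A toy Python model of what I suspect is happening, just to spell out the
reasoning. This is not how the PFC works internally. The "uplink"
interface name and the global default route via 129.187.0.253 are
assumptions (that address is simply hop 6 from the trace), and lpm and
forward_popped are made-up names.

```python
import ipaddress

# ASSUMED global table on PE 2: default route towards the upstream router.
GLOBAL_FIB = {"0.0.0.0/0": ("uplink", "129.187.0.253")}

# Hardware label entries for local label 875, as seen in the two outputs.
ENTRY_RECIRC = {"label": 875, "op": "pop", "out": "recirc"}
ENTRY_DIRECT = {"label": 875, "op": "pop", "out": ("Vl1644", "138.246.99.33")}


def lpm(fib, dst):
    """Longest-prefix match over a dict of prefix -> (interface, next-hop)."""
    ip = ipaddress.ip_address(dst)
    best = None
    for prefix, result in fib.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, result)
    return best[1] if best else None


def forward_popped(entry, dst):
    """Forward a packet after its only label has been popped."""
    if entry["out"] == "recirc":
        # Suspected behaviour: the now untagged packet goes through a
        # second lookup, and that lookup uses the global table.
        return lpm(GLOBAL_FIB, dst)
    # Expected behaviour: the label entry already names interface and next-hop.
    return entry["out"]
```

In this model the recirc entry sends 83.170.0.1 out of the uplink, which
matches the trace, while the direct entry would hand it to 138.246.99.33.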

In every other L3VPN setup I have used, the out i/f is set correctly.
Incidentally, when I point the route at another interface, it works here
as well:

vss1-2wr#sh mls cef mpls labels 875

Codes: + - Push label, - - Pop Label         * - Swap Label, E - exp1
Index  Local            Label                  Out i/f
       Label             Op
8009   875 (EOS)        (-)                    Vl60          , 0050.568f.0167

As far as I can tell, Vl60 is not so different from Vlan1644: both are in
the GRT, and the next-hop is directly connected, in the ARP table and
pingable. I have tried several next-hops in Vlan1644 and all of them lead
to recirc. The one special thing about Vlan1644 is that one next-hop (.43)
has a static ARP entry pointing to a multicast MAC, and that multicast MAC
is sent to a fixed set of ports (CLUSTERIP netfilter extension, similar to
Microsoft NLB). But I tried normal unicast next-hops as well (e.g. .33 as
above).

I have found a workaround, which is the hidden and undocumented

mpls label mode vrf SECOMAT protocol bgp-vpnv4 per-vrf

which leads to 

0.0.0.0/0, ubest/mbest: 1/0
    *via 129.187.0.9%default, [200/0], 00:00:15, bgp-12816, internal, tag 12816 (mpls-vpn)
         MPLS[0]: Label=1253 E=0 TTL=0 S=0 (VPN)
         client-specific data: 4f59d   
         recursive next hop: 129.187.0.9/32%default
         extended route information: BGP origin AS 12816 BGP peer AS 12816

vss1-2wr#sh mpls forwarding-table labels 1253
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
1253       Pop Label  IPv4 VRF[V]      590186        aggregate/SECOMAT 

and everything works as planned.
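
For comparison, a small Python sketch of how I understand the per-VRF
(aggregate) label: the label only identifies the VRF, it is popped, and
the IP lookup is then done in that VRF's table. Again a toy model; the
table contents are copied from the outputs above, and the function names
are made up.

```python
import ipaddress

# VRF table on PE 2, taken from "sh ip route vrf SECOMAT".
VRF_FIB = {"SECOMAT": {"0.0.0.0/0": ("Vl1644", "138.246.99.33")}}

# Aggregate label -> VRF, taken from "sh mpls forwarding-table labels 1253".
AGGREGATE_LABELS = {1253: "SECOMAT"}


def lpm(fib, dst):
    """Longest-prefix match over a dict of prefix -> (interface, next-hop)."""
    ip = ipaddress.ip_address(dst)
    best = None
    for prefix, result in fib.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, result)
    return best[1] if best else None


def forward_aggregate(label, dst):
    """Pop the aggregate label, then route the IP packet inside its VRF."""
    vrf = AGGREGATE_LABELS[label]
    return lpm(VRF_FIB[vrf], dst)
```

Here the second lookup is intended and happens in the VRF, so the packet
ends up on Vl1644 towards 138.246.99.33 instead of in the global table.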

Has anyone ever observed something like this?

Bernhard