[c-nsp] SUP720 blackholed MPLS traffic
Ross Halliday
ross.halliday at wtccommunications.ca
Sun Mar 30 14:02:01 EDT 2014
I can tell from the overwhelming response that this is a common problem ;)
Over the past couple of weeks I've been trying to wrap my head around this and I figured out what occurred. Not why, of course...
A few seconds after I enabled GLBP, 4 of the 10 MPLS interfaces basically croaked. BFD and ISIS stayed up, but IP and MPLS forwarding halted. One of these interfaces (I'll call it Link A) is part of a pair of links to another core site (which explains why I lost half of the LSPs between the two sites), two of the interfaces (Links B and C) feed a route reflector, and the last interface (Link D) is a backup path for a wireless tower.
I first noticed Link A was not responding to pings from the far end, although everything looked up. A shut/noshut had no effect.
I got into the switch and shut down the SVI I had put GLBP on. BFD/ISIS then finally stopped working (or started working, depending on your perspective) and the four affected links were properly marked as offline by all peers. A shut/no shut of Links A, B, and C brought them back into service. Link D is still in this broken state, as I'm not sure what tests to run against it. ARP seems to be functional, but no IP or CLNS, so no ISIS, no LDP, no BFD, etc.
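In case anyone wants to suggest more, here's roughly what I plan to poke at on Link D while it's still broken. This is just a sketch of stock IOS show commands (availability and output vary by release; Gi4/23 is the interface from the list below):

  show ip interface GigabitEthernet4/23
  show cef interface GigabitEthernet4/23
  show adjacency GigabitEthernet4/23 detail
  show clns neighbors
  show mpls interfaces
  show bfd neighbors details

The hope is that comparing the software CEF/adjacency view against what the peers see will show whether the punt/forwarding path is wedged even though ARP still works.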
I cannot find anything in common with these four interfaces:
- Link A is on Gi6/1 (SUP720-3B) to another SUP720-3B running 12.2(33)SXI4a, MTU 9216 with a /30
- Link B is on Gi2/13 (WS-X6148A-GE-TX) to a 7206 VXR NPE-400 running 12.4(24)T4, MTU 1530 with a /31
- Link C is on Gi9/13 and same as above
- Link D is on Gi4/23 (WS-X6724-SFP) to a 7301 running 15.2(4)S2, MTU 9216 with a /31
Gi5/1 is identical to Link A and was not affected. Subnets are not contiguous, BFD timers are shared with some interfaces and differ from others, ISIS passwords aren't unique to these four, and PIM configuration isn't unique either.
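When I get a window I also intend to diff the hardware forwarding state between the healthy Gi5/1 and the formerly-broken Gi6/1. Again just a sketch of generic SUP720 commands, nothing here is confirmed to expose the fault, and the exact-route source/destination addresses are placeholders:

  show adjacency GigabitEthernet5/1 detail
  show adjacency GigabitEthernet6/1 detail
  show mls cef exact-route <source-ip> <dest-ip>
  show mls cef

If the TCAM entry or adjacency for the broken link differs from its healthy twin, that would at least point at a hardware-programming problem rather than a control-plane one.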
Cool huh?
Ross
> -----Original Message-----
> From: cisco-nsp [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of
> Ross Halliday
> Sent: Monday, March 10, 2014 6:35 AM
> To: cisco-nsp at puck.nether.net
> Subject: [c-nsp] SUP720 blackholed MPLS traffic
>
> Hi list,
>
> I ran into an odd one this morning that's got me stumped. At one of our
> POPs we're putting in a second PE. For better or for worse, we love 6500s.
> Each POP has a quasi-out-of-band management network that we connect
> directly to the global routing table (everything else is in VRFs,
> including management for customer gear, etc).
>
> So on the current PE I pasted some config for GLBP onto an existing SVI
> right in the global routing table. Very simple stuff:
>
> dampening
> ip address 9.9.9.9 255.255.255.224
> no ip redirects
> no ip proxy-arp
> ip verify unicast source reachable-via rx allow-default
> glbp 9 ip 9.9.9.9
> glbp 9 timers msec 100 msec 400
> glbp 9 preempt delay minimum 10
> glbp 9 authentication md5 key-string 9MALL
> glbp 9 name Management
>
> On the "n" in "name" of the last line I lost the box. After recovery,
> which as far as I can tell happened on its own, I discovered that BFD only
> bounced once. A few minutes later this box dropped LDP to a few peers. A
> few MORE minutes later ISIS dropped some neighbors on its own (no BFD in
> log message) and everything started working again. I was able to access
> the switch via a direct interface and found no trace of typical high CPU.
> The total outage lasted for just over 10 minutes.
>
> This switch is loaded with SUP720-3Bs running 12.2(33)SXI4a. Our network
> uses ISIS and LDP across the core supported by BFD, with VRFs in BGP to a
> pair of route reflectors. We are running multicast in the form of MVPNs
> including extranets with PIM.
>
> The typical "core" interface looks like this:
>
> interface GigabitEthernet5/1
> description 6509 to some other 6509
> dampening
> mtu 9216
> ip address 9.9.9.9 255.255.255.252
> ip pim query-interval 300 msec
> ip pim sparse-mode
> ip router isis MALL
> logging event link-status
> load-interval 30
> mpls traffic-eng tunnels
> mpls ip
> bfd interval 100 min_rx 100 multiplier 5
> isis password MALL9
>
> I've lost a box to high CPU dropping BFD before, but this felt
> different: there was no high utilization during the outage on this
> switch (average is around 9-12%), and other switches didn't appear to
> notice any problem. Because ISIS still looked fine to them, all active
> LSPs that traversed the problem box kept forwarding into it and were
> blackholed.
>
> Does this sound like a bug to anybody here? I checked over the GLBP docs
> (http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-
> switches/81565-glbp-cat65k.html) and I see that I may have fouled up by
> putting "glbp ip" before the rest of the commands, but the doc doesn't
> mention side effects. I also don't see any caveats in recent 12.2(33)
> release notes that look close to what I just experienced.
>
> I do have an upgrade to 15.1(2)SY planned for the near future, but any
> input or theories would be greatly appreciated.
>
> Thanks!
> Ross
>
> _______________________________________________
> cisco-nsp mailing list cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/