[c-nsp] SUP720 blackholed MPLS traffic

Mon Mar 10 06:35:02 EDT 2014

Hi list,

I ran into an odd one this morning that's got me stumped. At one of our POPs we're putting in a second PE. For better or for worse, we love 6500s. Each POP has a quasi-out-of-band management network that we connect directly to the global routing table (everything else is in VRFs, including management for customer gear, etc.).

So on the current PE I pasted some config for GLBP onto an existing SVI right in the global routing table. Very simple stuff:

 dampening
 ip address 9.9.9.9 255.255.255.224
 no ip redirects
 no ip proxy-arp
 ip verify unicast source reachable-via rx allow-default
 glbp 9 ip 9.9.9.9
 glbp 9 timers msec 100 msec 400
 glbp 9 preempt delay minimum 10
 glbp 9 authentication md5 key-string 9MALL
 glbp 9 name Management

On the "n" in "name" of the last line I lost the box. After it recovered, which as far as I can tell happened on its own, I found that BFD had bounced only once. A few minutes later this box dropped LDP to a few peers. A few MORE minutes later ISIS dropped some neighbors on its own (no BFD in the log message) and everything started working again. I was able to access the switch via a direct interface and found no sign of the high CPU I would normally expect. The total outage lasted just over 10 minutes.

This switch is loaded with SUP720-3Bs running 12.2(33)SXI4a. Our network uses ISIS and LDP across the core supported by BFD, with VRFs in BGP to a pair of route reflectors. We are running multicast in the form of MVPNs including extranets with PIM.

The typical "core" interface looks like this:

interface GigabitEthernet5/1
 description 6509 to some other 6509
 dampening
 mtu 9216
 ip address 9.9.9.9 255.255.255.252
 ip pim query-interval 300 msec
 ip pim sparse-mode
 ip router isis MALL
 logging event link-status
 load-interval 30
 mpls traffic-eng tunnels
 mpls ip
 bfd interval 100 min_rx 100 multiplier 5
 isis password MALL9

I've lost a box to high CPU dropping BFD before, but this felt different. There was no high utilization on this switch during the outage; the average is around 9-12%. Other switches didn't appear to notice any problem. As a result, ISIS appeared fine from their side, so all active LSPs that included the problem box stopped working.

Does this sound like a bug to anybody here? I checked over the GLBP docs (http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/81565-glbp-cat65k.html) and I see that I may have fouled up by putting "glbp ip" before the rest of the commands, but the doc doesn't mention side effects. I also don't see any caveats in recent 12.2(33) release notes that look close to what I just experienced.
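
Reordered so that "glbp ip" comes last, the GLBP lines from above would read as follows. This is an untested sketch using the same sanitized values, and whether the command ordering had anything to do with the outage is an assumption, not something the doc confirms:

 glbp 9 timers msec 100 msec 400
 glbp 9 preempt delay minimum 10
 glbp 9 authentication md5 key-string 9MALL
 glbp 9 name Management
 glbp 9 ip 9.9.9.9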

I do have an upgrade to 15.1(2)SY planned for the near future, but any input or theories would be greatly appreciated.

Thanks!
Ross