[c-nsp] SUP720 blackholed MPLS traffic
Ross Halliday
ross.halliday at wtccommunications.ca
Mon Mar 10 06:35:02 EDT 2014
Hi list,
I ran into an odd one this morning that's got me stumped. At one of our POPs we're putting in a second PE. For better or for worse, we love 6500s. Each POP has a quasi-out-of-band management network that we connect directly to the global routing table (everything else is in VRFs, including management for customer gear, etc).
So on the current PE I pasted some config for GLBP onto an existing SVI right in the global routing table. Very simple stuff:
dampening
ip address 9.9.9.9 255.255.255.224
no ip redirects
no ip proxy-arp
ip verify unicast source reachable-via rx allow-default
glbp 9 ip 9.9.9.9
glbp 9 timers msec 100 msec 400
glbp 9 preempt delay minimum 10
glbp 9 authentication md5 key-string 9MALL
glbp 9 name Management
On the "n" in "name" of the last line I lost the box. After recovery, which as far as I can tell happened on its own, I discovered that BFD only bounced once. A few minutes later this box dropped LDP to a few peers. A few MORE minutes later ISIS dropped some neighbors on its own (no BFD in log message) and everything started working again. I was able to access the switch via a direct interface and found no trace of typical high CPU. The total outage lasted for just over 10 minutes.
This switch is loaded with SUP720-3Bs running 12.2(33)SXI4a. Our network uses ISIS and LDP across the core supported by BFD, with VRFs in BGP to a pair of route reflectors. We are running multicast in the form of MVPNs including extranets with PIM.
The typical "core" interface looks like this:
interface GigabitEthernet5/1
 description 6509 to some other 6509
 dampening
 mtu 9216
 ip address 9.9.9.9 255.255.255.252
 ip pim query-interval 300 msec
 ip pim sparse-mode
 ip router isis MALL
 logging event link-status
 load-interval 30
 mpls traffic-eng tunnels
 mpls ip
 bfd interval 100 min_rx 100 multiplier 5
 isis password MALL9
I've lost a box due to high CPU dropping BFD before but this felt different. No high utilization during the outage on this switch, average is around 9-12%. Other switches didn't appear to notice any problem: from their side ISIS looked fine, so they kept using every active LSP that included the problem box, and traffic on those LSPs stopped getting through.
Does this sound like a bug to anybody here? I checked over the GLBP docs (http://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/81565-glbp-cat65k.html) and I see that I may have fouled up by putting "glbp ip" before the rest of the commands, but the doc doesn't mention side effects. I also don't see any caveats in recent 12.2(33) release notes that look close to what I just experienced.
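For clarity, the ordering I take the doc to be suggesting would look something like the following. This is only an untested sketch using the same sanitized values as above, and it assumes the only thing that matters is enabling the virtual IP after the other group attributes:

! untested sketch: same commands as pasted earlier, with "glbp ip" moved to the end
glbp 9 timers msec 100 msec 400
glbp 9 preempt delay minimum 10
glbp 9 authentication md5 key-string 9MALL
glbp 9 name Management
glbp 9 ip 9.9.9.9

Whether the ordering has anything to do with the outage is exactly what I can't tell from the docs.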
I do have an upgrade to 15.1(2)SY planned for the near future, but any input or theories would be greatly appreciated.
Thanks!
Ross