[c-nsp] Cisco Cat6k Sup2T: 6904 line card has extreme overruns after a traffic spike
Jeroen van Ingen
jeroen at zijndomein.nl
Mon Nov 9 14:35:22 EST 2015
Hi everyone,
I'm hoping to find someone with in-depth knowledge of newer Cat6k
hardware, specifically the WS-X6904-40G line card. Looks like we're
running into a very interesting bug and I haven't found anything
matching it in the Bug Search Tool yet.
What we've seen twice now: after a short traffic spike (lots of clients
syncing an update from an FTP mirror in our network; peak is 4 Gbps @ 10
min avg, 7 Gbps @ 1 min avg), the router somehow refuses to forward
certain flows. Specific source/destination IP pairs that worked before
aren't processed anymore; say I1-I4 are internal to our network and
E1-E4 are external, then e.g. I1 can communicate with E1-E3 but not E4,
while E4 can communicate with I2-I4 but not I1. The router also drops
some OSPF adjacencies, some BGP neighbors and some PIM neighborships. No
LAGs/etherchannels are involved. An ELAM capture for a non-working
src/dst combination sees one direction of the flow correctly (rewrite is
okay, dest index okay, etc.), but when we set up the reverse ELAM
capture, the triggers never fire.
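For reference, the Sup2T capture sequence we used was roughly the
following (from memory; I've left a placeholder for the trigger match
because I don't recall the exact keyword names, they're in Cisco's
Sup2T ELAM technote):

  show platform capture elam asic eureka slot 2
  show platform capture elam trigger dbus ipv4 ingress if <src/dst IP of the broken flow>
  show platform capture elam start
  show platform capture elam status
  show platform capture elam data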
CPU load is normal, and the number of packets punted to the CPU (show
ibc & looking at a netdr capture) is normal. Input overruns, especially
for a 40 Gbps interface, are extremely high: e.g. 80 kpps of overruns
against 160 kpps of input, according to the counters (5 min avg).
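For completeness, those numbers come from commands like these (nothing
unusual in any of the output except the overrun counter):

  show ibc
  debug netdr capture rx
  show netdr captured-packets
  show interfaces Fo2/1 | include rate|overrun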
We've moved almost all IPv4 routing away by selectively shutting BGP and
OSPF sessions, so the box now only processes IPv6; even then, overruns
keep increasing: at 10 kpps input (about 100 Mbps) I still see 160 pps
of overruns on Fo2/1. At those traffic levels I don't believe the
overruns can be explained by oversubscription. Input discards stay
close to zero, so even at these low traffic levels it appears that
traffic is dropped in the hardware receive path before a lookup is even
done.
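To put numbers on that last point:

  160 pps overruns / 10,000 pps input   = 1.6% of packets dropped
  ~100 Mbps offered / 40 Gbps line rate = ~0.25% port utilisation

A 1.6% drop rate at a quarter of a percent of line rate is orders of
magnitude away from anything oversubscription could explain.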
We've been running IOS 15.1(1)SY3 on this box since it was released in
March 2014; today is the second time we've seen this happen. The first
time was ~3 weeks ago and could only be fixed by a reload (a last
resort after 7-8 hours of troubleshooting).
We'll be upgrading to 15.1(2)SY6 tomorrow morning, but since I haven't
found a bug ID that resembles what we've seen here, I'm not sure it
will stop this from recurring.
By the way: changing the "load-balance" config at interface level, i.e.
setting the Fo2/1 load-balance from the default "src-dst-ip" to "mpls"
or some other option, changes which hosts have connectivity.
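In config terms that's just the following (full keyword list from
memory, but "src-dst-ip" and "mpls" are definitely among the options):

  interface FortyGigabitEthernet2/1
   load-balance mpls
  ! default: load-balance src-dst-ip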
Looking at the architecture
(http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/white_paper_c11-696669.html):
given that ELAM isn't even triggered for some cases where I'm fairly
sure the traffic does reach the box, my guess is that the traffic never
makes it to the Replication Engine. That would suggest that the RX MUX
FPGA somehow has one of its four outgoing 16 Gbps channels stuck, or
something like that. Could that happen, and if so, is there any way to
verify it if it happens again? And is it a known bug?
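The only fabric-level counters I can think of to watch are along these
lines, though I have no idea whether the RX MUX FPGA channels are
visible in any of them:

  show fabric errors
  show fabric utilization detail
  show platform hardware capacity fabric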
Lots of questions, hoping for some pointers or answers... Thanks in advance!
Regards,
Jeroen van Ingen
ICT Service Centre
University of Twente, P.O.Box 217, 7500 AE Enschede, The Netherlands