[c-nsp] Cisco Cat6k Sup2T: 6904 line card has extreme overruns after a traffic spike
Jeroen van Ingen
jeroen at zijndomein.nl
Mon Nov 9 14:35:22 EST 2015
Hi everyone,
I'm hoping to find someone with in-depth knowledge of newer Cat6k
hardware, specifically the WS-X6904-40G line card. Looks like we're
running into a very interesting bug and I haven't found anything
matching it in the Bug Search Tool yet.
What we've seen twice now: after a short traffic spike (lots of clients
syncing an update from an FTP mirror in our network; peak is 4 Gbps @ 10
min avg, 7 Gbps @ 1 min avg), the router somehow refuses to forward
certain flows. Specific source/destination IP pairs that worked before
aren't processed anymore; say I1-I4 are internal to our network and
E1-E4 are external, then e.g. I1 can communicate with E1-E3 but not E4,
while E4 can communicate with I2-I4 but not I1. The router also drops
some OSPF adjacencies, some BGP neighbors and some PIM neighborships. No
LAGs/etherchannels are involved. An ELAM capture for a non-working
src/dst combination sees one direction of the flow correctly (rewrite is
okay, dest index okay, etc.), but when we set up the reverse ELAM
capture, the triggers never fire.
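For reference, the Sup2T capture sequence we used was roughly the
following (from memory; I've left a placeholder for the trigger match
because I don't recall the exact keyword names, they're in Cisco's
Sup2T ELAM technote):

  show platform capture elam asic eureka slot 2
  show platform capture elam trigger dbus ipv4 ingress if <src/dst IP of the broken flow>
  show platform capture elam start
  show platform capture elam status
  show platform capture elam data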
CPU load is normal, and the number of packets punted to the CPU (show
ibc & looking at a netdr capture) is normal. Input overruns, especially
for a 40 Gbps interface, are extremely high: e.g. 80 kpps of overruns
against 160 kpps of input, according to the counters (5 min avg).
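For completeness, those numbers come from commands like these (nothing
unusual in any of the output except the overrun counter):

  show ibc
  debug netdr capture rx
  show netdr captured-packets
  show interfaces Fo2/1 | include rate|overrun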
We've moved almost all IPv4 routing away by selectively shutting BGP and
OSPF sessions, so the box now only processes IPv6; even then, overruns
keep increasing: at 10 kpps input (about 100 Mbps) I still see 160 pps
of overruns on Fo2/1. At those traffic levels I don't believe the
overruns can be explained by oversubscription. Input discards stay
close to zero, so even at these low traffic levels it appears that
traffic is dropped in the hardware receive path before a lookup is even
done.
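To put numbers on that last point:

  160 pps overruns / 10,000 pps input   = 1.6% of packets dropped
  ~100 Mbps offered / 40 Gbps line rate = ~0.25% port utilisation

A 1.6% drop rate at a quarter of a percent of line rate is orders of
magnitude away from anything oversubscription could explain.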
We've been running IOS 15.1(1)SY3 on this box since it was released in
March 2014; today is the second time we've seen this happen. The first
time was ~3 weeks ago and could only be fixed by a reload (a last
resort after 7-8 hours of troubleshooting).
We'll be upgrading to 15.1(2)SY6 tomorrow morning, but since I haven't
found a bug ID that resembles what we've seen here, I'm not sure it
will stop this from recurring.
By the way: changing the "load-balance" config at interface level, i.e.
setting the Fo2/1 load-balance from the default "src-dst-ip" to "mpls"
or some other option, changes which hosts have connectivity.
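In config terms that's just the following (full keyword list from
memory, but "src-dst-ip" and "mpls" are definitely among the options):

  interface FortyGigabitEthernet2/1
   load-balance mpls
  ! default: load-balance src-dst-ip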
Looking at the architecture
(http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-6500-series-switches/white_paper_c11-696669.html):
given that ELAM isn't even triggered for some cases where I'm fairly
sure the traffic does reach the box, my guess is that the traffic never
makes it to the Replication Engine. That would suggest that the RX MUX
FPGA somehow has one of its four outgoing 16 Gbps channels stuck, or
something like that. Could that happen, and if so, is there any way to
verify it if it happens again? And is it a known bug?
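The only fabric-level counters I can think of to watch are along these
lines, though I have no idea whether the RX MUX FPGA channels are
visible in any of them:

  show fabric errors
  show fabric utilization detail
  show platform hardware capacity fabric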
Lots of questions, hoping for some pointers or answers... Thanks in advance!
Regards,
Jeroen van Ingen
ICT Service Centre
University of Twente, P.O.Box 217, 7500 AE Enschede, The Netherlands