[c-nsp] 6500 TCAM overflows; certain hosts unreachable?
Nate Carlson
cisco-nsp at natecarlson.com
Wed Dec 3 12:26:15 EST 2008
We're having some really odd issues with a pair of 6500's. We know that
our TCAM table has overflowed, but it has worked fine up until now (a new
pair of SUP720-10GE's is on order, but not here yet, of course).
Here are the TCAM errors we are getting, which are pretty typical:
Dec 3 10:29:18: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:31:49: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:38:10: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
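For reference, here's roughly what I've been running to confirm the TCAM
state; I'm assuming the 'show mls cef' family of commands is the right
place to look on a SUP720, so corrections welcome:
-- start of commands --
! Is the FIB TCAM in exception (software-switching) mode?
511-cat1#show mls cef exception status
! Configured per-protocol route limits vs. what we're carrying
511-cat1#show mls cef maximum-routes
! Rough count of prefixes currently programmed in hardware
511-cat1#show mls cef summary
-- end of commands --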
Our CPU load looks ok:
cat2:
CPU utilization for five seconds: 3%/1%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)    Invoked      uSecs   5Sec   1Min   5Min TTY Process
 177   103371836 1003824268        102  0.90%  0.44%  0.39%   0 Port manager per
  86    20705308  326963504         63  0.32%  0.09%  0.08%   0 IP Input
   3          52        149        348  0.32%  0.03%  0.00%   1 Virtual Exec
  68     4432208    3722355       1190  0.08%  0.03%  0.01%   0 esw_vlan_stat_pr
 105     2177564    4944501        440  0.08%  0.01%  0.00%   0 IP RIB Update
   5   160343340    8258274      19416  0.00%  0.82%  0.90%   0 Check heaps
cat1:
CPU utilization for five seconds: 0%/0%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)    Invoked      uSecs   5Sec   1Min   5Min TTY Process
  15    98375448  466005765        211  0.32%  0.58%  0.58%   0 ARP Input
   3          24        123        195  0.16%  0.01%  0.00%   1 Virtual Exec
 105     1696000    4795797        353  0.08%  0.01%  0.00%   0 IP RIB Update
   1           0        124          0  0.00%  0.00%  0.00%   0 Chunk Manager
   2       11072    3505348          3  0.00%  0.00%  0.00%   0 Load Meter
   4           0          2          0  0.00%  0.00%  0.00%   0 IpSecMibTopN
   5   161516388    8266617      19538  0.00%  1.04%  0.98%   0 Check heaps
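In case it matters, I was also planning to look at the sorted view and at
the switch processor's CPU (the FIB_EXCEPTION messages are tagged -SP-),
assuming 'remote command switch' is still the right way to reach the SP
from the RP:
-- start of commands --
! RP CPU, sorted by current load
511-cat1#show processes cpu sorted
! SP CPU, since the MLSCEF messages come from the switch processor
511-cat1#remote command switch show processes cpu
-- end of commands --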
We have meshed BGP between these two 6500's and a pair of 7200's, one with
an NPE-G1 and one with an NPE-G2. The ISP connections are on the 7200's,
and we have the routes coming back to the 6500's via iBGP.
These problems all started early this morning, when we swapped the NPE-G1
for an NPE-G2. After that, we started having intermittent connectivity
issues to various IP's on the internet. When we saw those issues, we
swapped the G1 back in, with the same config (verified via Rancid).
From our hosts connected to the 6500's, some remote IP's work fine; for
example (mtr report):
$ mtr --report 216.250.164.1
HOST: nagios Loss% Snt Last Avg Best Wrst StDev
1. x.x.207.14 0.0% 10 108.6 11.4 0.3 108.6 34.2
2. x.x.207.229 0.0% 10 1.2 0.7 0.4 1.2 0.3
3. 207-250-239-5.static.twtelec 0.0% 10 79.7 33.6 0.9 103.9 44.0
4. peer-02-so-0-0-0-0.chcg.twte 0.0% 10 12.2 12.7 11.6 19.1 2.3
5. min-edge-12.inet.qwest.net 0.0% 10 11.7 11.5 11.2 12.1 0.3
6. 67.130.18.94 0.0% 10 13.2 12.3 11.9 13.2 0.4
7. c4500-1.bdr.mpls.iphouse.net 0.0% 10 12.6 13.3 12.3 18.9 2.0
8. c2801-1-uplink.msp.technical 0.0% 10 14.1 12.9 12.0 14.1 0.6
9. oxygen.msp.technicality.org 0.0% 10 12.6 12.8 12.1 14.1 0.6
For other remote IP's, we lose packets starting at the first hop (the .14
address, which is the 6509):
$ mtr --report 67.135.105.97
HOST: nagios Loss% Snt Last Avg Best Wrst StDev
1. x.x.207.14 40.0% 10 0.2 12.9 0.2 74.3 30.1
2. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
3. min-edge-10.inet.qwest.net 80.0% 10 0.9 1.2 0.9 1.6 0.5
4. min-core-01.inet.qwest.net 90.0% 10 1.3 1.3 1.3 1.3 0.0
5. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6. 205.171.139.30 70.0% 10 11.1 11.2 11.0 11.5 0.2
Of course, all the people reporting connectivity issues to us are on IP's
like this where the first hop goes bad.
Now, the really odd part is that from the same 6509, sourcing from the .14
address, I can hit those IP's without any issues:
-- start of output --
511-cat1#ping
Protocol [ip]:
Target IP address: 67.135.105.97
Repeat count [5]: 50
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface: 66.187.207.14
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 50, 100-byte ICMP Echos to 67.135.105.97, timeout is 2 seconds:
Packet sent with a source address of 66.187.207.14
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (50/50), round-trip min/avg/max = 8/10/12 ms
-- end of output --
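In case it helps with the diagnosis, my next step was going to be
comparing the software and hardware FIB entries for a broken destination
versus a working one, roughly like this (I'm assuming 'show mls cef ip' is
the right way to see the hardware entry):
-- start of commands --
! RP (software) CEF entry for a destination that drops at the first hop
511-cat1#show ip cef 67.135.105.97
! Hardware FIB entry for the same destination; I'd expect this to be
! missing or odd if the prefix is one of the software-switched ones
511-cat1#show mls cef ip 67.135.105.97
! Same pair of lookups for a destination that works, for comparison
511-cat1#show ip cef 216.250.164.1
511-cat1#show mls cef ip 216.250.164.1
-- end of commands --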
Are these the types of issues expected with TCAM overflows? It seems odd
to me that our CPU utilization would be low while this is happening,
unless 'sh proc cpu' isn't the right place to look for that?
Appreciate any thoughts. If we can definitively say that TCAM is the
issue, we'll filter our BGP routes (get rid of the /24's); see the sketch
below. My understanding is that to get hardware-switched routes again,
though, we'd have to reboot the 6500 - is that also correct?
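For what it's worth, the filtering I have in mind is just dropping the
long prefixes on the iBGP sessions from the 7200's, roughly like the below
(the prefix-list name, AS number, and neighbor address are made up for the
example, and we'd still need a default or aggregate from the 7200's to
cover whatever gets filtered):
-- start of config sketch --
! Accept only prefixes of /23 and shorter; /24 and longer get dropped
ip prefix-list SHORT-ONLY seq 5 permit 0.0.0.0/0 le 23
!
router bgp 65000
 ! hypothetical iBGP neighbor (one of the 7200's)
 neighbor 192.0.2.1 prefix-list SHORT-ONLY in
-- end of config sketch --
After applying that we'd do a 'clear ip bgp 192.0.2.1 soft in' (assuming
route refresh is supported) rather than bouncing the session.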
Thanks much!
-Nate