[c-nsp] 6500 TCAM overflows; certain hosts unreachable?
John van Oppen
john at vanoppen.com
Wed Dec 3 13:12:50 EST 2008
Is there a reason you can't run a partial BGP feed plus a default route
between the 7200s and the 6500s to lower the table size?
-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Nate Carlson
Sent: Wednesday, December 03, 2008 9:26 AM
To: cisco-nsp at puck.nether.net
Subject: [c-nsp] 6500 TCAM overflows; certain hosts unreachable?
We're having some really odd issues with a pair of 6500's. We know that
our TCAM table is overflowed, but it's worked fine up until now (new
pair of SUP720-10GE's on order, but not here yet, of course.)
Here's the TCAM errors we are getting, which are pretty typical:
Dec 3 10:29:18: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:31:49: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:38:10: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Our CPU load looks ok:
cat2:
CPU utilization for five seconds: 3%/1%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)     Invoked  uSecs   5Sec   1Min   5Min TTY Process
 177     1033718361003824268    102  0.90%  0.44%  0.39%   0 Port manager per
  86    20705308   326963504     63  0.32%  0.09%  0.08%   0 IP Input
   3          52         149    348  0.32%  0.03%  0.00%   1 Virtual Exec
  68     4432208     3722355   1190  0.08%  0.03%  0.01%   0 esw_vlan_stat_pr
 105     2177564     4944501    440  0.08%  0.01%  0.00%   0 IP RIB Update
   5   160343340     8258274  19416  0.00%  0.82%  0.90%   0 Check heaps
cat1:
CPU utilization for five seconds: 0%/0%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)     Invoked  uSecs   5Sec   1Min   5Min TTY Process
  15    98375448   466005765    211  0.32%  0.58%  0.58%   0 ARP Input
   3          24         123    195  0.16%  0.01%  0.00%   1 Virtual Exec
 105     1696000     4795797    353  0.08%  0.01%  0.00%   0 IP RIB Update
   1           0         124      0  0.00%  0.00%  0.00%   0 Chunk Manager
   2       11072     3505348      3  0.00%  0.00%  0.00%   0 Load Meter
   4           0           2      0  0.00%  0.00%  0.00%   0 IpSecMibTopN
   5   161516388     8266617  19538  0.00%  1.04%  0.98%   0 Check heaps
We have meshed BGP between these two 6500's and a pair of 7200's, one
with an NPE-G1 and one with an NPE-G2. The ISP connections are on the
7200's, and we have the routes coming back to the 6500's via iBGP.
These problems all started early this morning, when we swapped the
NPE-G1 for an NPE-G2. After that, we started having intermittent
connectivity issues to various IP's on the internet. When we saw those
issues, we swapped the G1 back in, with the same config (verified via
Rancid.)
From our hosts connected to the 6500's, some remote IP's work fine, e.g.
(mtr report):
$ mtr --report 216.250.164.1
HOST: nagios                       Loss%  Snt   Last    Avg   Best   Wrst  StDev
  1. x.x.207.14                     0.0%   10  108.6   11.4    0.3  108.6   34.2
  2. x.x.207.229                    0.0%   10    1.2    0.7    0.4    1.2    0.3
  3. 207-250-239-5.static.twtelec   0.0%   10   79.7   33.6    0.9  103.9   44.0
  4. peer-02-so-0-0-0-0.chcg.twte   0.0%   10   12.2   12.7   11.6   19.1    2.3
  5. min-edge-12.inet.qwest.net     0.0%   10   11.7   11.5   11.2   12.1    0.3
  6. 67.130.18.94                   0.0%   10   13.2   12.3   11.9   13.2    0.4
  7. c4500-1.bdr.mpls.iphouse.net   0.0%   10   12.6   13.3   12.3   18.9    2.0
  8. c2801-1-uplink.msp.technical   0.0%   10   14.1   12.9   12.0   14.1    0.6
  9. oxygen.msp.technicality.org    0.0%   10   12.6   12.8   12.1   14.1    0.6
For other remote IP's, we lose packets at the first hop, .14 (which is
the 6509):
$ mtr --report 67.135.105.97
HOST: nagios                       Loss%  Snt   Last    Avg   Best   Wrst  StDev
  1. x.x.207.14                    40.0%   10    0.2   12.9    0.2   74.3   30.1
  2. ???                           100.0   10    0.0    0.0    0.0    0.0    0.0
  3. min-edge-10.inet.qwest.net    80.0%   10    0.9    1.2    0.9    1.6    0.5
  4. min-core-01.inet.qwest.net    90.0%   10    1.3    1.3    1.3    1.3    0.0
  5. ???                           100.0   10    0.0    0.0    0.0    0.0    0.0
  6. 205.171.139.30                70.0%   10   11.1   11.2   11.0   11.5    0.2
Of course, all the people reporting connectivity issues to us are on
IP's like this where the first hop goes bad.
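
For comparing many destinations at once, the mtr reports can be reduced
to two numbers per target: the first hop that shows any loss, and the
loss at the final hop. The following is only a rough sketch, assuming
the one-row-per-hop layout that "mtr --report" prints; the function
names are made up for illustration and the sample text is the second
report above.

```python
import re

# Matches "  N. host  loss[%]  sent ..." rows; header and prompt lines are skipped.
ROW = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+(\d+(?:\.\d+)?)%?\s+(\d+)\b")

def parse_mtr_report(text):
    """Return a list of (hop, host, loss_percent, sent) tuples."""
    hops = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2),
                         float(m.group(3)), int(m.group(4))))
    return hops

def first_lossy_hop(hops):
    """Return the first hop tuple with non-zero loss, or None."""
    for hop in hops:
        if hop[2] > 0.0:
            return hop
    return None

SAMPLE = """\
$ mtr --report 67.135.105.97
HOST: nagios                       Loss%  Snt   Last    Avg   Best   Wrst  StDev
  1. x.x.207.14                    40.0%   10    0.2   12.9    0.2   74.3   30.1
  2. ???                           100.0   10    0.0    0.0    0.0    0.0    0.0
  3. min-edge-10.inet.qwest.net    80.0%   10    0.9    1.2    0.9    1.6    0.5
  4. min-core-01.inet.qwest.net    90.0%   10    1.3    1.3    1.3    1.3    0.0
  5. ???                           100.0   10    0.0    0.0    0.0    0.0    0.0
  6. 205.171.139.30                70.0%   10   11.1   11.2   11.0   11.5    0.2
"""

hops = parse_mtr_report(SAMPLE)
lossy = first_lossy_hop(hops)
print("first lossy hop:", lossy[0], lossy[1], "loss at last hop:", hops[-1][2])
```

Loss that appears at an intermediate hop but not at the final hop is
often just ICMP rate limiting on that router, so the final-hop figure is
the one to compare across destinations.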
Now, the really odd part is that from the same 6509, sourcing from the
.14 address, I can hit those IP's without any issues:
-- start of output --
511-cat1>#ping
Protocol [ip]:
Target IP address: 67.135.105.97
Repeat count [5]: 50
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface: 66.187.207.14
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 50, 100-byte ICMP Echos to 67.135.105.97, timeout is 2 seconds:
Packet sent with a source address of 66.187.207.14
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (50/50), round-trip min/avg/max = 8/10/12 ms
-- end of output --
Are these the type of issues expected with TCAM overflows? It seems odd
to me that our CPU utilization would be low, but we'd be having these,
unless 'sh proc cpu' isn't the right place to look for that?
Appreciate any thoughts. If we can definitively say that TCAM is the
issue, we'll filter our BGP routes (get rid of the /24's). My
understanding, though, is that to get hardware-switched routes again
we'd have to reboot the 6500 - is that also correct?
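
To get a feel for what dropping the /24's would save, the prefix list
from a table dump can be counted offline before touching the routers.
This is a minimal sketch using Python's standard ipaddress module; the
prefixes are invented placeholders (not taken from our table) and the
function names are hypothetical. It also shows why a default route has
to stay in place once the long prefixes are filtered.

```python
import ipaddress

def filter_prefixes(prefixes, max_len=23):
    """Keep only prefixes whose mask length is <= max_len (drops /24 and longer)."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [n for n in nets if n.prefixlen <= max_len]

def still_reachable(addr, nets):
    """True if some remaining prefix (including a default route) covers addr."""
    ip = ipaddress.ip_address(addr)
    return any(ip in n for n in nets)

# Placeholder table: a default route, two short prefixes, three /24's.
TABLE = [
    "0.0.0.0/0",
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.0.2.0/24",
    "198.51.100.0/24",
    "203.0.113.0/24",
]

kept = filter_prefixes(TABLE)
print("entries before:", len(TABLE), "after:", len(kept))

# With the default route kept, a host in a filtered /24 is still covered.
print(still_reachable("192.0.2.1", kept))

# Without a default route, the same host has no covering prefix.
no_default = [n for n in kept if n.prefixlen > 0]
print(still_reachable("192.0.2.1", no_default))
```

The same count run against a real prefix dump gives the number of FIB
entries the filter would leave, which can then be compared with the
TCAM capacity of the supervisor in use.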
Thanks much!
-Nate
_______________________________________________
cisco-nsp mailing list cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/