[c-nsp] 6500 TCAM overflows; certain hosts unreachable?

John van Oppen john at vanoppen.com
Wed Dec 3 13:12:50 EST 2008


Do you have a reason you can't do a partial BGP feed with a default
route between the 7200s and the 6500s to lower the table size?    
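
Something along these lines on each 7200 would do it; the peer address,
ASN, and prefix-list name below are just placeholders, so adjust to match
your setup:

  router bgp 65000
   ! iBGP session toward a 6500: send a default plus only a trimmed-down
   ! set of routes (here, anything /23 or shorter, as an example policy)
   neighbor 192.0.2.1 remote-as 65000
   neighbor 192.0.2.1 default-originate
   neighbor 192.0.2.1 prefix-list PARTIAL-TO-6500 out
  !
  ip prefix-list PARTIAL-TO-6500 seq 5 permit 0.0.0.0/0
  ip prefix-list PARTIAL-TO-6500 seq 10 permit 0.0.0.0/0 le 23

The 6500s would then just follow the default toward the 7200s for anything
they no longer carry a specific route for.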

-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Nate Carlson
Sent: Wednesday, December 03, 2008 9:26 AM
To: cisco-nsp at puck.nether.net
Subject: [c-nsp] 6500 TCAM overflows; certain hosts unreachable?

We're having some really odd issues with a pair of 6500's. We know that
our TCAM table has overflowed, but it's worked fine up until now (a new
pair of SUP720-10GE's is on order, but not here yet, of course).

Here are the TCAM errors we are getting, which are pretty typical:

Dec  3 10:29:18: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some
entries will be software switched
Dec  3 10:31:49: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some
entries will be software switched
Dec  3 10:38:10: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some
entries will be software switched
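
(If it helps, we can also pull the FIB/TCAM state from the sup; I believe
something like the following shows the configured limits, current usage,
and whether the exception is still set, though the exact syntax and output
vary a bit by SXF/SXH release:)

  show mls cef maximum-routes
  show mls cef summary
  show mls cef exception status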

Our CPU load looks ok:

cat2:
CPU utilization for five seconds: 3%/1%; one minute: 6%; five minutes: 7%
  PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
  177   1033718361003824268        102  0.90%  0.44%  0.39%   0 Port manager per
   86    20705308 326963504         63  0.32%  0.09%  0.08%   0 IP Input
    3          52       149        348  0.32%  0.03%  0.00%   1 Virtual Exec
   68     4432208   3722355       1190  0.08%  0.03%  0.01%   0 esw_vlan_stat_pr
  105     2177564   4944501        440  0.08%  0.01%  0.00%   0 IP RIB Update
    5   160343340   8258274      19416  0.00%  0.82%  0.90%   0 Check heaps

cat1:
CPU utilization for five seconds: 0%/0%; one minute: 6%; five minutes: 7%
  PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
   15    98375448 466005765        211  0.32%  0.58%  0.58%   0 ARP Input
    3          24       123        195  0.16%  0.01%  0.00%   1 Virtual Exec
  105     1696000   4795797        353  0.08%  0.01%  0.00%   0 IP RIB Update
    1           0       124          0  0.00%  0.00%  0.00%   0 Chunk Manager
    2       11072   3505348          3  0.00%  0.00%  0.00%   0 Load Meter
    4           0         2          0  0.00%  0.00%  0.00%   0 IpSecMibTopN
    5   161516388   8266617      19538  0.00%  1.04%  0.98%   0 Check heaps

We have meshed BGP between these two 6500's and a pair of 7200's, one with
an NPE-G1 and one with an NPE-G2. The ISP connections are on the 7200's,
and we have the routes coming back to the 6500's via iBGP.
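
Nothing exotic on the BGP side; it's roughly like this on each 6500 (ASN
and addresses sanitized):

  router bgp 65000
   ! iBGP to the two 7200s carrying the transit sessions
   neighbor 10.0.0.1 remote-as 65000
   neighbor 10.0.0.2 remote-as 65000
   ! iBGP to the other 6500
   neighbor 10.0.0.3 remote-as 65000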

These problems all started early this morning, when we swapped the NPE-G1
for an NPE-G2. After that, we started having intermittent connectivity
issues to various IP's on the internet. When we saw those issues, we
swapped the G1 back in, with the same config (verified via Rancid).

From our hosts connected to the 6500's, some remote IP's work fine, e.g.
(mtr report):

$ mtr --report 216.250.164.1
HOST: nagios                      Loss%   Snt   Last   Avg  Best  Wrst  StDev
  1. x.x.207.14                    0.0%    10  108.6  11.4   0.3 108.6   34.2
  2. x.x.207.229                   0.0%    10    1.2   0.7   0.4   1.2    0.3
  3. 207-250-239-5.static.twtelec  0.0%    10   79.7  33.6   0.9 103.9   44.0
  4. peer-02-so-0-0-0-0.chcg.twte  0.0%    10   12.2  12.7  11.6  19.1    2.3
  5. min-edge-12.inet.qwest.net    0.0%    10   11.7  11.5  11.2  12.1    0.3
  6. 67.130.18.94                  0.0%    10   13.2  12.3  11.9  13.2    0.4
  7. c4500-1.bdr.mpls.iphouse.net  0.0%    10   12.6  13.3  12.3  18.9    2.0
  8. c2801-1-uplink.msp.technical  0.0%    10   14.1  12.9  12.0  14.1    0.6
  9. oxygen.msp.technicality.org   0.0%    10   12.6  12.8  12.1  14.1    0.6

For other remote IP's, we lose packets at the first hop, .14 (which is the
6509):

$ mtr --report 67.135.105.97
HOST: nagios                      Loss%   Snt   Last   Avg  Best  Wrst  StDev
  1. x.x.207.14                   40.0%    10    0.2  12.9   0.2  74.3   30.1
  2. ???                          100.0    10    0.0   0.0   0.0   0.0    0.0
  3. min-edge-10.inet.qwest.net   80.0%    10    0.9   1.2   0.9   1.6    0.5
  4. min-core-01.inet.qwest.net   90.0%    10    1.3   1.3   1.3   1.3    0.0
  5. ???                          100.0    10    0.0   0.0   0.0   0.0    0.0
  6. 205.171.139.30               70.0%    10   11.1  11.2  11.0  11.5    0.2

Of course, all the people reporting connectivity issues to us are on IP's
like this, where the first hop goes bad.

Now, the really odd part is that from the same 6509, sourcing from the .14
address, I can hit those IP's without any issues:

-- start of output --
511-cat1#ping
Protocol [ip]:
Target IP address: 67.135.105.97
Repeat count [5]: 50
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface: 66.187.207.14
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 50, 100-byte ICMP Echos to 67.135.105.97, timeout is 2 seconds:
Packet sent with a source address of 66.187.207.14
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (50/50), round-trip min/avg/max = 8/10/12 ms
-- end of output --

Are these the kind of issues you'd expect with TCAM overflows? It seems
odd to me that our CPU utilization would be this low while we're having
these problems, unless 'sh proc cpu' isn't the right place to look for
that?
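
Since the MLSCEF messages are coming from the SP, I assume the switch
processor's CPU is worth checking separately from the RP numbers above;
something like this should show it (command name from memory):

  remote command switch show process cpu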

Appreciate any thoughts. If we can definitively say that TCAM is the
issue, we'll filter our BGP routes (get rid of the /24's). My
understanding, though, is that to get routes hardware-switched again we'd
have to reboot the 6500 - is that also correct?
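
For the filtering itself, I was thinking of something along these lines on
the iBGP sessions from the 7200s (the list name and neighbor address are
placeholders, and we'd obviously need to keep a default or covering
aggregate for whatever gets dropped):

  ip prefix-list NO-SLASH-24 seq 5 deny 0.0.0.0/0 ge 24
  ip prefix-list NO-SLASH-24 seq 10 permit 0.0.0.0/0 le 23
  !
  router bgp 65000
   neighbor 10.0.0.1 prefix-list NO-SLASH-24 in
  !
  ! then a route refresh rather than a hard reset:
  ! clear ip bgp 10.0.0.1 soft in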

Thanks much!

-Nate