[c-nsp] 6500 TCAM overflows; certain hosts unreachable?
Nate Carlson
cisco-nsp at natecarlson.com
Wed Dec 3 12:26:15 EST 2008
We're having some really odd issues with a pair of 6500's. We know that
our TCAM table has overflowed, but it has worked fine up until now (a new
pair of SUP720-10GE's is on order, but not here yet, of course).
Here are the TCAM errors we are getting, which are pretty typical:
Dec 3 10:29:18: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:31:49: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
Dec 3 10:38:10: %MLSCEF-SP-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be software switched
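For reference, here's roughly what I've been running to confirm the TCAM
state; I'm assuming the 'show mls cef' family of commands is the right
place to look on a SUP720, so corrections welcome:
-- start of commands --
! Is the FIB TCAM in exception (software-switching) mode?
511-cat1#show mls cef exception status
! Configured per-protocol route limits vs. what we're carrying
511-cat1#show mls cef maximum-routes
! Rough count of prefixes currently programmed in hardware
511-cat1#show mls cef summary
-- end of commands --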
Our CPU load looks ok:
cat2:
CPU utilization for five seconds: 3%/1%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)    Invoked      uSecs   5Sec   1Min   5Min TTY Process
 177   103371836 1003824268        102  0.90%  0.44%  0.39%   0 Port manager per
  86    20705308  326963504         63  0.32%  0.09%  0.08%   0 IP Input
   3          52        149        348  0.32%  0.03%  0.00%   1 Virtual Exec
  68     4432208    3722355       1190  0.08%  0.03%  0.01%   0 esw_vlan_stat_pr
 105     2177564    4944501        440  0.08%  0.01%  0.00%   0 IP RIB Update
   5   160343340    8258274      19416  0.00%  0.82%  0.90%   0 Check heaps
cat1:
CPU utilization for five seconds: 0%/0%; one minute: 6%; five minutes: 7%
 PID Runtime(ms)    Invoked      uSecs   5Sec   1Min   5Min TTY Process
  15    98375448  466005765        211  0.32%  0.58%  0.58%   0 ARP Input
   3          24        123        195  0.16%  0.01%  0.00%   1 Virtual Exec
 105     1696000    4795797        353  0.08%  0.01%  0.00%   0 IP RIB Update
   1           0        124          0  0.00%  0.00%  0.00%   0 Chunk Manager
   2       11072    3505348          3  0.00%  0.00%  0.00%   0 Load Meter
   4           0          2          0  0.00%  0.00%  0.00%   0 IpSecMibTopN
   5   161516388    8266617      19538  0.00%  1.04%  0.98%   0 Check heaps
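In case it matters, I was also planning to look at the sorted view and at
the switch processor's CPU (the FIB_EXCEPTION messages are tagged -SP-),
assuming 'remote command switch' is still the right way to reach the SP
from the RP:
-- start of commands --
! RP CPU, sorted by current load
511-cat1#show processes cpu sorted
! SP CPU, since the MLSCEF messages come from the switch processor
511-cat1#remote command switch show processes cpu
-- end of commands --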
We have meshed BGP between these two 6500's and a pair of 7200's, one with
an NPE-G1 and one with an NPE-G2. The ISP connections are on the 7200's,
and we have the routes coming back to the 6500's via iBGP.
These problems all started early this morning, when we swapped the NPE-G1
for an NPE-G2. After that, we started having intermittent connectivity
issues to various IP's on the internet. When we saw those issues, we
swapped the G1 back in, with the same config (verified via Rancid).
From our hosts connected to the 6500's, some remote IP's work fine; for
example (mtr report):
$ mtr --report 216.250.164.1
HOST: nagios Loss% Snt Last Avg Best Wrst StDev
1. x.x.207.14 0.0% 10 108.6 11.4 0.3 108.6 34.2
2. x.x.207.229 0.0% 10 1.2 0.7 0.4 1.2 0.3
3. 207-250-239-5.static.twtelec 0.0% 10 79.7 33.6 0.9 103.9 44.0
4. peer-02-so-0-0-0-0.chcg.twte 0.0% 10 12.2 12.7 11.6 19.1 2.3
5. min-edge-12.inet.qwest.net 0.0% 10 11.7 11.5 11.2 12.1 0.3
6. 67.130.18.94 0.0% 10 13.2 12.3 11.9 13.2 0.4
7. c4500-1.bdr.mpls.iphouse.net 0.0% 10 12.6 13.3 12.3 18.9 2.0
8. c2801-1-uplink.msp.technical 0.0% 10 14.1 12.9 12.0 14.1 0.6
9. oxygen.msp.technicality.org 0.0% 10 12.6 12.8 12.1 14.1 0.6
For other remote IP's, we lose packets starting at the first hop (the .14
address, which is the 6509):
$ mtr --report 67.135.105.97
HOST: nagios Loss% Snt Last Avg Best Wrst StDev
1. x.x.207.14 40.0% 10 0.2 12.9 0.2 74.3 30.1
2. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
3. min-edge-10.inet.qwest.net 80.0% 10 0.9 1.2 0.9 1.6 0.5
4. min-core-01.inet.qwest.net 90.0% 10 1.3 1.3 1.3 1.3 0.0
5. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6. 205.171.139.30 70.0% 10 11.1 11.2 11.0 11.5 0.2
Of course, all the people reporting connectivity issues to us are on IP's
like this where the first hop goes bad.
Now, the really odd part is that from the same 6509, sourcing from the .14
address, I can hit those IP's without any issues:
-- start of output --
511-cat1#ping
Protocol [ip]:
Target IP address: 67.135.105.97
Repeat count [5]: 50
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface: 66.187.207.14
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 50, 100-byte ICMP Echos to 67.135.105.97, timeout is 2 seconds:
Packet sent with a source address of 66.187.207.14
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (50/50), round-trip min/avg/max = 8/10/12 ms
-- end of output --
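In case it helps with the diagnosis, my next step was going to be
comparing the software and hardware FIB entries for a broken destination
versus a working one, roughly like this (I'm assuming 'show mls cef ip' is
the right way to see the hardware entry):
-- start of commands --
! RP (software) CEF entry for a destination that drops at the first hop
511-cat1#show ip cef 67.135.105.97
! Hardware FIB entry for the same destination; I'd expect this to be
! missing or odd if the prefix is one of the software-switched ones
511-cat1#show mls cef ip 67.135.105.97
! Same pair of lookups for a destination that works, for comparison
511-cat1#show ip cef 216.250.164.1
511-cat1#show mls cef ip 216.250.164.1
-- end of commands --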
Are these the types of issues expected with TCAM overflows? It seems odd
to me that our CPU utilization would be low while this is happening,
unless 'sh proc cpu' isn't the right place to look for that?
Appreciate any thoughts. If we can definitively say that TCAM is the
issue, we'll filter our BGP routes (get rid of the /24's); see the sketch
below. My understanding is that to get hardware-switched routes again,
though, we'd have to reboot the 6500 - is that also correct?
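For what it's worth, the filtering I have in mind is just dropping the
long prefixes on the iBGP sessions from the 7200's, roughly like the below
(the prefix-list name, AS number, and neighbor address are made up for the
example, and we'd still need a default or aggregate from the 7200's to
cover whatever gets filtered):
-- start of config sketch --
! Accept only prefixes of /23 and shorter; /24 and longer get dropped
ip prefix-list SHORT-ONLY seq 5 permit 0.0.0.0/0 le 23
!
router bgp 65000
 ! hypothetical iBGP neighbor (one of the 7200's)
 neighbor 192.0.2.1 prefix-list SHORT-ONLY in
-- end of config sketch --
After applying that we'd do a 'clear ip bgp 192.0.2.1 soft in' (assuming
route refresh is supported) rather than bouncing the session.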
Thanks much!
-Nate