[c-nsp] TCAM troubles on 3750 stack

Alexander Gall gall at switch.ch
Fri Apr 28 15:49:18 EDT 2006


Warning: long message, but all those who are into debugging major
Cisco weirdnesses are going to love it :-) I probably should have
turned this into a TAC case right away, but maybe this information is
useful for others.

One of our 3750 stacks has started to do some very strange things
lately.  This stack consists of a WS-C3750G-24TS-E, which currently is
the master, and a WS-C3750G-16TD-E.  They're running 12.2(25)SEE, but
at least the first of the problems described below has appeared with
12.2(25)SED as well (it actually shows up on the two other 3750 stacks
we have).

One day we noticed that the CPU load was at 80% and throughput was
limited to 100 Mbps.  After a lot of searching around, we found this:

swiCP2#sh platform ip unicast route 
Dumping IOS-HL3U Fib info
Fib 0.0.0.0/0 Tbl:0 Bucket:0
        Path(0)AdjIP:130.59.36.9 Vl:1006 000a.f330.1d80 RWI:0x2
        HL3UFlags:0x28 COVERING FIB ADJ Failed 
        SFT Entry:hdl:0x3C  HwFL:0x4
[...]

and this

swiCP2#sh platform ip unicast failed route 
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
                  129.194.0.0/15 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
        Total of 1 entries covered by 0.0.0.0/0 Tbl:0

I don't know what exactly that means, but the effect was that all
traffic to destinations reached by the default route (this router
doesn't do BGP and uses an OSPF default route) was forwarded in
software.  We're using the "desktop IPv4 and IPv6 default" SDM
template and none of the TCAMs is full. I finally figured out that
doing "clear ip route *" several times in a row fixed this problem on
the master switch, which could indicate a sort of race condition.
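
For anyone who wants to watch for this condition from a script, here
is a minimal Python sketch that pulls the stuck prefixes out of the
"sh platform ip unicast failed route" output.  It assumes the output
has exactly the layout quoted above; the function name and the regular
expression are illustrative only, and the sketch makes no attempt to
interpret what the entries mean.

```python
import re

# Sample text copied from the command output quoted above.
SAMPLE = """\
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
                  129.194.0.0/15 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
        Total of 1 entries covered by 0.0.0.0/0 Tbl:0
"""

# Matches lines of the form "<prefix> Tbl:<n> : Cover:<prefix> Tbl:<n>".
# This layout is assumed from the quoted output, not from any Cisco spec.
ENTRY_RE = re.compile(
    r"^\s*(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+Tbl:\d+\s*:\s*"
    r"Cover:(?P<cover>\d+\.\d+\.\d+\.\d+/\d+)\s+Tbl:\d+\s*$"
)


def failed_prefixes(output):
    """Return (prefix, covering_prefix) pairs listed in the output."""
    pairs = []
    for line in output.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            pairs.append((match.group("prefix"), match.group("cover")))
    return pairs


if __name__ == "__main__":
    for prefix, cover in failed_prefixes(SAMPLE):
        print(f"{prefix} is listed as covered by {cover}")
```

Running the same function on the output of each stack member (via
"remote command") would show which members still list a prefix.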

However, the other stack member has the same problem.  There would be
some other arbitrary prefix (sometimes several of them) that got stuck
in this state, e.g.

swiCP2#remote command 2 sh platform ip unicast failed route
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
                  195.176.224.0/19 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
        Total of 1 entries covered by 0.0.0.0/0 Tbl:0

Now, there seems to be no way to get that switch to reinstall its
TCAM.  The "clear ip route" on the master has no effect, presumably
because the FIB doesn't actually change.  In this state, the slave
switch would continue to send certain traffic to the CPU of the master
switch.  No amount of mucking around with (d)CEF helped.  Has anybody
seen this, or does anyone have a clue what's going on?

This is a lot of fun, but there is more.  We have several dual-homed
hosts attached to this router, one interface to each switch.  Today,
some of the interfaces attached to the slave switch became unreachable
(IPv4 and IPv6).  I could see the packets arriving on the host with
tcpdump but the hosts wouldn't reply.  Turns out that the switch uses
the wrong MAC address when it forwards the packets to the hosts!

For example, the MAC address of one of the affected hosts is
0003.ba9b.07bb.  The ethernet header of a packet captured with tcpdump
shows 

ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 78 arrived at 18:33:6.55
ETHER:  Packet size = 98 bytes
ETHER:  Destination = 0:6:3:c8:a5:f8,
ETHER:  Source      = 0:12:d9:ba:40:ca,
ETHER:  Ethertype = 0800 (IP)
ETHER:

Where the heck does 0:6:3:c8:a5:f8 come from?  This address doesn't
exist anywhere in our network.  The ARP cache and MAC address table on
the switch (master and slave) are OK:

swiCP2#sh arp | inc 130.59.138.25
Internet  130.59.138.25           4   0003.ba9b.07bb  ARPA   Vlan138
swiCP2#sh mac-address-table address 0003.ba9b.07bb
          Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 138    0003.ba9b.07bb    DYNAMIC     Gi2/0/4
Total Mac Addresses for this criterion: 1
swiCP2#remote command 2 sh mac-address-table address 0003.ba9b.07bb
Switch : 2 :
------------
          Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 138    0003.ba9b.07bb    DYNAMIC     Gi2/0/4
Total Mac Addresses for this criterion: 1
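
The capture and the switch print MAC addresses in different styles
(colon-separated octets without leading zeros versus Cisco's dotted
groups), which makes mismatches easy to miss by eye.  The following
Python sketch normalizes both styles before comparing them.  The
helper names are made up for illustration; the addresses are the ones
quoted above.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_mac(text):
    """Return a MAC address as 12 lowercase hex digits."""
    text = text.strip().lower()
    if ":" in text or "-" in text:
        # Colon/dash style may drop leading zeros, e.g. "0:6:3:c8:a5:f8".
        parts = text.replace("-", ":").split(":")
        digits = "".join(part.zfill(2) for part in parts)
    else:
        # Cisco dotted style, e.g. "0003.ba9b.07bb".
        digits = text.replace(".", "")
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a MAC address: {text!r}")
    return digits


def cisco_style(digits):
    """Format 12 hex digits as three dotted groups of four."""
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    expected = normalize_mac("0003.ba9b.07bb")   # from "sh arp"
    observed = normalize_mac("0:6:3:c8:a5:f8")   # from the capture
    if expected != observed:
        print("mismatch:", cisco_style(expected), "vs", cisco_style(observed))
```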

There also is an entry in the mac TCAM of the master

swiCP2#sh platform tcam table mac-address | inc BB
7      B0090003 BA9B07BB

But there's no trace of 0003.ba9b.07bb in the mac TCAM of switch #2

swiCP2#remote command 2 sh platform tcam table mac-address | inc BB
Switch : 2 :
------------

swiCP2#

There are a couple of these weird addresses with OUI 00-06-03
(apparently, this belongs to Baker Hughes Inc. in Houston, TX, but I
don't think they've been taken over by Cisco yet ;-)

swiCP2#remote command 2 sh platform tcam table mac-address | inc 0006
Switch : 2 :
------------
2      F00C0006 03C8C818
3      C00D0006 03C8C678
4      90090006 03C8AFB8
6      90090006 03C8B638
7      A0090006 03C8A5F8
9      F0090006 03C8C338
10     A0090006 03C89C38
11     A0090006 03C8B978
13     A0040006 03C8A2B8
14     90040006 03C89758
17     A0040006 03C8BCB8
18     90090006 03C89F78

including the one that gets used when forwarding to this host
(0006.03C8.A5F8).  Where do these things come from?  I'm prepared to
believe that they are used internally by the switch to do some sort of
magic, but they sure shouldn't show up on the wire.  The interface is
reachable from the router itself, because the router uses the correct
MAC address when it doesn't forward in hardware.
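
To make the TCAM dump easier to compare against the ARP and MAC
tables, here is a rough Python sketch that splits each line into its
index, the leading four hex digits, and the MAC address.  The split is
an assumption inferred from the master's entry "B0090003 BA9B07BB"
matching 0003.ba9b.07bb; what the leading four digits encode is not
known here, so they are passed through untouched.  The function names
are hypothetical.

```python
# Entries copied from the switch #2 output quoted above.
TCAM_DUMP = """\
2      F00C0006 03C8C818
3      C00D0006 03C8C678
4      90090006 03C8AFB8
6      90090006 03C8B638
7      A0090006 03C8A5F8
9      F0090006 03C8C338
10     A0090006 03C89C38
11     A0090006 03C8B978
13     A0040006 03C8A2B8
14     90040006 03C89758
17     A0040006 03C8BCB8
18     90090006 03C89F78
"""


def decode_entry(line):
    """Split one dump line into (index, leading_digits, dotted_mac)."""
    index, word1, word2 = line.split()
    raw = word1 + word2                 # 16 hex digits in total
    leading = raw[:4]                   # meaning unknown, kept verbatim
    mac = raw[4:].lower()               # assumed: low 48 bits are the MAC
    dotted = ".".join(mac[i:i + 4] for i in range(0, 12, 4))
    return int(index), leading, dotted


def entries_with_oui(dump, oui):
    """Return decoded entries whose MAC starts with the 6-digit OUI."""
    oui = oui.lower()
    decoded = [decode_entry(line) for line in dump.splitlines() if line.strip()]
    return [e for e in decoded if e[2].replace(".", "").startswith(oui)]


if __name__ == "__main__":
    for index, leading, mac in entries_with_oui(TCAM_DUMP, "000603"):
        print(index, leading, mac)
```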

You sure get a lot of complexity for your money with these boxes!
BTW, the flash memory in one of these switches failed.  IOS appears to
be unable to map out a single bad block and since the flash is
on-board, we had to replace the entire box...

--
Alex




