[c-nsp] TCAM troubles on 3750 stack
Alexander Gall
gall at switch.ch
Fri Apr 28 15:49:18 EDT 2006
Warning: long message, but all those who are into debugging major
Cisco weirdnesses are going to love it :-) I should have probably
turned this into a TAC case right away, but maybe this information is
useful for others.
One of our 3750 stacks has started to do some very strange things
lately. This stack consists of a WS-C3750G-24TS-E, which currently is
the master, and a WS-C3750G-16TD-E. They're running 12.2(25)SEE, but
at least the first of the problems described below has appeared with
12.2(25)SED as well (it actually shows up on the two other 3750 stacks
we have).
One day we noticed that the CPU load was at 80% and throughput was
limited to 100 Mbps. After a lot of searching around, we found this:
swiCP2#sh platform ip unicast route
Dumping IOS-HL3U Fib info
Fib 0.0.0.0/0 Tbl:0 Bucket:0
Path(0)AdjIP:130.59.36.9 Vl:1006 000a.f330.1d80 RWI:0x2
HL3UFlags:0x28 COVERING FIB ADJ Failed
SFT Entry:hdl:0x3C HwFL:0x4
[...]
and this
swiCP2#sh platform ip unicast failed route
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
129.194.0.0/15 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
Total of 1 entries covered by 0.0.0.0/0 Tbl:0
I don't know what exactly that means, but the effect was that all
traffic to destinations reached by the default route (this router
doesn't do BGP and uses an OSPF default route) was forwarded in
software. We're using the "desktop IPv4 and IPv6 default" SDM
template and none of the TCAMs is full. I finally figured out that
doing "clear ip route *" several times in a row fixed this problem on
the master switch, which could indicate a sort of race condition.
However, the other stack member has the same problem. There would be
some other arbitrary prefix (sometimes several of them) that got stuck
in this state, e.g.
swiCP2#remote command 2 sh platform ip unicast failed route
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
195.176.224.0/19 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
Total of 1 entries covered by 0.0.0.0/0 Tbl:0
Now, there seems to be no way to get that switch to reinstall its
TCAM. The "clear ip route" on the master has no effect, presumably
because the FIB doesn't actually change. In this state, the slave
switch would continue to send certain traffic to the CPU of the master
switch. No amount of mucking around with (d)CEF helped. Has anybody
seen this, or does anyone have a clue what's going on?
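For anyone who wants to keep an eye on this across several stacks, here is a minimal Python sketch that pulls the stuck prefixes out of the "sh platform ip unicast failed route" output. It assumes the output always looks like the two samples above; the function name and the regular expression are my own illustration, not anything Cisco provides, and how you collect the text from the switch is left out.

```python
import re

# Sample text copied from the "failed route" output of switch 2 above.
SAMPLE = """\
Total of 0 covering fib entries
Entries covered by Actual default route(0.0.0.0/0)
195.176.224.0/19 Tbl:0 : Cover:0.0.0.0/0 Tbl:0
Total of 1 entries covered by 0.0.0.0/0 Tbl:0
"""

# Assumed line shape: "<prefix> Tbl:<n> : Cover:<prefix> Tbl:<n>".
PREFIX = r"\d+\.\d+\.\d+\.\d+/\d+"
ENTRY = re.compile(
    rf"^\s*({PREFIX})\s+Tbl:\d+\s*:\s*Cover:({PREFIX})\s+Tbl:\d+\s*$"
)


def failed_prefixes(text):
    """Return (prefix, covering_prefix) pairs found in the command output."""
    pairs = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    for prefix, cover in failed_prefixes(SAMPLE):
        print(f"{prefix} is covered by {cover}")
```

Running this periodically against each stack member would at least show when a prefix gets stuck, even if it doesn't explain why.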
This is a lot of fun, but there is more. We have several dual-homed
hosts attached to this router, one interface to each switch. Today,
some of the interfaces attached to the slave switch became unreachable
(IPv4 and IPv6). I could see the packets arriving on the host with
tcpdump but the hosts wouldn't reply. Turns out that the switch uses
the wrong MAC address when it forwards the packets to the hosts!
For example, the MAC address of one of the affected hosts is
0003.ba9b.07bb. The Ethernet header of a packet captured with tcpdump
shows:
ETHER: ----- Ether Header -----
ETHER:
ETHER: Packet 78 arrived at 18:33:6.55
ETHER: Packet size = 98 bytes
ETHER: Destination = 0:6:3:c8:a5:f8,
ETHER: Source = 0:12:d9:ba:40:ca,
ETHER: Ethertype = 0800 (IP)
ETHER:
Where the heck does 0:6:3:c8:a5:f8 come from? This address doesn't
exist anywhere in our network. The ARP cache and MAC address table on
the switch (master and slave) are OK:
swiCP2#sh arp | inc 130.59.138.25
Internet 130.59.138.25 4 0003.ba9b.07bb ARPA Vlan138
swiCP2#sh mac-address-table address 0003.ba9b.07bb
Mac Address Table
-------------------------------------------
Vlan Mac Address Type Ports
---- ----------- -------- -----
138 0003.ba9b.07bb DYNAMIC Gi2/0/4
Total Mac Addresses for this criterion: 1
swiCP2#remote command 2 sh mac-address-table address 0003.ba9b.07bb
Switch : 2 :
------------
Mac Address Table
-------------------------------------------
Vlan Mac Address Type Ports
---- ----------- -------- -----
138 0003.ba9b.07bb DYNAMIC Gi2/0/4
Total Mac Addresses for this criterion: 1
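Since the capture prints addresses as colon-separated octets without leading zeros (0:6:3:c8:a5:f8) and IOS prints them as dotted groups of four hex digits (0003.ba9b.07bb), it is easy to misread one for the other. The following is a small Python sketch for comparing the two notations; the helper names are hypothetical and it only handles the two formats that appear in this message, plus dash-separated octets.

```python
def mac_to_int(text):
    """Parse a MAC address in octet or Cisco dotted notation to an integer."""
    text = text.strip().lower()
    if "." in text:
        # Cisco style: three groups of four hex digits, e.g. 0003.ba9b.07bb
        groups = text.split(".")
        if len(groups) != 3 or any(len(g) != 4 for g in groups):
            raise ValueError(f"not a dotted MAC address: {text!r}")
        digits = "".join(groups)
    else:
        # Octet style, leading zeros may be dropped, e.g. 0:6:3:c8:a5:f8
        groups = text.replace("-", ":").split(":")
        if len(groups) != 6 or any(not 1 <= len(g) <= 2 for g in groups):
            raise ValueError(f"not an octet-style MAC address: {text!r}")
        digits = "".join(g.zfill(2) for g in groups)
    return int(digits, 16)


def to_cisco(value):
    """Format a 48-bit integer as lowercase Cisco dotted notation."""
    digits = f"{value:012x}"
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    on_the_wire = mac_to_int("0:6:3:c8:a5:f8")   # destination seen by the host
    expected = mac_to_int("0003.ba9b.07bb")      # address in ARP/MAC tables
    print(to_cisco(on_the_wire), to_cisco(expected), on_the_wire == expected)
```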
There is also an entry in the MAC TCAM of the master:
swiCP2#sh platform tcam table mac-address | inc BB
7 B0090003 BA9B07BB
But there's no trace of 0003.ba9b.07bb in the MAC TCAM of switch #2:
swiCP2#remote command 2 sh platform tcam table mac-address | inc BB
Switch : 2 :
------------
swiCP2#
There are a couple of these weird addresses with OUI 00-06-03
(apparently, this belongs to Baker Hughes Inc. in Houston, TX, but I
don't think they've been taken over by Cisco yet ;-)
swiCP2#remote command 2 sh platform tcam table mac-address | inc 0006
Switch : 2 :
------------
2 F00C0006 03C8C818
3 C00D0006 03C8C678
4 90090006 03C8AFB8
6 90090006 03C8B638
7 A0090006 03C8A5F8
9 F0090006 03C8C338
10 A0090006 03C89C38
11 A0090006 03C8B978
13 A0040006 03C8A2B8
14 90040006 03C89758
17 A0040006 03C8BCB8
18 90090006 03C89F78
including the one that gets used when forwarding to this host
(0006.03C8.A5F8). Where do these things come from? I'm prepared to
believe that they are used internally by the switch to do some sort of
magic, but they sure shouldn't show up on the wire. The interface is
reachable from the router itself, because the router uses the correct
MAC address when it doesn't forward in hardware.
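To make the TCAM dumps above easier to read, here is a minimal Python sketch that splits each row into its index, the upper 16 bits of the first word, and a MAC address built from the lower 16 bits of the first word plus the whole second word. That layout is only inferred from the two cases in this message (B0090003 BA9B07BB for 0003.ba9b.07bb and A0090006 03C8A5F8 for 0006.03C8.A5F8); I don't know what the upper 16 bits mean, so the sketch just passes them through. The function name is made up for illustration.

```python
# Rows copied from the "sh platform tcam table mac-address" output of switch 2.
ROWS = [
    "2 F00C0006 03C8C818", "3 C00D0006 03C8C678", "4 90090006 03C8AFB8",
    "6 90090006 03C8B638", "7 A0090006 03C8A5F8", "9 F0090006 03C8C338",
    "10 A0090006 03C89C38", "11 A0090006 03C8B978", "13 A0040006 03C8A2B8",
    "14 90040006 03C89758", "17 A0040006 03C8BCB8", "18 90090006 03C89F78",
]


def decode_tcam_row(row):
    """Return (index, upper_bits, mac) for one 'index word1 word2' row."""
    fields = row.split()
    if len(fields) != 3:
        raise ValueError(f"unexpected row: {row!r}")
    index = int(fields[0])
    word1 = int(fields[1], 16)
    word2 = int(fields[2], 16)
    upper_bits = word1 >> 16                  # meaning unknown, kept as-is
    mac = ((word1 & 0xFFFF) << 32) | word2    # assumed 48-bit MAC layout
    digits = f"{mac:012x}"
    dotted = ".".join(digits[i:i + 4] for i in range(0, 12, 4))
    return index, upper_bits, dotted


if __name__ == "__main__":
    for row in ROWS:
        index, upper_bits, mac = decode_tcam_row(row)
        print(f"{index:3d}  0x{upper_bits:04X}  {mac}")
```

Decoded this way, every row in the switch 2 dump carries the 0006.03 prefix, which matches what the grep for "0006" suggested.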
You sure get a lot of complexity for your money with these boxes!
BTW, the flash memory in one of these switches failed. IOS appears to
be unable to map out a single bad block and since the flash is
on-board, we had to replace the entire box...
--
Alex