[c-nsp] "%HARDWARE-1-TCAM_ERROR: Found error in HFTM TCAM Space and not able to recover the error" + server losing default GW

Sat Mar 10 17:30:35 EST 2012

Setup: a Solaris server connected to a port on a 3750 switch.

Reported problem: the Solaris server lost the ability to communicate
over the network (checks were performed from a remote location in a
different VLAN - important to know!)

Immediate reaction - network folks engaged: switch investigation
reveals error from $subj:

%HARDWARE-1-TCAM_ERROR: Found error in HFTM TCAM Space and not able to
recover the error

so the decision was taken to reload the switch immediately

Phase II: the switch recovers with no more errors, but the monitoring
tool still reports the server as unreachable; a quick test from the
switch itself shows the server is reachable from within its own VLAN,
though (all tests = ICMP)!

Phase III: server folks finally involved - they reached the "down"
server via another one on the same VLAN, connected to the same switch,
and found the default gateway missing on the "down" server (allegedly
it had been there for the last 4xx days of uptime)

Phase IV - post-mortem monitoring: no more TCAM errors, and
(obviously) no more problems after re-adding the default GW on the
server

What we are missing: the tests at the time of the reported failure did
not include an ICMP check from within the server's own VLAN (as the
apparent problem was the TCAM error reported on the switch)
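
For what it's worth, the triage we should have done can be written
down as a tiny decision table. This is only a rough sketch: the
function name and the wording of the verdicts are made up for
illustration, and it assumes both probes are plain ICMP pings, one
from a host in the server's own VLAN and one from a remote VLAN.

```python
def diagnose(same_vlan_ok: bool, remote_ok: bool) -> str:
    """Map two ICMP results to the most likely fault domain.

    same_vlan_ok: ping from a host (or the switch) in the server's VLAN
    remote_ok:    ping from a different VLAN, i.e. through the gateway
    Both the name and the verdict strings are hypothetical.
    """
    if same_vlan_ok and remote_ok:
        return "reachable"
    if same_vlan_ok and not remote_ok:
        # L2 path works, routed path does not: look at the host's
        # default gateway or at routing before blaming the switch.
        return "suspect L3 path: host default gateway or routing"
    if not same_vlan_ok and remote_ok:
        # Should not normally happen; the local probe may be filtered.
        return "inconsistent: recheck the same-VLAN probe"
    return "suspect host, link, or switch port (L1/L2)"


# Our case after the reload: reachable locally, not remotely.
print(diagnose(same_vlan_ok=True, remote_ok=False))
```

Had we run the same-VLAN probe at the time of the failure, the second
branch would have pointed at the server's routing table rather than at
the switch.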

My question to the audience: having done a little research on the
behavior of old Solaris releases (which is what we run), I found this:

http://www.tek-tips.com/viewthread.cfm?qid=211132

and now I wonder - is it possible that the Solaris mechanism of
spewing whatever traffic once the default GW was missing caused the
TCAM issue, or, the other way around, that the TCAM issue caused the
"disappearance" of the Solaris default GW (and if so, how)?

Has anybody experienced the problem described?

***Stefan