[c-nsp] "%HARDWARE-1-TCAM_ERROR: Found error in HFTM TCAM Space and not able to recover the error" + server losing default GW

Sun Mar 11 01:06:33 EST 2012

Thanks for reply, James. I assume you meant "proxy arp shouldn't", right?

Will get more details from the server folks on Monday (if willing to
share ;-)). To me the interesting part was "all but one functional
ports (1) + TCAM errors (2) + one port w/a server having lost its
default GW (3)" - I could see (2) and (3), as (2) => (3), but then why
(1)? ... unless (3) => (2)
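
For what it's worth, the symptom split in the quoted thread (server
answers ICMP from inside its own VLAN, but not from a remote VLAN) is
what a missing default route alone would predict. A minimal sketch of
that forwarding decision, with made-up addresses and a hypothetical
helper name (reply_next_hop); it deliberately ignores proxy arp, ICMP
redirects and router discovery, so it is only a rough model:

```python
import ipaddress


def reply_next_hop(host_if, default_gw, peer):
    """Return where a host would send its reply to `peer`.

    host_if    -- host address with prefix, e.g. "10.1.20.15/24" (made up)
    default_gw -- default router address, or None if no default route exists
    peer       -- address of the machine that sent the ICMP echo request
    """
    iface = ipaddress.ip_interface(host_if)
    peer_ip = ipaddress.ip_address(peer)
    if peer_ip in iface.network:
        return "on-link"            # same subnet: ARP for the peer directly
    if default_gw is not None:
        return "via " + default_gw  # off-subnet: hand the reply to the gateway
    return "no route"               # off-subnet, no gateway: reply never leaves


# Same VLAN, no default GW: still reachable (the Phase II observation)
print(reply_next_hop("10.1.20.15/24", None, "10.1.20.1"))
# Remote VLAN, no default GW: unreachable (what monitoring saw)
print(reply_next_hop("10.1.20.15/24", None, "10.9.0.5"))
# Remote VLAN, default GW restored: reachable again
print(reply_next_hop("10.1.20.15/24", "10.1.20.1", "10.9.0.5"))
```

This says nothing about the TCAM error itself; it only shows that (3)
is enough to explain the monitoring alarm without needing (2).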

***Stefan

On Sat, Mar 10, 2012 at 9:40 PM, James S. Smith <JSmith at windmobile.ca> wrote:
> Did the Solaris system have the gateway in the defaultrouter file, or did it need to be added?
>
> It's possible that it never did have a default gateway, and your local router was doing proxy arp.  I've run into that a few times where a server isn't given the proper gateway but still ends up getting connectivity because the local router is responding to the arps.  Or perhaps someone had added the default route by cli and never added it to the defaultrouter file, and then it somehow got lost.
>
> It's an odd chain of events, but proxy arp should cause issues with the TCAM.
>
>
> ----- Original Message -----
> From: Stefan [mailto:netfortius at gmail.com]
> Sent: Saturday, March 10, 2012 05:30 PM
> To: cisco-nsp at puck.nether.net <cisco-nsp at puck.nether.net>
> Subject: [c-nsp] "%HARDWARE-1-TCAM_ERROR: Found error in HFTM TCAM Space and not able to recover the error" + server losing default GW
>
> Problem: solaris server connected to a port on a 3750 switch.
>
> Reported problem: solaris server lost capability to communicate over
> the network (checks performed from remote location / different VLAN -
> important to know!)
>
> Immediate reaction - network folks engaged: switch investigation
> reveals error from $subj:
>
> %HARDWARE-1-TCAM_ERROR: Found error in HFTM TCAM Space and not able to
> recover the error
>
> so decision taken to immediately reload the switch
>
> Phase II: switch recovers, no more errors, server still reported
> unreachable from monitoring tool; a quick test from within switch
> reveals reachability of server from within its own VLAN, though (all
> tests = ICMP)!
>
> Phase III: finally server folks involved - reached out to "down"
> server via another one, on the same VLAN, connected to the same switch
> - found missing gateway on the "down" server (allegedly there for the
> last 4xx days of uptime)
>
> Phase IV - post-mortem monitoring: no more TCAM errors but also no
> more problems (obviously) after re-adding the default GW on the server
>
> What we are missing: test at the time of reported failure in
> communication with server did not include an ICMP from within its own
> VLAN (as the apparent problem was the error reported on the switch
> TCAM)
>
> My question to the audience: having done a little research on old
> solaris behavior (as we have it), I found this:
>
> http://www.tek-tips.com/viewthread.cfm?qid=211132
>
> and now I wonder - is it possible that the solaris mechanism of spewing
> whatever traffic, when missing the default GW, caused the TCAM issue, or
> (and how come) did the TCAM issue cause the "disappearance" of the
> solaris default GW?
>
> Anybody having experienced the problem described?
>
> ***Stefan