[c-nsp] spurious NAT "disconnects"

Jeff Bacon bacon at walleyesoftware.com
Tue Sep 1 11:19:16 EDT 2009


I have a TAC case open for this, but I'll throw it out to the crowd as
well. 

Cat6500, dual-sup720-3B, SXH4. Two linecards - a 6748-GE-TX (no DFC) and
a 6816A (DFC3). 

The switch acts as a "vendor nexus" - we connect to a bunch of
exchanges, and it's the gateway between us and them. Most of the traffic
is multicast coming in on various 1Gb links, but there are also several
outbound TCP connections, of the "stay up all day" type - connect in the
morning, talk all day, shut down at 6PM. 

All of the TCP connections are NATted, generally with a different range
for each vendor (yes, someday it would be nice to present one public
routed space to all of them, but not all of them could handle it even
if we did). 


Occasionally, the 6500 appears to "confuse" one established TCP NAT
connection with another.

The effect is such that, given connection 1 between A (inside) port W
and B (outside) port X and connection 2 between C (inside) port Y and D
(outside) port Z, the switch will spontaneously start translating the
incoming packets from B (previously being correctly translated to A port
W) and send them to C port W instead of to A port W. Packets from A to
B, however, will still be translated correctly. 

The result to the hosts involved is that B still receives A's packets, B
still thinks it is sending packets to A... but A doesn't receive
anything, and eventually the socket will time out and die (in various
ways, depending on whether keepalive is set or if there's a heartbeat in
the connection protocol or whatever). 

C gets the packets, but since they are invariably addressed to an
invalid port on C, C sends an RST to B. The switch sees an RST packet
from C port W to B port X, doesn't have a NAT entry for that, presumably
says "eh, what's the point", and just drops the packet. 
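
To make the symptom concrete, here is a toy Python model of the
behaviour described above. It is only a sketch of what the hosts
observe, not of how the 6500 implements NAT. Every address, port
number and function name in it is invented for illustration, and it
assumes the translation leaves port numbers unchanged.

```python
# Toy model of the symptom only; NOT how the 6500 implements NAT.
# All addresses, ports and names below are made up for illustration.

A, C = "10.0.0.1", "10.0.0.2"                   # inside hosts
B, D = "192.0.2.10", "192.0.2.20"               # outside hosts
NAT_A, NAT_C = "198.51.100.1", "198.51.100.2"   # translated (global) addresses
W, X, Y, Z = 40001, 5001, 40002, 5002           # ports, as named in the text

# Outbound state: inside 4-tuple -> translated source address.
outbound = {
    (A, W, B, X): NAT_A,   # connection 1
    (C, Y, D, Z): NAT_C,   # connection 2
}

# Inbound state: outside 4-tuple -> inside (host, port).
inbound = {
    (B, X, NAT_A, W): (A, W),
    (D, Z, NAT_C, Y): (C, Y),
}

def translate_out(src, sport, dst, dport):
    """Return the translated outbound 4-tuple, or None if there is no entry."""
    global_addr = outbound.get((src, sport, dst, dport))
    if global_addr is None:
        return None        # no NAT entry: packet is dropped in this model
    return (global_addr, sport, dst, dport)

def translate_in(src, sport, dst, dport):
    """Return the inside (host, port) for a return packet, or None."""
    return inbound.get((src, sport, dst, dport))

def corrupt_inbound():
    """Hypothetical fault: B's return traffic is redirected to C, same port W."""
    inbound[(B, X, NAT_A, W)] = (C, W)

before = translate_in(B, X, NAT_A, W)    # delivered to A port W
corrupt_inbound()
after = translate_in(B, X, NAT_A, W)     # now delivered to C port W
a_to_b = translate_out(A, W, B, X)       # A -> B is still translated
rst_from_c = translate_out(C, W, B, X)   # C's RST has no entry: dropped
```

In this model the outbound half of connection 1 is untouched, which
matches B still receiving A's packets, while C's RST finds no entry
and never reaches B.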

No errors of any type from the switch. It just notes the connection
dropping (I have "ip nat trans syslog" set). And it's not all that
common - it hits one connection every few days. 

There are a bunch of NAT entries being created/deleted all the time from
a monitoring host, mostly ICMP watching the far-end hosts. 


This was first seen in 12.2(18)SXF15a on a sup32; I upgraded to
12.2(33)SXH4 and the problem appeared to go away. (However, we also
wrote code to detect dropped connections and restart more gracefully.)
The problem has now appeared on SXH4 on a sup720. Or appears to have;
the symptoms are the same in any event.

Weird, huh? 

I have a sniffer on the inside to watch the traffic streams so I can
capture it happening. The question is, what debugs might I enable, or
what might I poll from the switch, in order to determine what the heck
the switch is thinking when it's doing this? 
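
On the sniffer side, something like the following Python sketch could
flag the moment of misdelivery in an inside capture. It is only an
outline under assumptions: packets are taken to be already decoded
into (timestamp, src_ip, src_port, dst_ip, dst_port) tuples by
whatever capture tool is in use, outside addresses are assumed to
appear untranslated on the inside, and the function name and sample
data are hypothetical.

```python
def find_misdelivered(packets, connections):
    """Flag return packets from a tracked outside peer that arrive at an
    inside host/port which does not match any known connection.

    packets:     iterable of (ts, src_ip, src_port, dst_ip, dst_port)
    connections: iterable of (inside_ip, inside_port, outside_ip, outside_port)
    """
    known = set(connections)
    outside_endpoints = {(o_ip, o_port) for _, _, o_ip, o_port in known}
    hits = []
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        if (src_ip, src_port) not in outside_endpoints:
            continue   # not return traffic from a tracked outside peer
        if (dst_ip, dst_port, src_ip, src_port) not in known:
            hits.append((ts, (src_ip, src_port), (dst_ip, dst_port)))
    return hits

# Made-up sample: connection 1 is A<->B, connection 2 is C<->D.
connections = [
    ("10.0.0.1", 40001, "192.0.2.10", 5001),
    ("10.0.0.2", 40002, "192.0.2.20", 5002),
]
packets = [
    (1.0, "192.0.2.10", 5001, "10.0.0.1", 40001),   # B -> A, as expected
    (2.0, "192.0.2.10", 5001, "10.0.0.2", 40001),   # B -> C port W: flagged
]
for hit in find_misdelivered(packets, connections):
    print(hit)
```

The timestamp of the first hit gives a point in time to line up
against the switch's NAT syslog entries.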

-bacon


