[c-nsp] Re-thinking (remembering) how a switch operates

Wed Apr 27 22:21:55 EDT 2005

I had a most "enlightening" discovery today of a rather serious traffic
leak that has gone by unnoticed for, uh, well, an embarrassingly long
time.  After discovering the underlying reason, I thought this just
might rattle a few of your heads as it did mine :-)

One of our engineers was investigating very erratic ssh response times
on one of our central (network administration) servers.  This server
resides in our management vlan, which reaches far and wide across campus
to isolate our management traffic and access.

He reported to me seeing a very high stream of UDP traffic, which was
soon identified as syslog traffic to our logging servers.  I thought he
was nuts:  the traffic was unicast, and his server sits several hops
(switches) away from either the source or destination of the syslog
traffic.  So I cranked up
ethereal on one of my boxes, and lo and behold, syslog traffic, LOTS of
it.  Proper source and destination IPs, proper ports, verified MAC
addresses; what the heck?

Ran back to our corner of the server farm, to our KVM switch, checked
the logging servers, core switches, everything is up and healthy.
Checked the routers, and the syslog server has a proper ARP entry.
Everybody involved has the correct ARP entry.

To make a long story short, I started checking mac-address-tables (and
the CAM table on the lone CatOS Catalyst in the mix).  NOBODY has a MAC
entry for the syslog server!

The syslog server just sits and logs traffic.  As a general rule, it
never transmits anything.  The switches, therefore, only very rarely see
its MAC as a source address, so any entry they do learn ages out long
before it gets refreshed.  So we go back to basic switch operation:
when they are sent a frame with a destination MAC of the syslog server,
they don't know where it is, so they flood it out every port in the vlan
(and every trunk carrying the vlan).  For the management vlan, that
means unicast syslog traffic gets sprayed across the whole campus.  And
it's a lot of traffic:  5-10 gigs/day.

As a workaround, I added static mac table entries for the server, and
the problem went away.  And the traffic graphs for uplink trunks across
campus took a rather pleasing dip (not that it was all that significant
in the big picture, but it was a lot of unnecessary "noise" that was
previously going everywhere).
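
For anyone who wants to poke at the behaviour, here is a rough Python
sketch of a single-vlan learning switch with aging and static entries.
It is a toy model, not how any Catalyst is actually implemented:  the
ToySwitch class, the port numbers and the "syslog"/"router" names are
all made up for illustration, and it ages entries lazily at lookup time
where a real switch would use a timer.

```python
from dataclasses import dataclass, field

AGING_TIME = 300  # seconds; the IOS/CatOS default mentioned in this post


@dataclass
class ToySwitch:
    """Toy single-vlan learning switch (illustration only)."""
    ports: list
    aging_time: int = AGING_TIME
    table: dict = field(default_factory=dict)   # mac -> (port, last_seen)
    static: dict = field(default_factory=dict)  # mac -> port, never ages

    def receive(self, now, in_port, src_mac, dst_mac):
        """Process one frame; return the ports it is sent out of."""
        # Learn (or refresh) the source address on the ingress port.
        if src_mac not in self.static:
            self.table[src_mac] = (in_port, now)

        # Static entries always win and never expire.
        if dst_mac in self.static:
            return [self.static[dst_mac]]

        # A dynamic entry is only usable if it has not aged out.
        entry = self.table.get(dst_mac)
        if entry is not None and now - entry[1] <= self.aging_time:
            return [entry[0]]

        # Unknown (or aged-out) unicast: flood everywhere except ingress.
        self.table.pop(dst_mac, None)
        return [p for p in self.ports if p != in_port]


sw = ToySwitch(ports=[1, 2, 3, 4])
print(sw.receive(0, 4, "syslog", "router"))    # [1, 2, 3]  router not yet learned
print(sw.receive(10, 1, "router", "syslog"))   # [4]        syslog entry is fresh
print(sw.receive(600, 1, "router", "syslog"))  # [2, 3, 4]  entry aged out, flooded
sw.static["syslog"] = 4                        # the static-entry workaround
print(sw.receive(900, 1, "router", "syslog"))  # [4]
```

The third call is the situation we were in:  the server had not spoken
for longer than the aging time, so its unicast traffic went everywhere.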

Now afterwards, it has me thinking philosophically about the relatively
short default mac-address table aging time (300 secs is default in IOS
and CatOS, IIRC) versus the relatively long ARP cache timeout (4 hours,
i.e. 240 minutes, by default in IOS; a real long time relative to
mac-address aging).  Having the ARP cache saves you from having to do
frequent ARPs, but if you *did* ARP a little more frequently, the
replies would keep the mac-address tables loaded up.  And if the device
is down, but still in the ARP cache, anything sent to the device will
still be routed (layer-3) and then flooded (layer-2, since the switches
have no entry for it).

And adding the syslog servers to our polling list would at least
generate a periodic response from them and refresh the mac tables,
provided the polling interval is shorter than the aging time.
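
To put rough numbers on the aging arithmetic above, here is a small
Python sketch.  It assumes a simplified model where the quiet host
transmits exactly once per refresh interval (for example, only when
answering an ARP refresh or a poll) and the switch forgets it one aging
time later.  The flooded_fraction name is made up, and the 4-hour ARP
timeout and 60-second poll interval are assumed values, not measurements
from our network.

```python
def flooded_fraction(refresh_interval, aging_time=300):
    """Fraction of time the host's MAC is missing from the table.

    Assumes the host sends exactly one frame every refresh_interval
    seconds and the entry expires aging_time seconds after each frame.
    """
    if refresh_interval <= aging_time:
        return 0.0  # the entry is always refreshed before it can expire
    return (refresh_interval - aging_time) / refresh_interval


ARP_TIMEOUT = 4 * 60 * 60  # 14400 s, assumed IOS default ARP timeout
POLL_INTERVAL = 60         # hypothetical polling interval in seconds

print(f"ARP refresh only: {flooded_fraction(ARP_TIMEOUT):.1%}")    # 97.9%
print(f"polled every 60s: {flooded_fraction(POLL_INTERVAL):.1%}")  # 0.0%
```

Under those assumptions, a host that only speaks when its ARP entry is
refreshed is unknown to the switches almost all of the time, while any
periodic traffic more frequent than the aging time keeps it learned.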

Well, enough for now.  Food for thought.

Jeff