[f-nsp] Interesting problem with ServerIron GT

Gabriel Cain gabriel at popcap.com
Mon Mar 6 12:59:41 EST 2006


So I've got an interesting problem on a ServerIron GT EGC16.

I have two mail servers (running postfix) that are being load balanced in
the normal, easy way[1]. See below[2] for the software version.

Every so often, we get pages from our alerting system (Nagios).  The alerts
report that the three addresses (the two real mail servers and the VIP) are
down: unreachable.

They don't always go down at the same time, but they often do so in clusters:
one goes down, then back up, then another goes down.  Or both go down close
in time to each other, then come back up close to each other.  The events
haven't been observed to last longer than about 10 minutes.  Most of them
are only 3-4 minutes in duration.

I've checked the logging on the servers, and they show no interruption in
layer 1 connectivity (i.e., no log messages about the interfaces going down,
which would appear if the link had dropped).  ARP timeouts occurred to me as
a possibility, but I've not been able to get any conclusive data.

The log messages on the ServerIron are brief, just stating that it went
down, and back up.  No useful information :^(  Our Cat 6500 says nothing at
all in its logs during these events.

The real servers are in VLAN 20.  The Nagios system is elsewhere on our
network.  The Foundry links to our Catalyst 6509 via a trunk group of four
gig-E ports (i.e., "trunk switch ethe 3/15 to 3/16 ethe 4/15 to 4/16").

Network arch is roughly:

{corp offices with nagios probe}----[router]
				 [cat 6500]--------{Internet}
{real servers}-------------------[SIGT EGC16]

We have run tcpdump on the servers and on clients during these events.
One thing that I do notice is that ARP requests appear to come from the
Foundry's configured management IP address, rather than the VIP.  I don't
know if this is a problem or not, but it may be, as the VIP and the
management address are in different subnets.  This is also confirmed by
the log messages on the servers:

	arplookup failed: host is not on local network
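
To illustrate why that message could appear, here is a minimal Python sketch
of the on-link check a host effectively makes when it sees an ARP request:
is the sender's IP inside any subnet configured on the receiving interface?
The addresses are hypothetical documentation ranges, not our real ones, and
the function name is made up for this example; it is not how any particular
kernel implements the check.

```python
import ipaddress

def is_on_link(sender_ip, local_networks):
    """Return True if sender_ip falls inside any directly connected subnet."""
    addr = ipaddress.ip_address(sender_ip)
    return any(addr in ipaddress.ip_network(net) for net in local_networks)

# Hypothetical addressing, for illustration only.
server_networks = ["192.0.2.0/24"]   # subnet on the mail server's interface
vip = "192.0.2.10"                   # VIP, assumed to share the servers' subnet
mgmt_ip = "198.51.100.5"             # management IP, assumed to be in another subnet

for label, ip in (("VIP", vip), ("management IP", mgmt_ip)):
    verdict = "on-link" if is_on_link(ip, server_networks) else "not on local network"
    print(f"{label} {ip}: {verdict}")
```

If the ARP requests really are sourced from the management address, the
servers would treat the sender as off-link, which would be consistent with
the log line above.  Whether that is enough to explain the outages is still
an open question.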

Anyway, it's really frustrating, and I'm unsure of where to look next.

Has anyone seen this behavior before?

Thanks for the help!

[1] Configuration excerpts:   (IP subnet has been replaced with 1.2.3)

trunk switch ethe 3/15 to 3/16 ethe 4/15 to 4/16
server real mail1
 port smtp
server real mail2
 port smtp
server virtual mail-cluster
 port smtp
 bind smtp mail1 smtp mail2 smtp
vlan 20 name mail-servers by port
 tagged ethe 3/15 to 3/16 ethe 4/15 to 4/16
 untagged ethe 3/5 ethe 4/5
hostname sigt-sea-01
ip address
ip default-gateway


[2] show version:
  SW: Version 09.3.01bTD2 Copyright (c) 1996-2003 Foundry Networks, Inc.
      Compiled on Jul 07 2005 at 21:17:20 labeled as WXM09301b
      (3769367 bytes) from Primary wxm09301b.bin
  HW: ServerIronGT E-1 Switch, SYSIF version 21, Serial #: Non-exist

Slot 1 & 2 are:
SL 1: B0GMR WSM2 Management Module, SYSIF 2, M6, ACTIVE
      Serial #:   removed
    0 MB SHM, 1 Application Processors
16384 KB BRAM, SMC version 5, BM version 21
  SW: (1)09.3.01bTF2

Slots 3 & 4 are J-BxGC16 JetCore Gig Copper Module, SYSIF 2

Gabriel Cain					Senior Systems Administrator
PopCap Games						  gabriel at popcap.com
Direct: (206) 256-4243				      Mobile: (425) 418-8166
