[f-nsp] Interesting problem with ServerIron GT

Peter Clark pclark at raindance.com
Mon Mar 6 18:34:43 EST 2006


What OS are your real servers running?  Do they have multiple
interfaces?  Is the VIP using DSR?  I ask because we ran into a number
of ARP-related problems with real servers that had multiple Ethernet
interfaces and were running Linux 2.2.x and 2.4.x kernels.

-----Original Message-----
From: foundry-nsp-bounces at puck.nether.net
[mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Cliff Fogle
Sent: Monday, March 06, 2006 11:18 AM
To: Gabriel Cain; foundry-nsp at puck.nether.net
Subject: Re: [f-nsp] Interesting problem with ServerIron GT

This is just a quick guess.  But you may want to configure a server
source-ip on the subnet local to the real servers:

server source-ip <ip address> <mask> 0.0.0.0

This is done from the global configuration.  If you read the docs you
will see that this is usually used for source NAT.  But it does quite a
bit more, including sourcing keepalives from this address and possibly
ARP requests.

-----Original Message-----
From: foundry-nsp-bounces at puck.nether.net
[mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Gabriel Cain
Sent: Monday, March 06, 2006 10:00 AM
To: foundry-nsp at puck.nether.net
Subject: [f-nsp] Interesting problem with ServerIron GT


Hi.

So I've got an Interesting problem on a ServerIron GT EGC16.

I have two mail servers (running postfix) that are being load balanced
in the normal, easy way[1]. See below[2] for the software version.

Every so often, we get pages from our alerting system (nagios).  The
alerts say that the three addresses (the two real mail servers and the
VIP) are down or unreachable.

They don't always go down at the same time, but they often do in
clusters; one goes down, then back up, then another goes down.  Or both
go down close in time to each other, then come back up close to each
other.  The events haven't been observed to last longer than about 10
minutes.  Most of them are only 3-4 minutes in duration.

I've checked the logs on the servers, and they show no interruption
in layer 1 connectivity (i.e., no log messages about the interfaces
going down, which would appear if that had happened).  ARP timeouts
occurred to me as a possibility, but I've not been able to get any
conclusive data.

The log messages on the ServerIron are brief, just stating that a
server went down and came back up.  No useful information :^(  Our
cat6500 says nothing at all in its logs during these events.

The real servers are in a VLAN, vlan 20.  The nagios system is across
our network in another place.  The foundry links to our catalyst 6509
via a trunk group of four gig-E ports (i.e., "trunk switch ethe 3/15 to
3/16 ethe 4/15 to 4/16").

Network arch is roughly:

{corp offices with nagios probe}----[router]
                                        |
                                 [cat 6500]--------{Internet}
                                        |
{real servers}-------------------[SIGT EGC16]

We have run tcpdump on the servers and on clients during these events.
One thing that I do notice is that ARP requests appear to come from the
foundry's configured management IP address, rather than the VIP.  I
don't know if this is a problem or not, but it may be, as the VIP and
the management address are in different subnets.  This is also confirmed
by the log messages on the servers:

	arplookup 1.2.3.130 failed: host is not on local network
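
To make the subnet mismatch concrete, here is a minimal Python sketch.
It takes the management address and mask from the configuration
excerpts [1] and assumes that the real servers and the VIP use the same
255.255.255.192 mask, which the excerpts do not actually show.  The
names mgmt, hosts, ASSUMED_MASK, same_subnet and report are made up for
illustration only.

```python
import ipaddress

# Management address and mask, as given in the configuration excerpts.
mgmt = ipaddress.ip_interface("1.2.3.130/255.255.255.192")

# Assumption: the real servers and the VIP use the same mask as the
# management address. The excerpts do not state the servers' mask.
ASSUMED_MASK = "255.255.255.192"

hosts = {
    "mail-cluster (VIP)": "1.2.3.101",
    "mail1": "1.2.3.102",
    "mail2": "1.2.3.103",
}


def same_subnet(addr, iface):
    """Return True if addr lies inside the network of iface."""
    return ipaddress.ip_address(addr) in iface.network


def report():
    """Map each host name to (its assumed network, shares mgmt subnet?)."""
    results = {}
    for name, addr in hosts.items():
        net = ipaddress.ip_interface(f"{addr}/{ASSUMED_MASK}").network
        results[name] = (str(net), same_subnet(addr, mgmt))
    return results


if __name__ == "__main__":
    print("management network:", mgmt.network)
    for name, (net, shared) in report().items():
        print(f"{name}: network {net}, same subnet as management: {shared}")
```

Under that assumption the management address sits in 1.2.3.128/26 while
the VIP and both real servers sit in 1.2.3.64/26, which would be
consistent with the servers rejecting ARP from 1.2.3.130 as "not on
local network".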

Anyway, it's really frustrating, and I'm unsure of where to look next.

Has anyone seen this behavior before?

Thanks for the help!
Gabriel




[1] Configuration excerpts:   (IP subnet has been replaced with 1.2.3)

trunk switch ethe 3/15 to 3/16 ethe 4/15 to 4/16
!
server real mail1 1.2.3.102
 port smtp
!
server real mail2 1.2.3.103
 port smtp
!
!
server virtual mail-cluster 1.2.3.101
 port smtp
 bind smtp mail1 smtp mail2 smtp
!
vlan 20 name mail-servers by port
 tagged ethe 3/15 to 3/16 ethe 4/15 to 4/16
 untagged ethe 3/5 ethe 4/5
!
hostname sigt-sea-01
ip address 1.2.3.130 255.255.255.192
ip default-gateway 1.2.3.129



********

[2] show version:
  SW: Version 09.3.01bTD2 Copyright (c) 1996-2003 Foundry Networks, Inc.
      Compiled on Jul 07 2005 at 21:17:20 labeled as WXM09301b
      (3769367 bytes) from Primary wxm09301b.bin
  HW: ServerIronGT E-1 Switch, SYSIF version 21, Serial #: Non-exist

Slot 1 & 2 are:
SL 1: B0GMR WSM2 Management Module, SYSIF 2, M6, ACTIVE
      Serial #:   removed
    0 MB SHM, 1 Application Processors
16384 KB BRAM, SMC version 5, BM version 21
  SW: (1)09.3.01bTF2

Slots 3 & 4 are J-BxGC16 JetCore Gig Copper Module, SYSIF 2


-- 
Gabriel Cain                            Senior Systems Administrator
PopCap Games
gabriel at popcap.com
Direct: (206) 256-4243                  Mobile: (425) 418-8166

_______________________________________________
foundry-nsp mailing list
foundry-nsp at puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp

