[f-nsp] Interesting problem with ServerIron GT
Peter Clark
pclark at raindance.com
Mon Mar 6 18:34:43 EST 2006
What OS are your real servers running? Do they have multiple
interfaces? Is the VIP using DSR? I ask because we ran into a number
of ARP-related problems with real servers that had multiple Ethernet
interfaces and ran Linux 2.2.x and 2.4.x kernels.
-----Original Message-----
From: foundry-nsp-bounces at puck.nether.net
[mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Cliff Fogle
Sent: Monday, March 06, 2006 11:18 AM
To: Gabriel Cain; foundry-nsp at puck.nether.net
Subject: Re: [f-nsp] Interesting problem with ServerIron GT
This is just a quick guess, but you may want to configure a server
source-ip on the subnet local to the real servers:
server source-ip <ip address> <mask> 0.0.0.0
This is done from the global configuration. If you read the docs you
will see that this is usually for source NAT. But it does quite a bit
more, including sourcing keepalives from this address and possibly ARP
requests.
-----Original Message-----
From: foundry-nsp-bounces at puck.nether.net
[mailto:foundry-nsp-bounces at puck.nether.net] On Behalf Of Gabriel Cain
Sent: Monday, March 06, 2006 10:00 AM
To: foundry-nsp at puck.nether.net
Subject: [f-nsp] Interesting problem with ServerIron GT
Hi.
So I've got an interesting problem on a ServerIron GT EGC16.
I have two mail servers (running postfix) that are being load balanced
in the normal, easy way[1]. See below[2] for the software version.
Every so often, we get pages from our alerting system (Nagios). The
messages report that three addresses, the two real mail servers and the
VIP, are down (unreachable).
They don't always go down at the same time, but they often do in
clusters; one goes down, then back up, then another goes down. Or both
go down close in time to each other, then come back up close to each
other. The events haven't been observed to last longer than about 10
minutes. Most of them are only 3-4 minutes in duration.
I've checked the logging on the servers, and they show no interruption
in layer 1 connectivity (i.e., no log messages about the interfaces
going down, which would appear if they had). ARP timeouts occurred to
me as a possibility, but I've not been able to get any conclusive data.
The log messages on the ServerIron are brief, just stating that a
server went down and came back up. No useful information :^( Our
cat6500 says nothing at all in its logs during these events.
The real servers are in a VLAN, vlan 20. The nagios system is across
our network in another place. The foundry links to our catalyst 6509
via a trunk group of four gig-E ports (i.e., "trunk switch ethe 3/15 to
3/16 ethe 4/15 to 4/16").
Network arch is roughly:
{corp offices with nagios probe}----[router]
                                       |
                                   [cat 6500]--------{Internet}
                                       |
{real servers}-------------------[SIGT EGC16]
We have run tcpdump captures on the servers and on clients during these
events.
One thing that I do notice is that ARP requests appear to come from the
foundry's configured management IP address, rather than the VIP. I
don't know if this is a problem or not, but it may be, as the VIP and
the management address are in different subnets. This is also confirmed
by the log messages on the servers:
arplookup 1.2.3.130 failed: host is not on local network
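The subnet arithmetic behind that log message can be checked with a short script. This is only a sketch: it uses the placeholder 1.2.3 addresses and the management mask from the configuration excerpt in [1], and the helper name `in_mgmt_subnet` is made up for illustration. The real servers' own netmask is not given in this post, so the script only tests whether each address falls inside the management subnet.

```python
import ipaddress

# Management address and mask, copied from the "ip address" line in [1].
mgmt = ipaddress.ip_interface("1.2.3.130/255.255.255.192")

# Real servers and VIP, copied from the "server real" / "server virtual" lines in [1].
hosts = {
    "mail1": "1.2.3.102",
    "mail2": "1.2.3.103",
    "mail-cluster (VIP)": "1.2.3.101",
}


def in_mgmt_subnet(addr: str) -> bool:
    """Hypothetical helper: is addr inside the management address's subnet?"""
    return ipaddress.ip_address(addr) in mgmt.network


if __name__ == "__main__":
    print("management subnet:", mgmt.network)
    for name, addr in hosts.items():
        state = "inside" if in_mgmt_subnet(addr) else "outside"
        print(f"{name:20s} {addr:12s} {state} the management subnet")
```

With these addresses the management subnet works out to 1.2.3.128/26, which does not contain 1.2.3.101 through 1.2.3.103. A server that receives an ARP request sourced from 1.2.3.130 would therefore see a sender that is not on its local network, which is consistent with the arplookup message.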
Anyway, it's really frustrating, and I'm unsure of where to look next.
Has anyone seen this behavior before?
Thanks for the help!
Gabriel
[1] Configuration excerpts: (IP subnet has been replaced with 1.2.3)
trunk switch ethe 3/15 to 3/16 ethe 4/15 to 4/16
!
server real mail1 1.2.3.102
 port smtp
!
server real mail2 1.2.3.103
 port smtp
!
!
server virtual mail-cluster 1.2.3.101
 port smtp
 bind smtp mail1 smtp mail2 smtp
!
vlan 20 name mail-servers by port
 tagged ethe 3/15 to 3/16 ethe 4/15 to 4/16
 untagged ethe 3/5 ethe 4/5
!
hostname sigt-sea-01
ip address 1.2.3.130 255.255.255.192
ip default-gateway 1.2.3.129
********
[2] show version:
SW: Version 09.3.01bTD2 Copyright (c) 1996-2003 Foundry Networks, Inc.
Compiled on Jul 07 2005 at 21:17:20 labeled as WXM09301b
(3769367 bytes) from Primary wxm09301b.bin
HW: ServerIronGT E-1 Switch, SYSIF version 21, Serial #: Non-exist
Slot 1 & 2 are:
SL 1: B0GMR WSM2 Management Module, SYSIF 2, M6, ACTIVE
Serial #: removed
0 MB SHM, 1 Application Processors
16384 KB BRAM, SMC version 5, BM version 21
SW: (1)09.3.01bTF2
Slots 3 & 4 are J-BxGC16 JetCore Gig Copper Module, SYSIF 2
--
Gabriel Cain
Senior Systems Administrator
PopCap Games
gabriel at popcap.com
Direct: (206) 256-4243 Mobile: (425) 418-8166
_______________________________________________
foundry-nsp mailing list
foundry-nsp at puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp