[c-nsp] FWSM HA secondary reload & long downtime

Peter Rathlev peter at rathlev.dk
Tue Mar 10 17:10:08 EDT 2009


On Tue, 2009-03-10 at 11:32 +0100, Andrew Yourtchenko wrote:
> if it is merely a new standby that is coming up, the active should not
> stop forwarding the traffic.

That's what I would've assumed too. :-) I do seem to remember that we've
seen this before though, when we had random reboots (CSCse15099 if I
remember correctly) of secondary units resulting in downtime.

> I'd watch out for "logging standby" - I vaguely remember there were
> some issues where the newly coming up box would try to send the
> traffic with the wrong IP/MAC and/or send the gratuitous arp with the
> wrong info in there.
> 
> Especially may be true if you are bringing up primary as standby - at
> this moment the secondary/active is forwarding the traffic using the
> primary's mac addresses.

Standby logging is disabled on all contexts, so this shouldn't be the
issue. Of course if the module coming up sends out gratuitous ARPs this
could break things with the way PIX/FWSM/ASA does HA, using the same
MAC-address.

I had thought that the unit coming up would start by looking at the
failover interface(s) to see if there is already an active unit, and
then start acting as active or standby depending on what it hears.

Otherwise a crashing/rebooting primary unit would always introduce
downtime when coming up again.
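
Purely as an illustration of that assumption (not of how the FWSM is
documented to behave), the expected startup decision could be sketched
as below. The names Role and choose_startup_role are hypothetical and
exist only for this example.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def choose_startup_role(peer_role: Optional[Role]) -> Role:
    """Pick a role for a unit that has just booted.

    peer_role is what the booting unit heard on the failover
    interface; None means no peer answered at all.
    This models the assumption in the text, not actual FWSM code.
    """
    if peer_role is Role.ACTIVE:
        # An active peer already exists, so stay out of its way.
        return Role.STANDBY
    # No peer, or a peer that is not active: take over forwarding.
    return Role.ACTIVE
```

Under this model a rebooting primary that hears an active secondary
would come up as standby and never touch the forwarding path.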

> Of course, interesting would be to check if indeed this is on all the 
> contexts or only some of them, etc.

The log buffer (logging errors) on the contexts on the standby unit
(which is the configured primary) didn't say "'

From the sys context on the standby:

Mar 09 2009 16:41:04: %FWSM-4-411003: Interface statefullfailover, changed state to administratively up
Mar 09 2009 16:41:05: %FWSM-5-504001: Security context admin was added to the system
...
Mar 09 2009 16:41:07: %FWSM-5-504001: Security context sample was added to the system
Mar 09 2009 16:41:26: %FWSM-1-709006: (Primary) End Configuration Replication (STB)
Mar 09 2009 16:42:02: %FWSM-6-210022: LU missed 4837568 updates

From one of the contexts, still the standby unit:

Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface internet
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface internet waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_pro
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_pro waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface inside
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface inside waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_interfw
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_interfw waiting
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface internet normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_pro normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface inside normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_interfw normal

So the contexts on the standby weren't really activated (with "Up"
interfaces) for the whole of the downtime, only at the end of it. To me
that suggests it isn't simply a matter of the standby "stealing" the
traffic for the interfaces.
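
For anyone wanting to line up such logs against an outage window, here
is a minimal sketch that pulls the timestamps and message IDs out of
lines in the format quoted above and measures the gap between two
events. The function names are made up for this example, the sample
lines are copied from the logs in this mail (the first from the sys
context, the others from one context on the same unit), and the
timestamp parsing assumes English month abbreviations.

```python
import re
from datetime import datetime

# Matches lines such as:
# Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' ...
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}): "
    r"%FWSM-(?P<sev>\d)-(?P<msg_id>\d{6}): (?P<text>.*)$"
)


def parse_line(line):
    """Return (timestamp, severity, message id, text), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%b %d %Y %H:%M:%S")
    return ts, int(m.group("sev")), m.group("msg_id"), m.group("text")


def seconds_between(lines, first_id, second_id):
    """Seconds from the first first_id message to the first second_id one."""
    first = second = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _, msg_id, _ = parsed
        if msg_id == first_id and first is None:
            first = ts
        if msg_id == second_id and second is None:
            second = ts
    if first is None or second is None:
        return None
    return (second - first).total_seconds()


SAMPLE = [
    "Mar 09 2009 16:41:26: %FWSM-1-709006: (Primary) End Configuration Replication (STB)",
    "Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface internet",
    "Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface internet normal",
]

# End of config replication to first interface link 'Up'.
print(seconds_between(SAMPLE, "709006", "105006"))  # prints 9.0
```

The same function can be pointed at the active unit's log to see
whether anything there coincides with the start of the outage.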

It seems we have to try and replicate it in the lab to find out what
actually happened. :-)

Thank you for the input.

Regards,
Peter



