[c-nsp] FWSM HA secondary reload & long downtime

Andrew Yourtchenko ayourtch at cisco.com
Wed Mar 11 06:16:36 EDT 2009



On Tue, 10 Mar 2009, Peter Rathlev wrote:

> On Tue, 2009-03-10 at 11:32 +0100, Andrew Yourtchenko wrote:
>> if it is merely a new standby that is coming up, the active should not
>> stop forwarding the traffic.
>
> That's what I would've assumed too. :-) I do seem to remember that we've
> seen this before though, when we had random reboots of secondary units
> (CSCse15099 if I remember correctly) resulting in downtime.

Aha, interesting - so there's probably also something specific to this
setup that's contributing to the behaviour.
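
If it happens again, it would be worth capturing the failover state from
the system context on both units right away, so the role transitions can
be correlated with the downtime window. A rough checklist (availability
of "show failover history" depends on the FWSM release):

    ! From the system context, on both units:
    show failover
    ! Timestamped failover state transitions, if the image supports it:
    show failover history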

>
>> I'd watch out for "logging standby" - I vaguely remember there were
>> some issues where the box coming up would try to send traffic with
>> the wrong IP/MAC and/or send a gratuitous ARP with the wrong info
>> in it.
>>
>> This may especially be true if you are bringing up the primary as
>> standby - at that moment the secondary/active unit is forwarding
>> traffic using the primary's MAC addresses.
>
> Standby logging is disabled on all contexts, so this shouldn't be the
> issue. Of course, if the module coming up sends out gratuitous ARPs,
> this could break things, given the way PIX/FWSM/ASA does HA using the
> same MAC address.
>
> I had thought that the unit coming up would start by looking at the
> failover interface(s) to see if there is already an active unit, and
> then start acting as active or standby depending on what it hears.
>

That's precisely how it is supposed to work :-)
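
For completeness, the pieces worth double-checking here can all be
gathered from the CLI. A rough sketch, assuming 3.x-style syntax
(commands vary a bit per release, and the timer values below are only
illustrative):

    ! In each context: confirm standby logging really is off
    show running-config logging
    ! In each context: which MAC address each interface is actually using
    show interface
    ! In the system context: the hello timing that governs how quickly
    ! the booting unit detects the existing active peer
    failover polltime unit 1 holdtime 15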

> Otherwise a crashing/rebooting primary unit would always introduce
> downtime when coming up again.
>
>> Of course, interesting would be to check if indeed this is on all the
>> contexts or only some of them, etc.
>
> The log buffer (logging errors) on the contexts on the standby unit
> (which is the configured primary) didn't show anything interesting.
>
> From the sys context on the standby:
>
> Mar 09 2009 16:41:04: %FWSM-4-411003: Interface statefullfailover, changed state to administratively up
> Mar 09 2009 16:41:05: %FWSM-5-504001: Security context admin was added to the system
> ...
> Mar 09 2009 16:41:07: %FWSM-5-504001: Security context sample was added to the system
> Mar 09 2009 16:41:26: %FWSM-1-709006: (Primary) End Configuration Replication (STB)
> Mar 09 2009 16:42:02: %FWSM-6-210022: LU missed 4837568 updates
>
> From one of the contexts, still on the standby unit:
>
> Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface internet
> Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface internet waiting
> Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_pro
> Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_pro waiting
> Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface inside
> Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface inside waiting
> Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_interfw
> Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_interfw waiting
> Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface internet normal
> Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_pro normal
> Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface inside normal
> Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_interfw normal
>
> So the contexts weren't really activated (with "Up" interfaces) during
> all the downtime, just at the end. To me that seems to suggest that it's
> not simply that it "steals" the traffic for the interfaces.

Right. That was why I was wondering whether indeed "all" (as in
"absolutely all, I swear" :) the traffic was affected, or only some part
of it, and with what timing. It would also be interesting to check
whether the existing TCP connections continue to run OK (then the
problem area could be isolated to the session path and up). Of course,
in a real-world scenario there's frequently just not enough time to look
at those details, but they would definitely be helpful.

>
> It seems we have to try and replicate it in the lab to find out what
> actually happened. :-)

Yes, that would be ideal - if the issue is reproducible in the lab,
then opening a case to nail it down further would be the way to go.
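
Roughly, only the failover basics need replicating in the lab -
something like this in the system context (VLAN numbers, link names and
addresses below are made up, and the exact syntax depends on the FWSM
release):

    failover lan unit primary
    failover lan interface faillink Vlan100
    failover interface ip faillink 10.1.1.1 255.255.255.252 standby 10.1.1.2
    failover link statefullfailover Vlan101
    failover interface ip statefullfailover 10.1.2.1 255.255.255.252 standby 10.1.2.2
    failover
    ! ...mirror on the other unit with "failover lan unit secondary",
    ! then reload the secondary and watch the traffic from a test host
    ! while it comes back up.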

cheers,
andrew

