[c-nsp] FWSM HA secondary reload & long downtime

Wed Mar 11 14:14:10 EDT 2009

On Wed, 11 Mar 2009, Peter Rathlev wrote:

> Hmm... I have discovered that my original analysis was flawed. I knew
> TCP sessions without activity survived this, among others a couple of

hmm, so "no traffic during the problem" = "survival"... For those sessions 
that died in the process, it would be interesting to see the teardown 
reason in the syslog. If it's "reset-I" or "reset-O" at the time of sync, 
then the next step would be to look in more detail at where and how the 
reset comes from, i.e. iterative PCAPs.
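
A rough sketch of how the teardown reasons around the sync could be tallied 
from a syslog file. It assumes the usual 302014 "Teardown TCP connection" 
message layout with the reason at the end of the line; the sample lines, 
host name and addresses are made up for illustration, and the exact reason 
strings in your logs may differ.

```python
import re
from collections import Counter

# Assumed layout of a TCP teardown line (message 302014); adjust to your logs.
TEARDOWN = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}).*-302014: Teardown TCP connection \d+ "
    r".*?duration \S+ bytes \d+ (?P<reason>.+)$"
)

def teardown_reasons(lines, start, end):
    """Count teardown reasons for lines whose HH:MM:SS is within [start, end]."""
    reasons = Counter()
    for line in lines:
        m = TEARDOWN.search(line.strip())
        # Zero-padded HH:MM:SS strings compare correctly as plain strings.
        if m and start <= m.group("time") <= end:
            reasons[m.group("reason")] += 1
    return reasons

# Hypothetical sample lines, NOT taken from the real logs.
sample = [
    "Mar  9 16:35:00 fw %FWSM-6-302014: Teardown TCP connection 100 for outside:192.0.2.10/80 to inside:10.0.0.5/40000 duration 0:01:00 bytes 900 TCP FINs",
    "Mar  9 16:40:12 fw %FWSM-6-302014: Teardown TCP connection 101 for outside:192.0.2.10/80 to inside:10.0.0.5/40001 duration 0:00:30 bytes 1200 TCP Reset-I",
    "Mar  9 16:40:15 fw %FWSM-6-302014: Teardown TCP connection 102 for outside:192.0.2.11/22 to inside:10.0.0.6/40002 duration 0:00:30 bytes 0 SYN Timeout",
    "Mar  9 16:40:20 fw %FWSM-6-302014: Teardown TCP connection 103 for outside:192.0.2.12/443 to inside:10.0.0.7/40003 duration 0:02:10 bytes 5400 TCP Reset-O",
]

for reason, count in teardown_reasons(sample, "16:40:00", "16:41:00").items():
    print(reason, count)
```

Running this per context over the window around the sync shows quickly 
whether the resets cluster in one place.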

> SSH sessions I had going through two of the contexts to test exactly
> this. I had set up a ping job (200 ms interval) between two hosts for
> traffic crossing two contexts. This job lost 509 packets during the
> downtime, and I assumed this meant that the total time-to-recover for
> the system was ~100 seconds.
>
> This conclusion was not right though; assumption bit me again. :-D
>
> I have looked at the log-files in a more thorough and systematic
> fashion, and actually it was only traffic through one context that was
> impacted. All the contexts, including the impacted one, have logged
> continuously with no problems. When I saw SYN Timeouts on the other
> contexts too it was always for connections coming from the one with
> problems.

ahha. "Never assume anything unless you have hard proof" is a good rule 
of thumb that helps a lot. Though of course reality dictates taking 
probabilistic shortcuts at times.
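
For reference, the arithmetic behind the ~100 second figure, as a tiny 
sketch. The function name is just illustrative. Note that it only bounds 
the outage on the path the ping job crossed, which is exactly why it said 
nothing about the other contexts.

```python
def outage_seconds(lost_packets, interval_ms):
    """Approximate outage on the probed path from consecutive lost pings."""
    return lost_packets * interval_ms / 1000

# 509 lost packets at a 200 ms interval, per the quoted test.
print(round(outage_seconds(509, 200), 1), "seconds on the probed path")
```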

>
> This of course points to something else being the problem, not the FWSM.

*bling* too strong of an assumption :).

I'd not discard it right away, but from a practical standpoint the fact 
that some contexts survive the replication just fine and only one has an 
issue gives a very good jumpstart to look for differences between the 
problematic context and the others - configuration, topology, traffic 
volumes, anything. The good thing is that this is a non-intrusive 
(although a bit time-consuming) exercise.
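
For the configuration part, a minimal sketch of diffing a saved config of a 
healthy context against the problematic one. The context names and config 
lines below are placeholders, not real FWSM syntax; in practice the diff 
will also be noisy with addresses and names that legitimately differ.

```python
import difflib

def config_diff(good, bad, good_name="ctx-ok", bad_name="ctx-problem"):
    """Return unified-diff lines between two context configs given as text."""
    return list(difflib.unified_diff(
        good.splitlines(), bad.splitlines(),
        fromfile=good_name, tofile=bad_name, lineterm="",
    ))

# Placeholder config text, for illustration only.
ctx_ok = "setting-a 1\nsetting-b 2\nsetting-c 3\n"
ctx_problem = "setting-a 1\nsetting-b 9\nsetting-c 3\n"

for line in config_diff(ctx_ok, ctx_problem):
    print(line)
```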

> Or at least that it is something concerning the context, not the
> hardware or system configuration.

yup, with this I totally agree.

>
> The switch housing the FWSM logged some things during the bootup, but
> only the usual stuff and only before the break:
>
> Mar  9 16:37:00.303: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)
> Mar  9 16:39:01.552: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
> Mar  9 16:39:02.572: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimum Online Diagnostics...
> Mar  9 16:39:04.624: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
> Mar  9 16:39:05.076: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
> Mar  9 16:39:12.079: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
> Mar  9 16:39:22.579: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
>
> The break was just around 16:40:10. Nothing in those messages makes me
> concerned.

+1

>
> I think we'll let it rest here. It was a much lower impact than we
> thought, so no biggie.
>

Indeed - though it would still be good to nail it down. Latent problems 
are worse in the sense that they usually pop up at the worst possible time, 
in combination with something else, and complicate life a lot. It's like 
this: walking on the rails is not a huge issue in practice, and a 
high-speed train passing by is not an issue either, but the two events 
combined are highly undesirable :)

> And Andrew: Thank you very much for your input. :-)

My pleasure :)

cheers,
andrew