[c-nsp] FWSM HA secondary reload & long downtime

Peter Rathlev peter at rathlev.dk
Wed Mar 11 13:47:06 EDT 2009


On Wed, 2009-03-11 at 11:16 +0100, Andrew Yourtchenko wrote:
> > So the contexts weren't really activated (with "Up" interfaces)
> > during all the downtime, just at the end. To me that seems to
> > suggest that it's not just simply that it "steals" the traffic for
> > the interfaces.
> 
> Right. That was why I was wondering whether indeed "all" (as in 
> "absolutely all, i swear" :) the traffic was affected, or only some part 
> of it, and with what timing. Also interesting thing to check if the 
> existing TCP connections continue to run ok (so then the problem area 
> could be isolated to session path and up). Of course, in the real-world 
> scenario there's frequently simply not enough time to look at those 
> details, but they might be definitely helpful.

Hmm... I have discovered that my original analysis was flawed. I knew
TCP sessions without activity survived this, among others a couple of
SSH sessions I had going through two of the contexts to test exactly
this. I had set up a ping job (200 ms interval) between two hosts for
traffic crossing two contexts. This job lost 509 packets during the
downtime, and I assumed this meant that the total time-to-recover for
the system was ~100 seconds.
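
For reference, the arithmetic behind that ~100 second figure can be
sketched as below. This is only a rough check: it assumes the pings were
sent at a fixed 200 ms interval and that all 509 losses fell in one
contiguous outage. The function name is made up for illustration.

```python
def outage_seconds(lost_packets: int, interval_ms: int) -> float:
    """Estimate outage length from pings lost at a fixed send interval.

    Assumes every lost packet belongs to a single contiguous outage,
    so the estimate is simply count * interval.
    """
    return lost_packets * interval_ms / 1000.0


# Figures from the ping job described above: 509 lost, 200 ms interval.
estimate = outage_seconds(509, 200)
print(f"estimated outage: {estimate:.1f} s")
```

Note that this only measures the one path the ping took (across two
contexts), not the recovery time of the system as a whole.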

This conclusion was not right though; assumption bit me again. :-D

I have looked at the log-files in a more thorough and systematic
fashion, and actually it was only traffic through one context that was
impacted. All the contexts, including the impacted one, have logged
continuously with no problems. When I saw SYN Timeouts on the other
contexts too it was always for connections coming from the one with
problems.

This of course points to something else being the problem, not the FWSM.
Or at least that it is something concerning the context, not the
hardware or system configuration.

The switch housing the FWSM logged some things during the bootup, but
only the usual stuff and only before the break:

Mar  9 16:37:00.303: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)
Mar  9 16:39:01.552: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
Mar  9 16:39:02.572: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimum Online Diagnostics...
Mar  9 16:39:04.624: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
Mar  9 16:39:05.076: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
Mar  9 16:39:12.079: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
Mar  9 16:39:22.579: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330

The break was just around 16:40:10. Nothing in those messages makes me
concerned.
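
To put numbers on "only before the break", here is a small sketch that
computes the gaps between the log timestamps above. The assumptions: the
year is 2009 (the log lines carry none), the break started at exactly
16:40:10.000 (it was only "just around" that time), and the month
abbreviation is parsed in an English/C locale. The helper names are
invented for this example.

```python
from datetime import datetime

# Timestamps copied from the switch log above.
POWER_OFF = "Mar  9 16:37:00.303"     # %C6KPWR-SP-4-DISABLED
CARD_ONLINE = "Mar  9 16:39:05.076"   # %OIR-SP-6-INSCARD
LAST_MESSAGE = "Mar  9 16:39:22.579"  # last %PM_SCP-SP-4-UNK_OPCODE
BREAK_START = "Mar  9 16:40:10.000"   # approximate, assumed value


def parse(ts: str, year: int = 2009) -> datetime:
    """Parse a syslog-style timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S.%f")


def gap_seconds(start: str, end: str) -> float:
    """Seconds elapsed between two log timestamps."""
    return (parse(end) - parse(start)).total_seconds()


reload_time = gap_seconds(POWER_OFF, CARD_ONLINE)
quiet_time = gap_seconds(LAST_MESSAGE, BREAK_START)
print(f"power-off to card online: {reload_time:.3f} s")
print(f"last log message to break: {quiet_time:.3f} s")
```

With these inputs the module took about 125 seconds from power-off to
"interfaces are now online", and the switch had been silent for roughly
47 seconds when the break started.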

I think we'll let it rest here. It was a much lower impact than we
thought, so no biggie.

And Andrew: Thank you very much for your input. :-)

Regards,
Peter





