[c-nsp] FWSM HA secondary reload & long downtime
Andrew Yourtchenko
ayourtch at cisco.com
Wed Mar 11 14:14:10 EDT 2009
On Wed, 11 Mar 2009, Peter Rathlev wrote:
> Hmm... I have discovered that my original analysis was flawed. I knew
> TCP sessions without activity survived this, among others a couple of
Hmm, so "no traffic during the problem" = "survival"... For those sessions
that died in the process, it would be interesting to see the teardown reason
in the syslog. If it's "reset-I" or "reset-O" at the time of sync, then the
next step would be to look in more detail at where and how the reset comes
from, using iterative packet captures (PCAP).
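The syslog check described above can be sketched roughly as follows (Python). This is only an illustration: the regex assumes the usual "Teardown TCP connection ... bytes N <reason>" layout of message 302014, the sample lines are made up, and the function name and time window are hypothetical. Verify the layout against what the firewall really logs before trusting the counts.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout of a "Teardown TCP connection" syslog line (not taken
# from the real logs in this thread).
TEARDOWN_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s.*"
    r"Teardown TCP connection \d+ .* bytes \d+ (?P<reason>.+)$"
)


def teardown_reasons(lines, start, end, year=2009):
    """Count teardown reasons for lines with start <= timestamp <= end."""
    counts = Counter()
    for line in lines:
        m = TEARDOWN_RE.match(line.strip())
        if m is None:
            continue  # not a TCP teardown message
        stamp = " ".join(m.group("ts").split())  # collapse "Mar  9" padding
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if start <= when <= end:
            counts[m.group("reason").strip()] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    "Mar  9 16:35:00 fw %FWSM-6-302014: Teardown TCP connection 90 for "
    "outside:192.0.2.10/40001 to inside:198.51.100.5/80 "
    "duration 0:00:30 bytes 0 SYN Timeout",
    "Mar  9 16:40:11 fw %FWSM-6-302014: Teardown TCP connection 101 for "
    "outside:192.0.2.10/40002 to inside:198.51.100.5/22 "
    "duration 0:10:02 bytes 4096 TCP Reset-I",
    "Mar  9 16:40:12 fw %FWSM-6-302014: Teardown TCP connection 102 for "
    "outside:192.0.2.11/40003 to inside:198.51.100.5/22 "
    "duration 0:08:40 bytes 2048 TCP Reset-I",
    "Mar  9 16:40:13 fw %FWSM-6-302014: Teardown TCP connection 103 for "
    "outside:192.0.2.12/40004 to inside:198.51.100.6/443 "
    "duration 0:01:15 bytes 512 TCP Reset-O",
    "Mar  9 16:40:14 fw %FWSM-6-302013: Built inbound TCP connection 104 for "
    "outside:192.0.2.13/40005 to inside:198.51.100.6/443",
]

window_start = datetime(2009, 3, 9, 16, 40, 0)
window_end = datetime(2009, 3, 9, 16, 41, 0)
for reason, count in teardown_reasons(SAMPLE, window_start, window_end).items():
    print(f"{reason}: {count}")
```

A spike of Reset-I or Reset-O inside the window would point at where to place the first capture; a window full of timeouts would suggest looking elsewhere.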
> SSH sessions I had going through two of the contexts to test exactly
> this. I had set up a ping job (200 ms interval) between two hosts for
> traffic crossing two contexts. This job lost 509 packets during the
> downtime, and I assumed this meant that the total time-to-recover for
> the system was ~100 seconds.
>
> This conclusion was not right though; assumption bit me again. :-D
>
> I have looked at the log-files in a more thorough and systematic
> fashion, and actually it was only traffic through one context that was
> impacted. All the contexts, including the impacted one, have logged
> continuously with no problems. When I saw SYN Timeouts on the other
> contexts too, it was always for connections coming from the one with
> problems.
Aha. "Never assume anything unless you have hard proof" is a good rule
of thumb that helps a lot. Though of course reality dictates
probabilistic shortcuts at times.
>
> This of course points to something else being the problem, not the FWSM.
*bling* too strong an assumption :).
I'd not rule out the FWSM right away. But from a practical standpoint, the
fact that some contexts survive the replication just fine and only one has
an issue gives a very good head start in looking for differences
between the problematic context and the others: configuration, topology,
traffic volumes, anything. The good part is that this is a non-intrusive
(although a bit time-consuming) exercise.
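For the configuration part of that comparison, a plain diff of a healthy context against the problematic one is a cheap first pass. A minimal sketch in Python using the standard difflib module; the config snippets, the context names, and the choice to ignore "!" separator lines are all assumptions for illustration, not taken from the actual setup.

```python
import difflib


def normalize(config_text):
    """Drop blank lines and '!' separator lines; keep indentation, which is significant."""
    lines = []
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("!"):
            continue
        lines.append(line)
    return lines


def diff_contexts(good, bad, good_name="ctx-good", bad_name="ctx-bad"):
    """Return unified-diff lines between two context configs (empty list if identical)."""
    return list(
        difflib.unified_diff(
            normalize(good),
            normalize(bad),
            fromfile=good_name,
            tofile=bad_name,
            lineterm="",
            n=0,  # no context lines: show only what differs
        )
    )


# Made-up configs, for illustration only.
GOOD = """\
interface Vlan100
 nameif inside
 security-level 100
!
timeout conn 1:00:00
"""

BAD = """\
interface Vlan100
 nameif inside
 security-level 100
!
timeout conn 0:05:00
"""

for line in diff_contexts(GOOD, BAD):
    print(line)
```

Configuration is only one of the axes; topology and traffic volumes would need their own comparison, but the same "healthy versus problematic" framing applies.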
> Or at least that it is something concerning the context, not the
> hardware or system configuration.
yup, with this I totally agree.
>
> The switch housing the FWSM logged some things during the bootup, but
> only the usual stuff and only before the break:
>
> Mar 9 16:37:00.303: %C6KPWR-SP-4-DISABLED: power to module in slot 2 set off (Reset)
> Mar 9 16:39:01.552: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
> Mar 9 16:39:02.572: %DIAG-SP-6-RUN_MINIMUM: Module 2: Running Minimum Online Diagnostics...
> Mar 9 16:39:04.624: %DIAG-SP-6-DIAG_OK: Module 2: Passed Online Diagnostics
> Mar 9 16:39:05.076: %OIR-SP-6-INSCARD: Card inserted in slot 2, interfaces are now online
> Mar 9 16:39:12.079: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
> Mar 9 16:39:22.579: %PM_SCP-SP-4-UNK_OPCODE: Received unknown unsolicited message from module 2, opcode 0x330
>
> The break was just around 16:40:10. Nothing in those messages makes me
> concerned.
+1
>
> I think we'll let it rest here. It was a much lower impact than we
> thought, so no biggie.
>
Indeed, though it would still be good to nail it down. Latent problems
are worse in the sense that they usually pop up at the worst possible time,
in combination with something else, and complicate life a lot. It's
like this: walking on the rails is not a huge issue in itself, and a
high-speed train passing by is not an issue either, but the two events
combined are highly undesirable :)
> And Andrew: Thank you very much for your input. :-)
My pleasure :)
cheers,
andrew