[c-nsp] OSPF OOB Resync and peer stuck in EXSTART (SeqNumberMismatch)

John Neiberger jneiberger at gmail.com
Sat Feb 9 00:05:06 EST 2013


I think I may have found a clue in the informational RFC for OOB Resync (RFC 4811):

   When a DBD packet is received with the R-bit set and the sender is
   known to be OOB-incapable, the packet should be dropped and a
   SeqNumber-Mismatch event should be generated for the neighbor.


My router must have received a DBD from the firewall with the R-bit set,
which means the firewall was participating in OOB resync. However, if the
router had not previously recognized the firewall as capable of OOB
Resync, it would have dropped the packet and logged a sequence number
mismatch.
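
Here is a rough Python sketch of how I read that rule. It is not how IOS or
any other implementation actually does it. The Neighbor class, the
oob_capable flag, and receive_dbd() are names made up for illustration, and
I am assuming a neighbor counts as OOB-capable only once it has advertised
the LR-bit.

```python
from dataclasses import dataclass, field


@dataclass
class Neighbor:
    """Hypothetical per-neighbor state; not taken from any real implementation."""
    router_id: str
    # Assumption: True only after the neighbor has advertised the LR-bit.
    oob_capable: bool = False
    events: list = field(default_factory=list)


def receive_dbd(nbr: Neighbor, r_bit: bool) -> bool:
    """Return True if the DBD is accepted for further processing.

    Models only the single rule quoted from the RFC: a DBD with the R-bit
    set from a neighbor known to be OOB-incapable is dropped, and a
    SeqNumberMismatch event is generated for that neighbor.
    """
    if r_bit and not nbr.oob_capable:
        nbr.events.append("SeqNumberMismatch")
        return False
    return True
```

If the firewall keeps sending DBDs with the R-bit set while the router still
considers it OOB-incapable, every one of them would be dropped this way,
which would fit a neighbor that loops in EXSTART with repeated mismatches.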

That may explain part of what we were seeing. Several questions now remain:

1. What triggered the OOB resync in the first place?
2. If the firewall isn't capable of doing OOB resync, why would it send DBD
packets with the R-bit set? (Perhaps it is capable and just wasn't
previously setting the LR-bit in its Hello packets.)

John


On Fri, Feb 8, 2013 at 9:28 PM, John Neiberger <jneiberger at gmail.com> wrote:

> This is a new one on me. We had a situation where OSPF between a router
> and a firewall seemed to go insane and it involves something I've never
> heard of before: Out of band Resync. Here are the logs from the beginning
> of the event:
>
> Feb  8 23:32:45.777 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from FULL to EXSTART, OOB-Resynchronization
> Feb  8 23:32:50.777 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXCHANGE, Negotiation Done
> Feb  8 23:34:49.830 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXCHANGE to DOWN, Neighbor Down: Too many retransmissions
> Feb  8 23:35:49.830 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from DOWN to DOWN, Neighbor Down: Ignore timer expired
> Feb  8 23:35:50.790 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from DOWN to INIT, Received Hello
> Feb  8 23:35:50.790 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from INIT to 2WAY, 2-Way Received
> Feb  8 23:35:50.790 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from 2WAY to EXSTART, AdjOK?
> Feb  8 23:35:50.810 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXSTART, SeqNumberMismatch
> Feb  8 23:36:00.814 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXSTART, SeqNumberMismatch
> Feb  8 23:36:10.814 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXSTART, SeqNumberMismatch
> Feb  8 23:36:25.814 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXSTART, SeqNumberMismatch
> Feb  8 23:36:30.818 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7
> from EXSTART to EXSTART, SeqNumberMismatch
>
> Something happens to trigger an out-of-band resync and then the neighbor
> gets stuck in EXSTART because of a sequence number mismatch. I first
> thought we had an MTU mismatch, but the MTUs seem to check out. I read
> somewhere that sequence number mismatches can be caused by a software
> error. This just isn't something I've run into before.
>
> First, I don't know what OOB Resynchronization is or what all it entails,
> so I'm going to read some more about that to find out what triggers it and
> what it is supposed to be doing under the hood. Second, why would a peer
> that had been working just fine suddenly divebomb into the ground and then
> get stuck in EXSTART?
>
> We ultimately resolved the problem by clearing the OSPF process a couple
> of times. Eventually all seemed to clear up and things are working fine. I
> suspect a buggy OSPF implementation on the firewall but that's really just
> a guess. The router is running 12.2(33)SRE3 code, which I think has a
> pretty mature OSPF implementation.
>
> Any thoughts?
>
> Thanks,
> John
>
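
For anyone who wants to tally the adjacency transitions in logs like the
ones quoted above, here is a minimal Python sketch. The regular expression
is an assumption based only on the %OSPF-5-ADJCHG lines shown in this
thread, and it expects each message on a single line (the mail client
wrapped them). parse_adjchg() and tally() are hypothetical helper names.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this thread; other IOS releases
# may format the message differently.
ADJCHG = re.compile(
    r"%OSPF-5-ADJCHG: Process (?P<proc>\d+), Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>\w+) to (?P<new>\w+), (?P<reason>.+)$"
)


def parse_adjchg(line):
    """Return (old_state, new_state, reason), or None if the line does not match."""
    m = ADJCHG.search(line)
    if m is None:
        return None
    return m.group("old"), m.group("new"), m.group("reason").strip()


def tally(lines):
    """Count how often each (old_state, new_state, reason) transition occurs."""
    return Counter(t for t in map(parse_adjchg, lines) if t is not None)


# Four of the messages from the quoted logs, unwrapped onto single lines.
SAMPLE = [
    "Feb  8 23:32:45.777 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7 from FULL to EXSTART, OOB-Resynchronization",
    "Feb  8 23:35:50.790 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7 from 2WAY to EXSTART, AdjOK?",
    "Feb  8 23:35:50.810 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7 from EXSTART to EXSTART, SeqNumberMismatch",
    "Feb  8 23:36:00.814 UTC: %OSPF-5-ADJCHG: Process 100, Nbr 1.2.3.4 on Vlan7 from EXSTART to EXSTART, SeqNumberMismatch",
]

if __name__ == "__main__":
    for transition, count in tally(SAMPLE).items():
        print(count, transition)
```

A high count for the EXSTART to EXSTART transition with reason
SeqNumberMismatch is the loop described in this thread.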

