[c-nsp] OSPF OOB Resync and peer stuck in EXSTART (SeqNumberMismatch)

Sat Feb 9 11:41:51 EST 2013

Here's another interesting tidbit I found while researching this
synchronization problem:

------
"When the OSPF in a router restarts, neighbors with adjacencies to the
restarting router learn about the OSPF restart. How does a neighboring
router find out about the OSPF restart? A neighbor detects the restart
event on an adjacency when the associated hold timer expires, or when it
receives a Hello packet containing incomplete information. For example, the
Hello does not mention the receiving router in the neighbor list.

When neighbors learn about the OSPF restart, they cycle their adjacencies
to the restarting router through the down state. To keep the LSDB
synchronized, a change in adjacency state (up to down) forces the
neighboring routers to advertise new LSAs. The complete reliable flooding
and installation of these LSAs in the LSDB will force the SPF to run in the
entire area or routing domain.

In the original OSPF specification, the main objective of this protocol
behavior was reducing the possibility of incorrect forwarding by routing
around the restarting router while its database is being resynchronized.
This OSPF protocol behavior is necessary in the case of a restarting router
that is incapable of preserving its FIB across the restart. This inability
is due to the fact that if traffic is allowed to pass through such a
restarting router, there is an increased likelihood of incorrect forwarding
because of an incomplete database and the FIB. Therefore, to reduce the
possibility of incorrect forwarding, such as routing loops and black holes,
OSPF deliberately routes around the restarting router."
------
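
To make the two detection conditions in the quote concrete, here is a rough
Python sketch. The names (Hello, neighbor_looks_restarted) and the router ID
5.6.7.8 are made up for illustration, and the 40-second dead interval is just
an assumed value; this is not how any particular OSPF implementation is
structured.

```python
from dataclasses import dataclass, field


@dataclass
class Hello:
    router_id: str                               # sender of the Hello
    neighbors: set = field(default_factory=set)  # router IDs the sender has heard


def neighbor_looks_restarted(my_router_id, last_hello_time, now,
                             dead_interval, hello=None):
    """Return a reason string if the neighbor appears to have restarted,
    otherwise None. Hypothetical helper, for illustration only."""
    # Condition 1 from the quote: the hold (dead) timer ran out.
    if now - last_hello_time >= dead_interval:
        return "hold timer expired"
    # Condition 2 from the quote: a Hello arrived that does not list us.
    if hello is not None and my_router_id not in hello.neighbors:
        return "hello does not list us"
    return None


# A Hello from 1.2.3.4 that only lists some other router (9.9.9.9).
h = Hello("1.2.3.4", {"9.9.9.9"})
print(neighbor_looks_restarted("5.6.7.8", last_hello_time=100.0, now=105.0,
                               dead_interval=40.0, hello=h))
# prints "hello does not list us"
```

In our case the first condition never seems to fire, which is why the second
one (or something that looks like it to the router) is the more interesting
suspect.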

The adjacency between our router and the firewall never times out, so it
doesn't appear that we have communication problems. Something is just
triggering a resynchronization of the LSDB. I believe this corresponds with
the first log message the router sees:

Feb  8 23:32:45.777 UTC: %OSPF-5-ADJCHG: Process 65300, Nbr 1.2.3.4 on
Vlan7 from FULL to EXSTART, OOB-Resynchronization

The OSPF spec says that if a neighbor starts sending Hello packets that do
not list the receiving router as a neighbor, the receiving router should
change the state of that relationship to DOWN. However, since both the
firewall and the router are advertising that they're capable of OOB Resync,
maybe the router puts it into EXSTART state instead. Subsequent messages
from the firewall apparently (this is an assumption) do not have the R-bit
set, which is why the router logs a sequence number mismatch and then
ignores the packets.

If the above is correct then it seems that something is causing the
firewall to restart OSPF, or at least behave in a way that makes the router
think it is restarting. It's really difficult to tell. I'd never even heard
of OOB Resync until last night, so much of this is new to me.
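
The sequence I'm guessing at can be sketched in Python as follows. This
models the behavior as described in this thread, not the RFC text verbatim,
and every name in it (Neighbor, Dbd, start_oob_resync, handle_dbd) is
hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class NbrState(Enum):
    FULL = "FULL"
    EXSTART = "EXSTART"


@dataclass
class Neighbor:
    state: NbrState = NbrState.FULL
    lr_capable: bool = False   # neighbor set the LR-bit in its Hellos
    oob_resync: bool = False   # an OOB resync is in progress


@dataclass
class Dbd:
    r_bit: bool                # R-bit carried in the DBD packet


def start_oob_resync(nbr):
    """Move a FULL, LR-capable neighbor to EXSTART instead of DOWN.
    Returns True if an OOB resync was started."""
    if nbr.state is NbrState.FULL and nbr.lr_capable:
        nbr.oob_resync = True
        nbr.state = NbrState.EXSTART
        return True
    return False


def handle_dbd(nbr, dbd):
    """Assumed rule from this thread: during an OOB resync, a DBD without
    the R-bit is dropped and treated as a sequence number mismatch."""
    if nbr.oob_resync and not dbd.r_bit:
        return "seq_number_mismatch"
    return "accept"


nbr = Neighbor(lr_capable=True)
start_oob_resync(nbr)
print(nbr.state.value, handle_dbd(nbr, Dbd(r_bit=False)))
# prints "EXSTART seq_number_mismatch"
```

If the firewall really is sending DBDs with the R-bit clear, this would line
up with both the FULL to EXSTART log entry and the mismatch that follows it.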

On Sat, Feb 9, 2013 at 8:26 AM, John Neiberger <jneiberger at gmail.com> wrote:

> It's a new-ish Checkpoint firewall, but I have no idea what code it is
> running. I was sent a snippet of their logs and I see a lot of the
> following:
>
> OSPF LSA: different instance of lsa on retranmission list received: type
> RTR
> OSPF LSA: different instance of lsa on retranmission list received: type
> NTW
>
> I verified that the firewall *is* setting the LR-bit indicating that it is
> capable of OOB Resync. The RFC says that if the LR-bit is set in the hello
> messages, DBD packets should have the R-bit set during an OOB Resync. If
> those DBD packets do not have the R-bit set, the receiving device is
> supposed to drop them and log a sequence number mismatch. I suspect that is
> what happened here. It looks like "something" is causing their database to
> become unsynchronized and the firewall triggers an OOB Resync but then
> doesn't set the R-bit in the DBD packets. I'm not exactly sure, though.
> That's just what I'm thinking, so far.
>
>
> On Sat, Feb 9, 2013 at 3:25 AM, Phil Mayers <p.mayers at imperial.ac.uk> wrote:
>
>> On 02/09/2013 05:05 AM, John Neiberger wrote:
>>
>>> 1. What triggered the OOB resync in the first place?
>>>
>>
>> I assume there is nothing in the logs for the device, or adjacent
>> devices, at the time?
>>
>>
>>> 2. If the firewall isn't capable of doing OOB resync, why would it send
>>> DBD packets with the R-bit set? (Perhaps it is capable and just wasn't
>>> previously setting the LR-bit in hello messages)
>>>
>>
>> You didn't specify the model of the firewall and its software version,
>> so it's difficult to say.
>>
>> _______________________________________________
>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>
>
>