[c-nsp] IS-IS LSP Generation/Expiry + Database Optimization - Issue

Thu Feb 26 06:56:58 EST 2009

Mark Tinka <mailto:mtinka at globaltransit.net> wrote on Sunday, February
22, 2009 16:31:

>> I've "worked" with the increased lifetime/refresh
>> intervals in several large networks for the last 8 years,
>> and I've not seen an issue with it. Do you have any
>> indication that the problem you've been experiencing is
>> caused by "corrupt" LSPs?
> 
> Admittedly, we haven't sat down to really analyze and debug
> the flow of LSPs (or lack thereof), as each time it
> happens, we can't afford this luxury; the router has to be
> online in the shortest time possible (and I can't replicate
> this exactly in the lab as we don't have enough of the exact
> spare kit to do so at the moment).

Right.. But a "show isis database detail [xxxxx.yy-zz]" dump into a file
would have allowed more conclusions ;-)

>> It is strange that you only
>> seem to see the problem on some routers, and not on
>> others, which makes a "corrupt" LSP advertised by the
>> restarting router a bit unlikely..
> 
> We've only seen the issue on recovering routers that were
> previously part of the IS-IS domain. As mentioned, routers
> that are new to the domain come up fine the first time.

Well, I meant something different: If I understand your description
correctly, only some routers in your network have problems reaching the
restarting node, while others can reach it just fine. Is this the case
or not? If the issue is indeed a "wrong" LSP in the IS-IS domain, I
would expect all nodes to see this "wrong" LSP.

> 
>> I would still recommend the higher lifetime values,
>> however the original reason (reducing the "chatter") is
>> certainly much less important these days with high-speed
>> CPU and links, so I'm not passionate about it..
> 
> Clearly, even though we did reduce the lifetime and refresh
> timers, we would still need to wait "that long" before the
> link database is cleaned out. And since we need the
> restarting router to be firing on all cylinders when it
> returns to the network, it doesn't matter whether the
> database will be refreshed in 18 minutes or 18 hours - we
> need uptime the moment the router is able to start
> processing frames/packets.

Right..

> So in that respect, keeping these values at "wherever"
> they need to be to scale IS-IS is fine. We just need to
> figure out why the recovering router does not "properly"
> signal the DIS to refresh its link state database upon a
> successful initialization of the IS-IS process.

I'm not sure if this is really the case:

1) We can generally assume that a reloading router will essentially
advertise the same information it did before it crashed. So even if the
restarting router didn't have a chance to purge its LSPs before it went
down, the "stale" LSP will very likely still reflect the correct
information. Even if a remote node didn't receive the "new" LSP from
the recovering node, it would be able to reach the recovering node once
its neighbors started to advertise the adjacency again.

2) If there are problems getting the new LSP out (for example after a
controlled reload where the router was able to purge the LSPs), we would
likely see most (if not all) nodes not being able to reach it.
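
To make the reasoning in 1) concrete: SPF only uses a link when both
ends advertise it (the two-way connectivity check), so a stale LSP from
the restarting node becomes usable again as soon as a neighbor
re-advertises the adjacency. A minimal Python sketch under stated
assumptions: the node names, the dict layout and the reachable() helper
are made up for illustration and do not represent router output.

```python
from collections import deque

def reachable(lsdb, source):
    """Return the set of nodes reachable from source.

    lsdb maps a node name to the set of neighbors listed in that
    node's LSP. A link is followed only if BOTH ends list each other
    (two-way connectivity check).
    """
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in lsdb.get(node, set()):
            if nbr in seen:
                continue
            # Skip the link unless the neighbor also advertises us.
            if node in lsdb.get(nbr, set()):
                seen.add(nbr)
                queue.append(nbr)
    return seen

# Hypothetical topology A - B - R, where R is the restarting router.
# R's old LSP was never purged and still lists B.
lsdb_down = {"A": {"B"}, "B": {"A"}, "R": {"B"}}       # B dropped R
lsdb_up = {"A": {"B"}, "B": {"A", "R"}, "R": {"B"}}    # B lists R again

print(sorted(reachable(lsdb_down, "A")))  # ['A', 'B']
print(sorted(reachable(lsdb_up, "A")))    # ['A', 'B', 'R']
```

In this toy model R's LSP is identical in both cases; only B's LSP
changes, which is the point of 1).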

> I will say that we have the 'ignore-lsp-errors' feature
> enabled. Given its purpose, could that have an adverse
> effect on a recovering router's capability to effectively
> get its new LSPs out to the DIS?

I don't think so..

You mention the DIS: Is this only happening on broadcast segments?

But this is all a bit too speculative for me. We should really get a
complete database output from the recovering node and from one of the
nodes not being able to reach it (and possibly one from a node which is
able to reach it), and work from there.. "show ip route <...>" would
also help..
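
Once the dumps are in files, comparing them can be scripted. The
following is a rough Python sketch, not a finished tool. It assumes the
usual summary layout of "show isis database" (LSPID, LSP Seq Num, LSP
Checksum, LSP Holdtime columns), which may differ between releases. The
hostnames and values in the sample text are invented, and parse_lsdb()
and diff_lsdb() are hypothetical helper names.

```python
import re

# Assumed line layout: LSPID, optional '*', seq num, checksum, holdtime.
LSP_LINE = re.compile(
    r"^(?P<lspid>\S+\.[0-9A-Fa-f]{2}-[0-9A-Fa-f]{2})\s*\*?\s+"
    r"(?P<seq>0x[0-9A-Fa-f]+)\s+(?P<cksum>0x[0-9A-Fa-f]+)\s+(?P<hold>\d+)"
)

def parse_lsdb(text):
    """Map LSP ID -> (sequence number, checksum) from a summary dump."""
    db = {}
    for line in text.splitlines():
        m = LSP_LINE.match(line.strip())
        if m:
            db[m.group("lspid")] = (int(m.group("seq"), 16), m.group("cksum"))
    return db

def diff_lsdb(a, b):
    """List LSP IDs whose (seq, checksum) differ or exist on one side only.

    Holdtime is ignored on purpose: it differs on every node anyway.
    """
    return [
        (lspid, a.get(lspid), b.get(lspid))
        for lspid in sorted(set(a) | set(b))
        if a.get(lspid) != b.get(lspid)
    ]

# Invented sample dumps from a node that can and one that cannot reach
# the recovering router (here called edge7).
good = """\
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
core1.00-00           0x0000002A   0x1F3C        1021              0/0/0
edge7.00-00           0x00000003   0x9A10        1150              0/0/0
"""
bad = """\
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime      ATT/P/OL
core1.00-00           0x0000002A   0x1F3C        980               0/0/0
edge7.00-00           0x000001B4   0x77E2        412               0/0/0
"""

for lspid, left, right in diff_lsdb(parse_lsdb(good), parse_lsdb(bad)):
    print(lspid, left, right)
```

Any LSP ID that shows up in the diff is where I would start looking in
the "detail" output.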

	oli