[c-nsp] IS-IS max-area-addresses

Wed Apr 18 19:47:19 EDT 2007

I had to raise the maximum number of IS-IS area addresses across our 
network last night.  Apparently max-area-addresses is one of those 
settings that must match across IS-IS neighbors for adjacencies to 
build, or so it seemed.  I raised it to 254, though I only need a couple 
dozen.  I did this on 3 3800s, 2 2800s, 1 7206VXR and 2 Sup720-3BXLs in 
separate chassis.  All devices are running the latest 12.4T except for 
the 3BXLs, which are running SRB.
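
For anyone who wants to reason about the mismatch, here is a minimal Python sketch of the check as I understand it from ISO 10589: the IS-IS PDU header carries a one-octet Maximum Area Addresses field, a value of 0 stands for the default of 3, and a receiver discards PDUs whose value does not match its own.  The function names are made up for illustration, and this is not a claim about how IOS implements it.

```python
def effective_max_areas(field_value: int) -> int:
    """Translate the one-octet header field into the number of areas it means.

    Assumption (from my reading of ISO 10589): 0 is shorthand for 3,
    and 1..254 are taken literally.
    """
    if not 0 <= field_value <= 254:
        raise ValueError(f"max-area-addresses out of range: {field_value}")
    return 3 if field_value == 0 else field_value


def pdu_acceptable(local_configured: int, received_field: int) -> bool:
    """Return True when the received PDU's value agrees with the local setting."""
    return effective_max_areas(received_field) == effective_max_areas(local_configured)


if __name__ == "__main__":
    # Local router raised to 254, neighbor still sending 0 (meaning 3).
    print(pdu_acceptable(254, 0))
    # Both sides at the default.
    print(pdu_acceptable(3, 0))
```

With 254 configured locally, a PDU that still carries 0 fails this check, which would be consistent with the BADPACKET errors quoted further down.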

Our NMS alerted me to a problem about 18 hours after the maintenance 
window.  3 of the routers dropped off the network.  As it turned out, 2 
of the 3800s that provide SSL VPN termination and one of the 2800s had 
stopped sending IS-IS routes.  At least I think they stopped *sending* them. 
The routes may have been filtered on the 7600s; I'm not sure.  Can 
anyone refresh my memory on how to see what IS-IS routes are being 
advertised?  I remember OSPF and BGP but not IS-IS.  Back to my story. 
I checked the IS-IS neighbors on the 7600s and found that the System ID 
for each of the affected routers was no longer the value of the hostname 
command like normal but was instead the raw system ID from their NSAP. 
For example:

7613-2.tld     L1   Vl4005    10.64.130.3     UP    9     7613-2.tld.0F
0100.6400.0033 L1   Gi9/1     10.64.0.176     UP    8     0100.6400.0033.02
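
To spot this condition quickly across several routers, something like the following Python sketch could flag neighbor rows whose System ID is still a raw xxxx.xxxx.xxxx value instead of a hostname.  It assumes the column layout shown above (System ID first, at least six columns per neighbor row); the function name is hypothetical.

```python
import re

# A raw IS-IS system ID as displayed: three dot-separated groups of four hex digits.
SYSTEM_ID = re.compile(r"^[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}$")


def unresolved_neighbors(output: str) -> list[str]:
    """Return System IDs from neighbor rows that were not mapped to a hostname."""
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        # Neighbor rows carry System ID, Type, Interface, IP, State, Holdtime.
        # Shorter lines (such as a wrapped Circuit ID) are skipped.
        if len(fields) >= 6 and SYSTEM_ID.match(fields[0]):
            flagged.append(fields[0])
    return flagged


if __name__ == "__main__":
    sample = """\
7613-2.tld     L1   Vl4005    10.64.130.3     UP    9
7613-2.tld.0F
0100.6400.0033 L1   Gi9/1     10.64.0.176     UP    8
0100.6400.0033.02
"""
    print(unresolved_neighbors(sample))
```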

The Circuit ID was also affected.  None of the routes that were supposed 
to be advertised were in the 7600's RIB.  The affected routers did 
however have the routes from the 7600s.  This is why my NMS thought the 
hosts were down.  The routes for Lo0 weren't being propagated.  One of 
the 2800s was still reachable from the NMS only because I had a static 
route to its Lo0 on that 7600.  Removing the static route exposes the 
problem on that router too.

All 4 affected routers had errors similar to this:

022564: Apr 17 19:51:41 CDT: %CLNS-3-BADPACKET: ISIS: L1 LSP, bad 
max-area-addresses 0, ID 0100.6400.0034.00-00, seq 85, ht 0 from 
0018.7425.6500 (GigabitEthernet0/0)
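
For collecting these from syslog, a rough Python sketch like the one below pulls the fields out of the message.  The regular expression is based only on the single message format shown above, so treat it as an assumption rather than a complete description of what IOS can emit; the function and group names are mine.

```python
import re

# Pattern written against the one sample message in this post.
BADPACKET = re.compile(
    r"%CLNS-3-BADPACKET: ISIS: (?P<level>L[12]) LSP, bad max-area-addresses "
    r"(?P<max_areas>\d+), ID (?P<lsp_id>[^,\s]+), seq (?P<seq>\d+), "
    r"ht (?P<ht>\d+) from (?P<src_mac>\S+) \((?P<interface>[^)]+)\)"
)


def parse_badpacket(message: str):
    """Return a dict of fields from a BADPACKET message, or None if no match."""
    # Collapse line wraps and repeated spaces so wrapped log text still matches.
    flat = " ".join(message.split())
    match = BADPACKET.search(flat)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("max_areas", "seq", "ht"):
        info[key] = int(info[key])
    return info


if __name__ == "__main__":
    sample = (
        "022564: Apr 17 19:51:41 CDT: %CLNS-3-BADPACKET: ISIS: L1 LSP, bad\n"
        "max-area-addresses 0, ID 0100.6400.0034.00-00, seq 85, ht 0 from\n"
        "0018.7425.6500 (GigabitEthernet0/0)"
    )
    print(parse_badpacket(sample))
```

Grouping the results by lsp_id and interface should show which originators are still sending the old value and where the LSPs arrive.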

The 7206 and other 3845 that perform border functions were not affected 
at all; they are also L1-2.  To resolve this issue on the 2 3845s I 
rebooted them both.  The problem went away and has not returned.  I have 
not rebooted the 2800s yet.  I opened a TAC case instead.

I do not believe this is a one-off problem.  Not when 4 routers show the 
problem.  One common element is that all 4 routers are in a common VLAN 
that's trunked between the 7600s.  Another common element is that they 
are all L1.  The only routers with more than one area configured are the 
7600s in the core.  For the change I started with the 2800s, then the 
pair of 3800s, the single 3800 and the corresponding 7600.  After the 
adjacencies re-established I did the other 7600 and 7206.

I held off on the reboot to give Cisco an opportunity to get in and look 
at the routers when they are hosed.  This sounds like a bug to me and 
I'd like to see it fixed.  I will have to reboot at least one 2800 
tomorrow though.  I have to restore service to that site.  I'm trying to 
get my TAC engineer to do whatever he needs to do to get someone's eyes 
on the problem before I reboot.

Any ideas?  Thanks
  Justin