[c-nsp] Reasons for "random" ISIS flapping?

Wed Aug 7 05:55:11 EDT 2013

On Wed, 2013-08-07 at 11:47 +0300, Saku Ytti wrote:
>  a) are you seeing input drops in the hold-queue? (try 1k or even 4k
>     hold-queue input)

Only one of the interfaces shows any significant number of input queue
drops (122 drops), and the interface that has experienced the most lost
adjacencies has only 1 drop.

ROUTR-A#sh int te5/4 | incl Input queue
  Input queue: 0/256/1/0 (size/max/drops/flushes); Total output drops: 0
ROUTR-A#sh int gi4/2 | incl Input queue      
  Input queue: 0/256/1/0 (size/max/drops/flushes); Total output drops: 309614
ROUTR-A#sh int gi5/1 | incl Input queue
  Input queue: 1/256/122/0 (size/max/drops/flushes); Total output drops: 107672
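
For comparing these counters across more interfaces than fit in a mail,
a small script could pull the fields out of the "Input queue" line. This
is only a sketch: it assumes the line format is exactly as pasted above,
and the function name parse_input_queue is made up for illustration.

```python
import re

# Pattern for the "Input queue" line as it appears in the output above.
# Assumption: the format is the same on every interface and release.
INPUT_QUEUE_RE = re.compile(
    r"Input queue:\s*(\d+)/(\d+)/(\d+)/(\d+)\s*\(size/max/drops/flushes\);"
    r"\s*Total output drops:\s*(\d+)"
)


def parse_input_queue(line):
    """Return the queue counters as a dict, or None if the line does not match."""
    match = INPUT_QUEUE_RE.search(line)
    if match is None:
        return None
    size, maximum, drops, flushes, output_drops = map(int, match.groups())
    return {
        "size": size,
        "max": maximum,
        "drops": drops,
        "flushes": flushes,
        "output_drops": output_drops,
    }


if __name__ == "__main__":
    sample = ("  Input queue: 1/256/122/0 (size/max/drops/flushes); "
              "Total output drops: 107672")
    print(parse_input_queue(sample))
```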

The device in question has a fairly large number of SPD drops (42 ppm)
according to "show ibc", but other devices in the network have much
higher values and no comparable problems. Is ISIS eligible for SPD, or
is it prioritized? Top 5 devices with IBC drops among the C6k's in the
network:

  Actual-paks    Drops   ppm SPD-drops ppm
  ----------- -------- ----- --------- ---
   1807024142  1247536   690   1050616 581
   1985318421 90561796 45616    321319 162
    133750298  2974141 22237      9896  74
   3633687626 57832307 15916    150898  42
   2244275284 13140977  5855     90766  40

The "ROUTR-A" is number 4 on this list.
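
The ppm columns are just the drop counters divided by Actual-paks and
scaled to parts per million. A minimal sketch of that arithmetic,
assuming the values are rounded to the nearest integer (which matches
the figures in the table); the helper name ppm is my own:

```python
def ppm(drops, actual_paks):
    """Drops per million packets, rounded to the nearest integer."""
    return round(drops / actual_paks * 1_000_000)


if __name__ == "__main__":
    # ROUTR-A row from the table above: Actual-paks, Drops, SPD-drops.
    actual_paks, drops, spd_drops = 3633687626, 57832307, 150898
    print(ppm(drops, actual_paks), ppm(spd_drops, actual_paks))
```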

>  b) is it busy running some other process? (try process-max-time 60)

We actually use "process-max-time 50" on all these devices as a general
rule. The affected device is no different from the others in that
regard. On the other hand, might this be too low? Maybe the ISIS process
needs more than 50 ms to parse the hello packets in some unusual case,
and the voluntary yielding (if applicable) means the packets are left
unparsed. Just blind guessing, of course. :-)

>  c) is it software defect

We're planning on upgrading to SXJ in the near future and might go with
15.1SY since others (e.g. Phil) seem to like it.

> I also couldn't help noticing you're running L1, why is this? It seems to
> be quite rare these days, you really have separate core L2 and various L1
> islands?

We only have one area and should actually be using L2 only. We hadn't
thought it through when we decided on L1 many years ago. I'm thinking
that L1-only or L2-only is better than L1+L2 everywhere, and the only
practical drawback of using L1 seems to be the inability to inject a
default route. Any other gotchas we should be worrying about?

-- 
Peter