[c-nsp] Reasons for "random" ISIS flapping?

Peter Rathlev peter at rathlev.dk
Wed Aug 7 03:40:25 EDT 2013


We're seeing some random ISIS flapping and we can't figure out what the
causes are. It's a really fast "down->up" event lasting less than a
second. We have very few interface errors (~1E-8 worst case) and they
don't appear around the time where these flaps occur. Only ISIS is
affected. ISIS timers are 200ms hello and 1s dead time. The interfaces
also run BFD (200ms, 4x mul) but BFD never reacts to anything here.
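
The timer arithmetic behind those figures can be sketched roughly as below. This is only an illustration: the function names are made up for this example, the BFD part follows the RFC 5880 asynchronous-mode rule (remote multiplier times the larger of the local receive interval and the remote transmit interval), and it assumes both ends of a link carry the same "bfd interval 200 min_rx 100 multiplier 4" line.

```python
# Timer arithmetic for the settings quoted in this post.
# Assumption: both ends of each link use identical IS-IS and BFD values.


def isis_hello_ms(hold_s: float, multiplier: int) -> float:
    """With 'isis hello-interval minimal' the hold time is fixed at 1 s and
    the hello interval is derived from the hello multiplier."""
    return hold_s * 1000 / multiplier


def bfd_detect_ms(local_min_rx_ms: int, remote_tx_ms: int, remote_mult: int) -> int:
    """RFC 5880 asynchronous mode: the remote multiplier times the agreed
    receive interval (the larger of what we accept and what the peer sends)."""
    return remote_mult * max(local_min_rx_ms, remote_tx_ms)


hello = isis_hello_ms(hold_s=1.0, multiplier=5)
detect = bfd_detect_ms(local_min_rx_ms=100, remote_tx_ms=200, remote_mult=4)

print(f"IS-IS hello interval: {hello:.0f} ms, hold time: 1000 ms")
print(f"BFD detection time:   {detect} ms")
```

Under these assumptions the BFD detection time (800 ms) is shorter than the 1 s IS-IS hold time.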

The device at the center of the event ("the affected device") has "show
clns traffic" reporting 20 LSP retransmissions (out of a total of more
than 5 million LSPs sent), which is high compared to the other devices
in the 38 node network, where the typical value is 1.

It doesn't seem to impact traffic forwarding in any measurable way. It
happens maybe once every one or two weeks. It's almost always one
specific router (ROUTR-A) and typically not all neighbor adjacencies go
down, just some of them. CPU load is negligible around the time of the
error. It is higher on many other occasions, and we cannot induce the
error by introducing CPU load via e.g. SNMP or VTY activity.

The devices around this error are a mix of Sup720-3B and Sup720-3C
running SXI1, with one device (one of the neighbors) running SXJ3. The
ones running SXI1 are about to be upgraded.

During troubleshooting I have come up with a couple of questions that I
cannot easily seem to find answers to. And then I thought maybe someone
here could help. :-)

   1) The "CLNS-5-ADJCHANGE" message states a reason at the end of
      the log line. This reason seems to be "hold time expired" on
      the device at the center of the event and "neighbor forgot us"
      on all the neighbors. What's the difference between these
      two? The system message documentation[0] isn't really
      helpful, but maybe I'm looking in the wrong place.

   2) The column "duration" from "show isis spf-log" is
      milliseconds, right? Not seconds? This column normally shows
      0 for PERIODIC events and maybe 4 or 8 for any event on
      other devices. On the affected device this shows 20 for the
      "DELADJ TLVCONTENT" event. Is that bad enough to warrant
      further investigation?

The rest is configuration and logs. Thank you for your patience and help.

Typical interface configuration:

! *** ROUTR-A ***
interface GigabitEthernet4/2
 description ROUTR-B Gi4/1 [CDP]
 mtu 9216
 bandwidth 1000000
 ip address 10.85.250.101 255.255.255.252
 ip pim sparse-mode
 ip router isis 
 mls qos trust dscp
 mpls traffic-eng tunnels
 mpls ip
 storm-control broadcast level 2.00
 bfd interval 200 min_rx 100 multiplier 4
 isis circuit-type level-1
 isis network point-to-point 
 isis metric 21000
 isis hello-multiplier 5
 isis hello-interval minimal
 isis bfd
 hold-queue 256 in
 ip rsvp bandwidth 500000
!

Typical ISIS configuration:

router isis
 net 49.fc00.0000.0008.0001.00
 is-type level-1
 authentication mode md5
 authentication key-chain kc-IGP
 ispf level-1 7
 metric-style wide
 fast-flood 7
 set-overload-bit on-startup 120
 max-lsp-lifetime 65000
 lsp-refresh-interval 65535
 spf-interval 5 1 20
 prc-interval 5 1 20
 lsp-gen-interval 5 1 20
 no hello padding
 log-adjacency-changes all
 redistribute connected metric 50 level-1
 redistribute static ip metric 50 level-1
 passive-interface Loopback0
 passive-interface Loopback1
 bfd all-interfaces
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!

Logs from around one of the errors:

ROUTR-A:
156800: Jul 27 16:38:43.177 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Down, hold time expired
156801: Jul 27 16:38:43.253 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Down, hold time expired
156802: Jul 27 16:38:43.257 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Down, hold time expired
156803: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Up, new adjacency
156804: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Up, new adjacency
156805: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Up, new adjacency

ROUTR-B:
000949: Jul 27 16:38:43.312 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Down, neighbor forgot us
000950: Jul 27 16:38:43.412 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Up, new adjacency

ROUTR-C:
002762: Jul 27 16:38:43.269 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Down, neighbor forgot us
002763: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Up, new adjacency

ROUTR-D:
004057: Jul 27 16:38:43.266 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Down, neighbor forgot us
004058: Jul 27 16:38:43.430 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Up, new adjacency
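
To put numbers on "less than a second", the log lines above can be paired up per adjacency. The sketch below is a throwaway parser, not a general syslog tool: the regular expression and helper names are assumptions made for this example, it only handles the exact line format quoted here, and each duration uses only the logging router's own clock (clocks on different routers may not agree to the millisecond).

```python
import re

# Log lines copied from this post, grouped by the router that logged them.
LOGS = {
    "ROUTR-A": [
        "156800: Jul 27 16:38:43.177 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Down, hold time expired",
        "156801: Jul 27 16:38:43.253 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Down, hold time expired",
        "156802: Jul 27 16:38:43.257 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Down, hold time expired",
        "156803: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Up, new adjacency",
        "156804: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Up, new adjacency",
        "156805: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Up, new adjacency",
    ],
    "ROUTR-B": [
        "000949: Jul 27 16:38:43.312 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Down, neighbor forgot us",
        "000950: Jul 27 16:38:43.412 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Up, new adjacency",
    ],
    "ROUTR-C": [
        "002762: Jul 27 16:38:43.269 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Down, neighbor forgot us",
        "002763: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Up, new adjacency",
    ],
    "ROUTR-D": [
        "004057: Jul 27 16:38:43.266 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Down, neighbor forgot us",
        "004058: Jul 27 16:38:43.430 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Up, new adjacency",
    ],
}

# hh:mm:ss.mmm, time zone, then the ADJCHANGE payload.
PATTERN = re.compile(
    r"(\d+):(\d+):(\d+)\.(\d{3}) \w+: %CLNS-5-ADJCHANGE: ISIS: "
    r"Adjacency to (\S+) \((\S+)\) (Up|Down), (.*)"
)


def to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Time of day in milliseconds (all events here fall on the same day)."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def flap_durations(logs):
    """Map (router, interface) to (down time in ms, reason given for Down)."""
    down_at = {}
    result = {}
    for router, lines in logs.items():
        for line in lines:
            match = PATTERN.search(line)
            if not match:
                continue
            h, mi, s, ms, _neighbor, intf, state, reason = match.groups()
            key = (router, intf)
            now = to_ms(h, mi, s, ms)
            if state == "Down":
                down_at[key] = (now, reason)
            elif key in down_at:
                since, why = down_at.pop(key)
                result[key] = (now - since, why)
    return result


durations = flap_durations(LOGS)
for (router, intf), (ms, why) in durations.items():
    print(f"{router} {intf}: down {ms} ms ({why})")
```

For the lines quoted above this gives outages of between 100 ms and 232 ms per adjacency.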

SPF logs (including one event before and after the incident), a few
lines from "show clns traffic" and some interface statistics:

ROUTR-A:
4d00h          0     38      1                       PERIODIC
3d21h         20     38      2        ROUTR-A.00-00  DELADJ TLVCONTENT
3d21h          0      0      1        ROUTR-B.00-00  TLVCONTENT
3d21h          0     38      2        ROUTR-A.00-00  NEWADJ TLVCONTENT
3d21h          0      5      1        ROUTR-B.00-00  TLVCONTENT
3d06h          0     38      1                       PERIODIC
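
For comparing devices, the SPF log lines can be split into fields as sketched below. The column names (When, Duration, Nodes, Count, First trigger LSP, Triggers) are my assumption of the "show isis spf-log" header, which is not pasted here, and the Duration unit (question 2 above) is left open; the parser and its names are made up for this example.

```python
import re
from typing import List, NamedTuple

# ROUTR-A lines copied from this post (header row not included).
ROUTR_A_LOG = """\
4d00h          0     38      1                       PERIODIC
3d21h         20     38      2        ROUTR-A.00-00  DELADJ TLVCONTENT
3d21h          0      0      1        ROUTR-B.00-00  TLVCONTENT
3d21h          0     38      2        ROUTR-A.00-00  NEWADJ TLVCONTENT
3d21h          0      5      1        ROUTR-B.00-00  TLVCONTENT
3d06h          0     38      1                       PERIODIC
"""


class SpfRun(NamedTuple):
    when: str
    duration: int           # unit as reported by the router
    nodes: int
    count: int
    first_trigger_lsp: str  # empty for PERIODIC runs
    triggers: List[str]


# An LSP ID looks like NAME.pp-ff, e.g. ROUTR-A.00-00.
LSP_ID = re.compile(r"^\S+\.\d\d-\d\d$")


def parse_spf_line(line: str) -> SpfRun:
    fields = line.split()
    rest = fields[4:]
    # The LSP column is blank on PERIODIC runs, so detect it by shape.
    lsp = rest.pop(0) if rest and LSP_ID.match(rest[0]) else ""
    return SpfRun(fields[0], int(fields[1]), int(fields[2]), int(fields[3]), lsp, rest)


runs = [parse_spf_line(l) for l in ROUTR_A_LOG.splitlines() if l.strip()]
slowest = max(runs, key=lambda r: r.duration)
print(f"slowest run: {slowest.duration} at {slowest.when}, "
      f"triggers {' '.join(slowest.triggers)}")
```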

IS-IS: LSP Retransmissions: 20
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0

GigabitEthernet4/2 is up, line protocol is up (connected)
  Description: ROUTR-B Gi4/1 [CDP]
     453098284310 packets input, 356682852343695 bytes, 1 no buffer
     102 input errors, 22 CRC, 22 frame, 4 overrun, 0 ignored

GigabitEthernet5/1 is up, line protocol is up (connected)
  Description: ROUTR-C Gi5/2 [CDP]
     373105238327 packets input, 202072425988402 bytes, 1 no buffer
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

TenGigabitEthernet5/4 is up, line protocol is up (connected)
  Description: ROUTR-D Te5/5 [CDP]
     32251187100 packets input, 19660975384814 bytes, 1 no buffer
     355 input errors, 1 CRC, 1 frame, 0 overrun, 0 ignored

      ================================================

ROUTR-B:
3d21h          0     38      1                       PERIODIC
3d20h          8     38      2        ROUTR-B.00-00  DELADJ TLVCONTENT
3d20h          0      0      1        ROUTR-A.00-00  TLVCONTENT
3d20h          4     38      3        ROUTR-A.00-00  NEWADJ TLVCONTENT
3d02h          0     38      1                       PERIODIC

IS-IS: LSP Retransmissions: 1
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0

GigabitEthernet4/1 is up, line protocol is up (connected)
  Description: ROUTR-A Gi4/2 [CDP]
     119726592110 packets input, 65794758629274 bytes, 0 no buffer
     9 input errors, 4 CRC, 4 frame, 0 overrun, 0 ignored

      ================================================

ROUTR-C:
3d23h          0     38      1                       PERIODIC
3d20h          0      4      1        ROUTR-A.00-00  TLVCONTENT
3d20h          0      0      1        ROUTR-B.00-00  TLVCONTENT
3d20h          0      0      1        ROUTR-A.00-00  TLVCONTENT
3d20h          0      5      1        ROUTR-B.00-00  TLVCONTENT
3d05h          0     38      1                       PERIODIC

IS-IS: LSP Retransmissions: 6
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0

GigabitEthernet5/2 is up, line protocol is up (connected)
  Description: ROUTR-A Gi5/1 [CDP]
     422223542188 packets input, 373263729698271 bytes, 0 no buffer
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

      ================================================

ROUTR-D:
3d23h          0     38      1                       PERIODIC
3d20h          0      4      1        ROUTR-A.00-00  TLVCONTENT
3d20h          0      0      1        ROUTR-B.00-00  TLVCONTENT
3d20h          0      0      1        ROUTR-A.00-00  TLVCONTENT
3d20h          0      5      1        ROUTR-B.00-00  TLVCONTENT
3d04h          0     38      1                       PERIODIC

IS-IS: LSP Retransmissions: 2
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0

TenGigabitEthernet5/5 is up, line protocol is up (connected)
  Description: ROUTR-A Te5/4 [CDP]
     24459802198 packets input, 10916479689302 bytes, 0 no buffer
     2 input errors, 2 CRC, 1 frame, 0 overrun, 0 ignored
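
For reference, the "~1E-8 worst case" error figure follows from the counters above as input errors divided by packets input. A quick sketch, with the numbers copied from the interface output in this post (the short labels are just abbreviations used for this example):

```python
# (input errors, packets input) per interface, copied from the output above.
counters = {
    "ROUTR-A Gi4/2": (102, 453098284310),
    "ROUTR-A Gi5/1": (0, 373105238327),
    "ROUTR-A Te5/4": (355, 32251187100),
    "ROUTR-B Gi4/1": (9, 119726592110),
    "ROUTR-C Gi5/2": (0, 422223542188),
    "ROUTR-D Te5/5": (2, 24459802198),
}

# Fraction of received packets counted as input errors.
ratios = {name: errs / pkts for name, (errs, pkts) in counters.items()}
worst = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {ratio:.2e}")
print(f"worst: {worst}")
```

The worst interface by this measure is ROUTR-A Te5/4, at roughly 1.1E-8.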


[0]: http://www.cisco.com/en/US/docs/ios/system/messages/guide/sm_cn03.html#wp605882

-- 
Peter



