[c-nsp] Reasons for "random" ISIS flapping?
Peter Rathlev
peter at rathlev.dk
Wed Aug 7 03:40:25 EDT 2013
We're seeing some random ISIS flapping and we can't figure out what the
causes are. It's a really fast "down->up" event lasting less than a
second. We have very few interface errors (~1E-8 worst case) and they
don't appear around the time where these flaps occur. Only ISIS is
affected. ISIS timers are 200ms hello and 1s dead time. The interfaces
also run BFD (200ms, 4x mul) but BFD never reacts to anything here.
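
For reference, the timer values quoted above follow from the interface configuration shown later in this post (isis hello-multiplier 5, isis hello-interval minimal, bfd interval 200 min_rx 100 multiplier 4). Here is a minimal Python sketch of that arithmetic. It assumes the neighbor uses the same BFD settings and uses the standard asynchronous-mode detection formula. The function names are made up for illustration.

```python
# Timer arithmetic for the settings described in this post. The function
# names are illustrative; nothing here is read from the routers.

def isis_timers(hello_multiplier, hold_time_s=1.0):
    """With 'isis hello-interval minimal' the hold time is 1 second and the
    hello interval is derived from it as hold_time / hello_multiplier."""
    return hold_time_s / hello_multiplier, hold_time_s


def bfd_detect_ms(local_min_rx_ms, remote_tx_ms, remote_multiplier):
    """Asynchronous-mode detection time: the remote multiplier times the
    negotiated interval, max(local min_rx, remote desired tx)."""
    return remote_multiplier * max(local_min_rx_ms, remote_tx_ms)


hello_s, hold_s = isis_timers(hello_multiplier=5)

# Assumption: the neighbor is configured with the same BFD values.
detect_ms = bfd_detect_ms(local_min_rx_ms=100, remote_tx_ms=200, remote_multiplier=4)

print(f"ISIS hello {hello_s * 1000:.0f} ms, hold {hold_s * 1000:.0f} ms")
print(f"BFD detection {detect_ms} ms")
```

Under these assumptions the BFD detection time (800 ms) is shorter than the ISIS hold time (1 s), so a loss of forwarding long enough to expire the hold timer would normally be expected to trip BFD as well.
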
The device at the center of the event ("the affected device") has "show
clns traffic" reporting 20 LSP retransmissions (out of a total of more
than 5 million LSPs sent), which is high compared to the other devices
in the 38 node network, where the typical value is 1.
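
To put the retransmission count in proportion, here is a small Python sketch using only the counters quoted in this post. The sent count is given only as "more than 5m", so 5,000,000 is treated as a lower bound and the resulting ratio is an upper bound. The variable names are illustrative.

```python
# LSP retransmission counters quoted in this post ("show clns traffic").
retransmissions = {"ROUTR-A": 20, "ROUTR-B": 1, "ROUTR-C": 6, "ROUTR-D": 2}

# "more than 5m" LSPs sent on ROUTR-A; the exact figure is not given.
lsps_sent_lower_bound = 5_000_000

# Because the sent count is a lower bound, this ratio is an upper bound.
max_ratio = retransmissions["ROUTR-A"] / lsps_sent_lower_bound

typical = 1  # typical per-device value stated in the post
factor = retransmissions["ROUTR-A"] // typical

print(f"ROUTR-A retransmitted at most {max_ratio:.1e} of its LSPs")
print(f"ROUTR-A is at {factor}x the typical retransmission count")
```
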
It doesn't seem to impact traffic forwarding in any measurable way. It
happens maybe once every one or two weeks. It's almost always one
specific router (ROUTR-A) and typically not all neighbor adjacencies go
down, just some of them. CPU load is negligible around the time of
error. It is higher on many other occasions and we cannot induce the
error by introducing CPU load via e.g. SNMP or VTY activity.
The devices around this error are a mix of Sup720-3B and Sup720-3C
running SXI1, with one device (one of the neighbors) running SXJ3. The
ones running SXI1 are due to be upgraded soon.
During troubleshooting I have come up with a couple of questions that I
cannot easily seem to find answers to. And then I thought maybe someone
here could help. :-)
1) The "CLNS-5-ADJCHANGE" states a reason at the end of the log
message. This reason seems to be "hold time expired" for the
device at the center of the event and "neighbor forgot us"
for all the neighbors. What's the difference between these
two? The system message documentation[0] isn't really
helpful, but maybe I'm looking in the wrong place.
2) The column "duration" from "show isis spf-log" is
milliseconds, right? Not seconds? This column normally shows
0 for PERIODIC events and maybe 4 or 8 for any event on
other devices. On the affected device this shows 20 for the
"DELADJ TLVCONTENT" event. Is that bad enough to warrant
further investigation?
The rest is configuration and logs. Thank you for your patience and help.
Typical interface configuration:
! *** ROUTR-A ***
interface GigabitEthernet4/2
 description ROUTR-B Gi4/1 [CDP]
 mtu 9216
 bandwidth 1000000
 ip address 10.85.250.101 255.255.255.252
 ip pim sparse-mode
 ip router isis
 mls qos trust dscp
 mpls traffic-eng tunnels
 mpls ip
 storm-control broadcast level 2.00
 bfd interval 200 min_rx 100 multiplier 4
 isis circuit-type level-1
 isis network point-to-point
 isis metric 21000
 isis hello-multiplier 5
 isis hello-interval minimal
 isis bfd
 hold-queue 256 in
 ip rsvp bandwidth 500000
!
Typical ISIS configuration:
router isis
 net 49.fc00.0000.0008.0001.00
 is-type level-1
 authentication mode md5
 authentication key-chain kc-IGP
 ispf level-1 7
 metric-style wide
 fast-flood 7
 set-overload-bit on-startup 120
 max-lsp-lifetime 65000
 lsp-refresh-interval 65535
 spf-interval 5 1 20
 prc-interval 5 1 20
 lsp-gen-interval 5 1 20
 no hello padding
 log-adjacency-changes all
 redistribute connected metric 50 level-1
 redistribute static ip metric 50 level-1
 passive-interface Loopback0
 passive-interface Loopback1
 bfd all-interfaces
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!
Logs from around one of the errors:
ROUTR-A:
156800: Jul 27 16:38:43.177 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Down, hold time expired
156801: Jul 27 16:38:43.253 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Down, hold time expired
156802: Jul 27 16:38:43.257 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Down, hold time expired
156803: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Up, new adjacency
156804: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Up, new adjacency
156805: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Up, new adjacency
ROUTR-B:
000949: Jul 27 16:38:43.312 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Down, neighbor forgot us
000950: Jul 27 16:38:43.412 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Up, new adjacency
ROUTR-C:
002762: Jul 27 16:38:43.269 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Down, neighbor forgot us
002763: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Up, new adjacency
ROUTR-D:
004057: Jul 27 16:38:43.266 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Down, neighbor forgot us
004058: Jul 27 16:38:43.430 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Up, new adjacency
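
The four logs above are easier to compare when merged into one timeline. The following Python sketch parses the ADJCHANGE lines exactly as quoted, sorts them by timestamp and reports how long each adjacency was down. It assumes the router clocks are synchronized closely enough for millisecond comparison, and it takes the year (2013) from the date of this post. The regular expression and helper names are illustrative.

```python
import re
from datetime import datetime, timedelta

# ADJCHANGE lines copied from the logs above, keyed by the router that logged them.
LOGS = {
    "ROUTR-A": """
156800: Jul 27 16:38:43.177 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Down, hold time expired
156801: Jul 27 16:38:43.253 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Down, hold time expired
156802: Jul 27 16:38:43.257 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Down, hold time expired
156803: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-B (GigabitEthernet4/2) Up, new adjacency
156804: Jul 27 16:38:43.409 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-C (GigabitEthernet5/1) Up, new adjacency
156805: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-D (TenGigabitEthernet5/4) Up, new adjacency
""",
    "ROUTR-B": """
000949: Jul 27 16:38:43.312 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Down, neighbor forgot us
000950: Jul 27 16:38:43.412 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet4/1) Up, new adjacency
""",
    "ROUTR-C": """
002762: Jul 27 16:38:43.269 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Down, neighbor forgot us
002763: Jul 27 16:38:43.413 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (GigabitEthernet5/2) Up, new adjacency
""",
    "ROUTR-D": """
004057: Jul 27 16:38:43.266 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Down, neighbor forgot us
004058: Jul 27 16:38:43.430 CEST: %CLNS-5-ADJCHANGE: ISIS: Adjacency to ROUTR-A (TenGigabitEthernet5/5) Up, new adjacency
""",
}

PATTERN = re.compile(
    r"^\d+: (?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}) \w+: %CLNS-5-ADJCHANGE: "
    r"ISIS: Adjacency to (?P<peer>\S+) \((?P<ifname>[^)]+)\) "
    r"(?P<state>Up|Down), (?P<reason>.+)$"
)


def parse(logger, line):
    """Turn one log line into (time, logging router, peer, state, reason)."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    # The log lines carry no year; 2013 is assumed from the date of the post.
    when = datetime.strptime("2013 " + m["ts"], "%Y %b %d %H:%M:%S.%f")
    return when, logger, m["peer"], m["state"], m["reason"]


events = sorted(
    parse(logger, line)
    for logger, text in LOGS.items()
    for line in text.strip().splitlines()
)

# Time each adjacency spent down, as seen by the router that logged it.
down_at, outage_ms = {}, {}
for when, logger, peer, state, _reason in events:
    key = (logger, peer)
    if state == "Down":
        down_at[key] = when
    elif key in down_at:
        outage_ms[key] = (when - down_at.pop(key)) // timedelta(milliseconds=1)

start = events[0][0]
for when, logger, peer, state, reason in events:
    offset = (when - start) // timedelta(milliseconds=1)
    print(f"+{offset:3d} ms  {logger} -> {peer}: {state} ({reason})")

for (logger, peer), ms in sorted(outage_ms.items()):
    print(f"{logger} -> {peer}: down for {ms} ms")
```

If the clocks can be trusted, ROUTR-A logs the first Down event and each adjacency stays down for between 100 and 232 ms according to these timestamps.
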
SPF logs (including one event before and after the incident), a few
lines from "show clns traffic" and some interface statistics:
ROUTR-A:
4d00h 0 38 1 PERIODIC
3d21h 20 38 2 ROUTR-A.00-00 DELADJ TLVCONTENT
3d21h 0 0 1 ROUTR-B.00-00 TLVCONTENT
3d21h 0 38 2 ROUTR-A.00-00 NEWADJ TLVCONTENT
3d21h 0 5 1 ROUTR-B.00-00 TLVCONTENT
3d06h 0 38 1 PERIODIC
IS-IS: LSP Retransmissions: 20
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0
GigabitEthernet4/2 is up, line protocol is up (connected)
Description: ROUTR-B Gi4/1 [CDP]
453098284310 packets input, 356682852343695 bytes, 1 no buffer
102 input errors, 22 CRC, 22 frame, 4 overrun, 0 ignored
GigabitEthernet5/1 is up, line protocol is up (connected)
Description: ROUTR-C Gi5/2 [CDP]
373105238327 packets input, 202072425988402 bytes, 1 no buffer
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
TenGigabitEthernet5/4 is up, line protocol is up (connected)
Description: ROUTR-D Te5/5 [CDP]
32251187100 packets input, 19660975384814 bytes, 1 no buffer
355 input errors, 1 CRC, 1 frame, 0 overrun, 0 ignored
================================================
ROUTR-B:
3d21h 0 38 1 PERIODIC
3d20h 8 38 2 ROUTR-B.00-00 DELADJ TLVCONTENT
3d20h 0 0 1 ROUTR-A.00-00 TLVCONTENT
3d20h 4 38 3 ROUTR-A.00-00 NEWADJ TLVCONTENT
3d02h 0 38 1 PERIODIC
IS-IS: LSP Retransmissions: 1
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0
GigabitEthernet4/1 is up, line protocol is up (connected)
Description: ROUTR-A Gi4/2 [CDP]
119726592110 packets input, 65794758629274 bytes, 0 no buffer
9 input errors, 4 CRC, 4 frame, 0 overrun, 0 ignored
================================================
ROUTR-C:
3d23h 0 38 1 PERIODIC
3d20h 0 4 1 ROUTR-A.00-00 TLVCONTENT
3d20h 0 0 1 ROUTR-B.00-00 TLVCONTENT
3d20h 0 0 1 ROUTR-A.00-00 TLVCONTENT
3d20h 0 5 1 ROUTR-B.00-00 TLVCONTENT
3d05h 0 38 1 PERIODIC
IS-IS: LSP Retransmissions: 6
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0
GigabitEthernet5/2 is up, line protocol is up (connected)
Description: ROUTR-A Gi5/1 [CDP]
422223542188 packets input, 373263729698271 bytes, 0 no buffer
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
================================================
ROUTR-D:
3d23h 0 38 1 PERIODIC
3d20h 0 4 1 ROUTR-A.00-00 TLVCONTENT
3d20h 0 0 1 ROUTR-B.00-00 TLVCONTENT
3d20h 0 0 1 ROUTR-A.00-00 TLVCONTENT
3d20h 0 5 1 ROUTR-B.00-00 TLVCONTENT
3d04h 0 38 1 PERIODIC
IS-IS: LSP Retransmissions: 2
IS-IS: LSP checksum errors received: 0
IS-IS: Update process packets dropped: 0
TenGigabitEthernet5/5 is up, line protocol is up (connected)
Description: ROUTR-A Te5/4 [CDP]
24459802198 packets input, 10916479689302 bytes, 0 no buffer
2 input errors, 2 CRC, 1 frame, 0 overrun, 0 ignored
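
The "~1E-8 worst case" error rate mentioned at the top can be rechecked from the interface counters above. This Python sketch divides input errors by input packets for each interface. The numbers are copied from the "show interface" excerpts in this post, and the dictionary layout is only illustrative.

```python
# (router, interface) -> (packets input, input errors), copied from the
# interface statistics quoted in this post.
COUNTERS = {
    ("ROUTR-A", "GigabitEthernet4/2"): (453098284310, 102),
    ("ROUTR-A", "GigabitEthernet5/1"): (373105238327, 0),
    ("ROUTR-A", "TenGigabitEthernet5/4"): (32251187100, 355),
    ("ROUTR-B", "GigabitEthernet4/1"): (119726592110, 9),
    ("ROUTR-C", "GigabitEthernet5/2"): (422223542188, 0),
    ("ROUTR-D", "TenGigabitEthernet5/5"): (24459802198, 2),
}

# Input errors per input packet for each interface.
ratios = {key: errors / packets for key, (packets, errors) in COUNTERS.items()}
worst = max(ratios, key=ratios.get)

for (router, ifname), ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{router:8} {ifname:24} {ratio:.2e}")

print("worst:", worst)
```
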
[0]: http://www.cisco.com/en/US/docs/ios/system/messages/guide/sm_cn03.html#wp605882
--
Peter