[c-nsp] MPLS LDP and BGP Neighbor flapping constantly

Thu Mar 5 13:05:52 EST 2009

You appear to have a high number of input queue drops and input errors,
though granted the counters have never been cleared. Do you have any PPS
graphs of the link between these two boxes? I would suspect a traffic
spike or link fault causing control messages to be dropped is the cause
here.
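
If you don't have graphs, one rough way to tell old errors from current
ones is to clear the counters and watch whether they climb again while
the sessions flap. A sketch only, assuming the interface name from your
7613 output (use GigabitEthernet0/0 on the 7201):

```
! exec mode on the 7613; interface name taken from the quoted output
clear counters GigabitEthernet9/1
! repeat every minute or so and compare the values
show interfaces GigabitEthernet9/1 | include drops|errors|giants
! check session uptime and hello/discovery state for the LDP peer
show mpls ldp neighbor detail
show mpls ldp discovery detail
! see which process is driving the CPU load
show processes cpu sorted
```

If the drops and giants keep incrementing in step with the flaps, that
points at the link or at an input queue overrun rather than at LDP itself.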

Dave.

Justin Shore wrote:
> This afternoon I stumbled across a problem with an LDP session between a
> 7613 and a 7201.  Actually both LDP and iBGP were flapping every 10
> seconds or so.  I had both interfaces configured for MPLS, LDP, IS-IS
> (with AUTH and BFD though BFD isn't enabled on the interface itself yet)
> with an interface MTU of 9000 and CLNS MTU of 1496.  Nothing too fancy.
>  The systems as a whole are configured with MPLS graceful-restart, LDP,
> no mpls ip propagate-ttl, and LDP router-ID on a loopback:
> 
> # 7201
> mpls label protocol ldp
> no mpls ip propagate-ttl
> mpls ldp graceful-restart
> mpls ldp router-id Loopback0 force
> 
> # 7613
> mls mpls tunnel-recir
> mpls traffic-eng tunnels
> mpls ldp graceful-restart
> no mpls ip propagate-ttl
> mpls label protocol ldp
> mpls ldp router-id Loopback0 force
> 
> This morning at 7:05 the router stopped responding to SNMP queries for
> about 15m.  The load was about 13 before.  Cacti shows the load doubling
> in the 10m prior to the 15m of nothing.  When it came back the load was
> just shy of 50 and stayed there for about 30m.  After that it stayed at
> around 30-35 for the next 7.5hrs before I noticed the BGP flapping issue
> and shut down the peer for troubleshooting.  The load dropped back to
> around 16, higher than it was before the hiccup this morning.  I'm at a
> loss to adequately explain why the load has been so jacked.  I think the
> 30-35 load was because of BGP flapping and the slightly higher load now is
> due to the LDP flapping issue.  That's my best guess.
> 
> Anyone know how to troubleshoot an LDP neighbor flapping issue?  The 7613
> is logging this:
> 
> 730278: Mar  4 20:43:48.696 CST: LDP GR: Received FT Sess TLV from
> 10.64.0.34:0  (fl 0x1, rs 0x0, rconn 0, rcov 120000)
> 730279: Mar  4 20:43:48.696 CST: LDP GR: MFI cutover wait delay =
> 600000, Forwarding State Hold Timer = 600000
> 730280: Mar  4 20:43:48.696 CST: LDP GR: searching for down nbr record
> (10.64.0.34:0, 10.64.0.178)
> 730281: Mar  4 20:43:48.696 CST: LDP GR: Added FT Sess TLV (Rconn
> 120000, Rcov 0) to INIT msg to 10.64.0.34:0
> 
> The 7201 is logging this:
> 
> 054705: Mar  5 00:28:19.599 CST: LDP GR: GR session 10.64.0.20:0:: lost
> 054706: Mar  5 00:28:19.599 CST: LDP GR: down nbr 10.64.0.20:0:: created
> [1 total]
> 054707: Mar  5 00:28:19 CST: %LDP-5-GR: GR session 10.64.0.20:0 (inst.
> 3): interrupted--recovery pending
> 054708: Mar  5 00:28:19.599 CST: LDP GR: GR session 10.64.0.20:0::
> bindings retained
> 054709: Mar  5 00:28:19.599 CST: LDP GR: down nbr 10.64.0.20:0:: state
> change (None -> Reconnect-Wait)
> 054710: Mar  5 00:28:19.599 CST: LDP GR: down nbr 10.64.0.20:0::
> reconnect timer started [120000 msecs]
> 054711: Mar  5 00:28:19.599 CST: LDP GR: down nbr 10.64.0.20:0:: added
> to bindings task queue [1 entries]
> 054712: Mar  5 00:28:19 CST: %LDP-5-NBRCHG: LDP Neighbor 10.64.0.20:0
> (0) is DOWN (Received error notification from peer: Shut down)
> 
> 054713: Mar  5 00:28:25.923 CST: LDP GR: searching for down nbr record
> (10.64.0.20:0, 10.64.0.179)
> 054714: Mar  5 00:28:25.923 CST: LDP GR: search for down nbr record
> (10.64.0.20:0, 10.64.0.179) returned 10.64.0.20:0
> 054715: Mar  5 00:28:25.923 CST: LDP GR: Added FT Sess TLV (Rconn 0,
> Rcov 120000) to INIT msg to 10.64.0.20:0
> 054716: Mar  5 00:28:25.947 CST: LDP GR: Received FT Sess TLV from
> 10.64.0.20:0  (fl 0x1, rs 0x0, rconn 120000, rcov 0)
> 054717: Mar  5 00:28:25.947 CST: LDP GR: GR session 10.64.0.20:0::
> established
> 054718: Mar  5 00:28:25.947 CST: LDP GR: GR session 10.64.0.20:0:: found
> down nbr 10.64.0.20:0
> 054719: Mar  5 00:28:25.947 CST: LDP GR: down nbr 10.64.0.20:0::
> reconnect timer stopped
> 054720: Mar  5 00:28:25.947 CST: LDP GR: down nbr 10.64.0.20:0:: state
> change (Reconnect-Wait -> Recovering)
> 054721: Mar  5 00:28:25.947 CST: LDP GR: down nbr 10.64.0.20:0::
> recovery timer started [1 msecs]
> 054722: Mar  5 00:28:25 CST: %LDP-5-GR: GR session 10.64.0.20:0 (inst.
> 4): starting graceful recovery
> 054723: Mar  5 00:28:25 CST: %LDP-5-NBRCHG: LDP Neighbor 10.64.0.20:0
> (4) is UP
> 054724: Mar  5 00:28:25.951 CST: LDP GR: down nbr 10.64.0.20:0::
> recovery timer expired
> 054725: Mar  5 00:28:25 CST: %LDP-5-GR: GR session 10.64.0.20:0 (inst.
> 4): completed graceful recovery
> 054726: Mar  5 00:28:25.951 CST: LDP GR: down nbr 10.64.0.20:0::
> destroying record [0 left]
> 054727: Mar  5 00:28:25.951 CST: LDP GR: down nbr 10.64.0.20:0:: state
> change (Recovering -> Delete-Wait)
> 
> 054728: Mar  5 00:28:28.091 CST: LDP GR: Tagcon querying for up to 12
> bindings update tasks [table 0]
> 054729: Mar  5 00:28:28.091 CST: LDP GR: down nbr 10.64.0.20:0::
> requesting bindings DEL for {10.64.0.20:0, 3}
> 054730: Mar  5 00:28:28.091 CST: LDP GR: down nbr 10.64.0.20:0:: removed
> from bindings task queue [0 entries]
> 054731: Mar  5 00:28:28.091 CST: LDP GR: Requesting 1 bindings update
> tasks [0 left in queue]
> 
> 10.64.0.20 is a loopback on the 7613 and 10.64.0.34 is a loopback on the
> 7201.
> 
> I do have some interface errors which I also can't explain.  They do not
> appear to be incrementing though.  7613:
> 
> GigabitEthernet9/1 is up, line protocol is up (connected)
>   Hardware is C6k 1000Mb 802.3, address is 001a.3063.0a80 (bia
> 001a.3063.0a80)
>   Description: TO 2821-2.dc Gi0/0
>   Internet address is 10.64.0.179/31
>   MTU 9000 bytes, BW 1000000 Kbit, DLY 10 usec,
>      reliability 255/255, txload 1/255, rxload 1/255
>   Encapsulation ARPA, loopback not set
>   Keepalive set (10 sec)
>   Full-duplex, 1000Mb/s
>   input flow-control is off, output flow-control is off
>   Clock mode is auto
>   ARP type: ARPA, ARP Timeout 04:00:00
>   Last input 00:00:02, output 00:00:00, output hang never
>   Last clearing of "show interface" counters never
>   Input queue: 0/75/1936665/7581 (size/max/drops/flushes); Total output
> drops: 4
>   Queueing strategy: fifo
>   Output queue: 0/40 (size/max)
>   5 minute input rate 49000 bits/sec, 17 packets/sec
>   5 minute output rate 56000 bits/sec, 24 packets/sec
>   L2 Switched: ucast: 52903876 pkt, 3771470311 bytes - mcast: 15056043
> pkt, 1653756471 bytes
>   L3 in Switched: ucast: 80170438 pkt, 12709078926 bytes - mcast: 0 pkt,
> 0 bytes mcast
>   L3 out Switched: ucast: 185161821 pkt, 36022953056 bytes mcast: 0 pkt,
> 0 bytes
>      150040994 packets input, 30087625055 bytes, 0 no buffer
>      Received 15660647 broadcasts (0 IP multicasts)
>      30 runts, 4247159 giants, 0 throttles
>      1929071 input errors, 68 CRC, 0 frame, 13 overrun, 0 ignored
>      0 watchdog, 0 multicast, 0 pause input
>      0 input packets with dribble condition detected
>      257650143 packets output, 64726258058 bytes, 0 underruns
>      2 output errors, 0 collisions, 2 interface resets
>      0 babbles, 0 late collision, 0 deferred
>      0 lost carrier, 0 no carrier, 0 PAUSE output
>      0 output buffer failures, 0 output buffers swapped out
> 
> 7201:
> GigabitEthernet0/0 is up, line protocol is up
>   Hardware is MV64460 Internal MAC, address is 0023.5ee9.ac1b (bia
> 0023.5ee9.ac1b)
>   Description: TO 7613-2.clr Gi9/1
>   Internet address is 10.64.0.178/31
>   MTU 9000 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
>      reliability 255/255, txload 1/255, rxload 1/255
>   Encapsulation ARPA, loopback not set
>   Keepalive set (10 sec)
>   Full-duplex, 1000Mb/s, media type is RJ45
>   output flow-control is XON, input flow-control is unsupported
>   ARP type: ARPA, ARP Timeout 04:00:00
>   Last input 00:00:00, output 00:00:00, output hang never
>   Last clearing of "show interface" counters never
>   Input queue: 0/75/3951/0 (size/max/drops/flushes); Total output drops: 6
>   Queueing strategy: fifo
>   Output queue: 0/40 (size/max)
>   5 minute input rate 45000 bits/sec, 19 packets/sec
>   5 minute output rate 64000 bits/sec, 13 packets/sec
>      51466122 packets input, 1916487584 bytes, 0 no buffer
>      Received 1891956 broadcasts, 0 runts, 0 giants, 0 throttles
>      5 input errors, 0 CRC, 0 frame, 0 overrun, 5 ignored
>      0 watchdog, 2247902 multicast, 0 pause input
>      0 input packets with dribble condition detected
>      32927369 packets output, 1549013167 bytes, 0 underruns
>      8 output errors, 0 collisions, 1 interface resets
>      23 unknown protocol drops
>      0 babbles, 0 late collision, 0 deferred
>      8 lost carrier, 0 no carrier, 0 pause output
>      0 output buffer failures, 0 output buffers swapped out
> 
> 
> Any thoughts as to what's going on here?  I can't tell for certain which
> of the 2 routers is causing LDP and BGP to drop.  Knowing that would
> help me narrow my troubleshooting focus.  The 7600 is running SRB1 and
> the 7201 is running 12.4(15)T7.
> 
> Thanks
>  Justin
> 
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/
>