[c-nsp] Sporadic loss of LDP neighbor ...

Mon Dec 12 05:20:49 EST 2011

Hi,

So the section of the core you have issues with is basically a triangle between the three 7200s, right?
-Are the 7200s connected with GigE back-to-back or via a switch?
-How is the ASR1002F connected to this setup?
-Is it connected to BB1 and BB2 to replace BB3?

You said you ran a debug. Were you lucky enough to capture a failure while it was running?

From the log output you posted:
Reading the first line, it appears that BB2 received an error notification from BB1 saying that BB1 is terminating its TCP connection to BB2 because it didn't get any LDP keepalives (sent over the TCP session) during the holddown time
(so either BB2 stopped sending keepalives for the LDP session to BB1, or BB1 just stopped receiving them from BB2; a debug would help to figure out which is the case).
LDP keepalives on Cisco are sent over the TCP session from LDP router-id to LDP router-id, by default every 60s with a session holdtime of 180s
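
Since the flaps are too short to catch live, one option is to pull the %LDP-5-NBRCHG lines from the logs of all the routers and pair each DOWN with the following UP, so the outages can be lined up against each other. A rough Python sketch, assuming the log format is exactly as in your excerpt (with the wrapped lines joined). The function name outages and the year argument are just my own placeholders, since the syslog timestamps carry no year:

```python
import re
from datetime import datetime

# Sample copied from the log excerpt in the original post (wrapped lines joined).
LOG = """\
Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received error notification from peer: Holddown time expired)
Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery Hello Hold Timer expired)
Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP
"""

# Timestamp, neighbor (optional ":<label space>" suffix dropped), state, optional reason.
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d): %LDP-5-NBRCHG: "
    r"LDP Neighbor (?P<peer>\S+?)(?::\d+)? is (?P<state>UP|DOWN)"
    r"(?: \((?P<reason>.*)\))?"
)


def outages(log_text, year=2011):
    """Return (peer, seconds_down, reason) for each DOWN followed by an UP."""
    down = {}     # peer -> (time of DOWN, reason given in the log)
    result = []
    for line in log_text.splitlines():
        m = PATTERN.match(line)
        if not m:
            continue  # not an LDP neighbor change line
        # The year is an assumption; it is only needed to build a full date.
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["state"] == "DOWN":
            down[m["peer"]] = (when, m["reason"])
        elif m["peer"] in down:
            start, reason = down.pop(m["peer"])
            seconds = int((when - start).total_seconds())
            result.append((m["peer"], seconds, reason))
    return result


for peer, secs, reason in outages(LOG):
    print(f"{peer}: down {secs}s ({reason})")
```

On the four lines you posted this reports 8s for BB3 and 56s for BB1. Running it over the logs of BB1, BB2, BB3 and the Core routers should show which neighbors drop together and which timer expires first.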

The second line indicates that BB2 has terminated the TCP connection to BB3 because BB2 didn't get any LDP hello messages from BB3 within the Hello Hold Timer
(so either BB3 stopped sending hellos on the link to BB2, or BB2 just stopped receiving them from BB3; once again a debug would be helpful).
LDP UDP Hellos are by default sent from the interface IP address to 224.0.0.2 with a hello interval of 5s and a holdtime of 15s
-this can be supplemented using the session protection or targeted hellos features, in which case LDP UDP Hellos are also sent from LDP router-id to LDP router-id with a default hello interval of 10s and a holdtime of 90s (with session protection, the protection duration defaults to infinite)
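
To put numbers on the link hello timers: with a 5s interval and a 15s holdtime the neighbor is only dropped once roughly three hellos in a row are missed. A small idealised Python sketch of that hold timer logic, assuming no jitter and treating a hello that arrives exactly at the deadline as on time. hold_timer_expiry and its arguments are made-up names for illustration, not anything from IOS:

```python
def hold_timer_expiry(arrivals, holdtime, observed_until):
    """Return the time (s) at which the hold timer expires, or None.

    arrivals:       sorted receive times of hellos; the first one starts the timer
    holdtime:       seconds the neighbor is kept without hearing a new hello
    observed_until: end of the observation window
    """
    deadline = arrivals[0] + holdtime
    for t in arrivals[1:]:
        if t > deadline:
            return deadline          # timer ran out before this hello arrived
        deadline = t + holdtime      # every received hello restarts the timer
    return deadline if observed_until > deadline else None


HELLO_INTERVAL, HELLO_HOLD = 5, 15   # default link hello timers quoted above

sent = [i * HELLO_INTERVAL for i in range(10)]             # 0, 5, ..., 45
two_lost = [t for t in sent if t not in (15, 20)]          # two hellos dropped
three_lost = [t for t in sent if t not in (15, 20, 25)]    # three hellos dropped

print(hold_timer_expiry(two_lost, HELLO_HOLD, 50))     # None: neighbor survives
print(hold_timer_expiry(three_lost, HELLO_HOLD, 50))   # 25: hold timer expires
```

So a "Discovery Hello Hold Timer expired" message means about 15 seconds with no hello seen at all on that link, not just a single dropped packet.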

adam

-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Garry
Sent: Monday, December 12, 2011 8:39 AM
To: cisco-nsp at puck.nether.net
Subject: [c-nsp] Sporadic loss of LDP neighbor ...

Hi *,

I've been fighting this problem for quite a while, need some ideas from
the collective intelligence ...

One of our backbone locations has multiple routers that have worked fine
for quite a while ... during the last couple months, we've been
experiencing some sporadic failures in the LAN which I've not been able
to pin-point any logical reason for ...

Basic setup is this ... currently, three 7200 routers (2x NPE300 VXR
[BB1 & 2], 1x NPE150 [BB3] for a couple of L2TP wireless links). We've
added an ASR1002F [Core1] to that as new primary router for the location
about a year ago (running a 300M link to our core uplink, 1G dark fiber
link to another backbone location). All of our backbone is running with
MPLS enabled (multiple VRFs for MPLS-VPNs). Everything fine up until
something like 2-3 months ago (don't have an exact date, otherwise it
might be easier to get some correlations to other changes in the configs
or infrastructure). Then it started with sporadic losses of the LAN
interconnections, like this: (log excerpt from BB2)

Dec 11 22:59:31: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is DOWN (Received
error notification from peer: Holddown time expired)
Dec 11 22:59:52: %LDP-5-NBRCHG: LDP Neighbor [BB3]:0 is DOWN (Discovery
Hello Hold Timer expired)
Dec 11 23:00:00: %LDP-5-NBRCHG: LDP Neighbor [BB3] is UP
Dec 11 23:00:27: %LDP-5-NBRCHG: LDP Neighbor [BB1]:0 is UP

These interruptions (at least the timestamps between down and up)
sometimes only last 3-4 seconds, the BB1 one above with almost a minute
is just about the longest I've seen to date. Of course this disrupts
routing to a certain degree ... sometimes even bad enough to take down
iBGP/eBGP multihop connections.

Now, at two other backbone locations, we have more or less the identical
setup, without any of these problems. I've already compared interface
configs, but everything seems identical (apart from IP addresses of
course). Problem here is that it's impossible to analyze any of the
problem causes, as for one the problems occur without any predictable
interval, and they're too short to react to the loss of connection in
time ... I've tried activating some debugs on the router, but couldn't
get any helpful information out of it (at least nothing I could identify)

We've recently added an ASR1001 to the site, which (together with the
1002F) will be used to replace two 7200 routers, and already moved about
half of the existing VLANs of the site (~20 of the 40+) to the ASRs.
Didn't change much, though the interval of the interruptions went to
maybe once every 2 or 3 days (from 1-2 per day). One thing I did notice
is that mostly BB1 router is involved, with 1-2 times out of three BB2
also losing LDP connection at the same time, and BB3 usually not showing
any problems reaching either of the Core routers. BB1 and BB2 will also
lose connectivity to each other most of the time, albeit not always. In
attempting to locate the cause, we already moved BB1 to the same switch
as Core1&2, with no results. Needless to say that there are no
disruptions on Layer 2, at least not as far as could be seen in the logs.

If these problems had manifested themselves when we installed the first
ASR, I'd say it's something in the IOS versions that might be
incompatible, but everything ran fine for something like 9 months, so
that shouldn't be it. I've tried going through config diffs from 4-6
months ago and now, but couldn't find any changes that should break MPLS
on the LAN layer.

Anybody have any idea what might be causing this, or what I should
check into to get to the cause of this problem?

Here's some excerpts from the router configs:

BB1:
interface GigabitEthernet3/0
 mtu 1500
 no ip redirects
 ip route-cache flow
 negotiation auto
 mpls label protocol ldp
 tag-switching mtu 1520
 tag-switching ip

BB2: identical settings

Core1:
interface GigabitEthernet0/0/0
 no ip redirects
 ip flow ingress
 negotiation auto
 mpls ip
 mpls label protocol ldp
 mpls mtu 1520

Thanks, Garry