[j-nsp] eBGP neighbor link failure detection

Thu Mar 20 01:00:08 EDT 2014

It would be one hell of a coincidence to hit the same bug in the NSR/NSF implementations of two different vendors.  That said, stranger things have happened.  There are plenty of other possible causes, though.

What happened in the rest of the network?  Was all traffic black-holed for 3min even though the other border router was able to reach the internet?  Junos only prefers eBGP over iBGP after it has compared local preference and AS path length, so AS path will win unless you’re using local preference internally.  Are your other routers using BGP to reach the edge or another protocol?  If you’re not using iBGP, it could be an IGP or VRRP misconfiguration, or even a bad static route.  Have you checked that your return routes failed over?  I’m assuming you have an ASN and are advertising the same routes to both providers in an active/passive fashion (a big assumption).  If there are advertisement issues, one provider could have continued to draw traffic towards the down link.
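
To make the path-selection point concrete, here is a minimal Python sketch. It assumes a simplified subset of the BGP decision steps (highest local preference, then shortest AS path, then eBGP over iBGP); the real process has more steps (origin, MED, IGP cost and so on). The `Path` class, its field names and the sample values are made up for illustration and do not come from Andy's network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One candidate BGP path for a prefix (hypothetical, simplified)."""
    name: str
    local_pref: int
    as_path_len: int
    learned_via_ebgp: bool


def preference_key(path):
    # The smallest tuple wins: higher local preference first, then the
    # shorter AS path, and only then eBGP ahead of iBGP as a tie-breaker.
    return (-path.local_pref, path.as_path_len, 0 if path.learned_via_ebgp else 1)


def best_path(paths):
    """Return the preferred path under the simplified ordering above."""
    return min(paths, key=preference_key)


# Made-up candidates for a single prefix as seen from one border router.
ebgp_long = Path("eBGP via local provider", 100, 4, True)
ibgp_short = Path("iBGP via other border router", 100, 2, False)
ebgp_preferred = Path("eBGP via local provider, local-pref 200", 200, 4, True)
ibgp_equal = Path("iBGP via other border router, same length", 100, 4, False)

print(best_path([ebgp_long, ibgp_short]).name)
print(best_path([ebgp_preferred, ibgp_short]).name)
print(best_path([ebgp_long, ibgp_equal]).name)
```

With equal local preference the shorter AS path wins even though it was learned over iBGP. Raising local preference overrides that, and the eBGP-over-iBGP step only matters when the earlier attributes tie.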

On Mar 14, 2014, at 11:40 AM, Andy Litzinger <andy.litzinger.lists at gmail.com> wrote:

> Hi Payam,
>  yes the logs clearly show that the failures start after the link goes
> down.
> 
> We have external monitors set up via a 3rd party monitoring ASP that watch
> our various services.  Some, but not all, of them reported connection
> issues for about 4 minutes.  We also run 'rpm' probes from the MX80s toward
> 4 different destinations.  The MX80 that had the peer go down logged all 4
> of these probes failing for 2m 48s.  The probes are set with a
> test-interval of 60, probe-count of 10 and a total-loss threshold of 1.
> 
> Our 2nd MX80 did not have any rpm probes fail.  It is an eBGP peer with
> another provider taking full routes and an iBGP peer with the other MX80.
> 
> Here is the sequence of events (fictional time starting at 0s)
> 0m0s - MX80 A logs interface toward neighbor is down - tfeb0 UPDN msg to
> kernel for ifd:xe-x/x/x, flag:2, speed: 0, duplex:0
> 0m0s - MX80 A logs BGP neighbor state change - rpd[1344]:
> RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
> changed state from Established to Idle (event Stop)
> 0m2s - First rpm icmp probe logs failure - rmopd[1348]: PING_PROBE_FAILED:
> pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
> 0m2s - 3m48s - all icmp probes fail
> 3m48s - First rpm icmp probe succeeds - rmopd[1348]: PING_TEST_COMPLETED:
> pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
> 12m12s - MX80 A logs interface toward neighbor is up -  tfeb0 UPDN msg to
> kernel for ifd:xe-x/x/x, flag:1, speed: 0, duplex:0
> 12m15s - MX80 A logs BGP neighbor state change to Established - rpd[1344]:
> RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
> changed state from OpenConfirm to Established (event RecvKeepAlive)
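
The intervals implied by the sequence of events above can be worked out with a short Python sketch. It uses only the relative timestamps listed in the timeline; the helper names (`to_seconds`, `fmt`) and the event labels are chosen here for illustration.

```python
def to_seconds(stamp):
    """Convert a relative stamp such as '3m48s' into seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds):
    """Format a number of seconds back into the 'XmYs' style."""
    return f"{seconds // 60}m{seconds % 60}s"


# Relative timestamps copied from the sequence of events.
events = {
    "link_down": "0m0s",
    "first_probe_failure": "0m2s",
    "first_probe_success": "3m48s",
    "link_up": "12m12s",
    "bgp_established": "12m15s",
}
t = {name: to_seconds(stamp) for name, stamp in events.items()}

probe_outage = t["first_probe_success"] - t["first_probe_failure"]
link_outage = t["link_up"] - t["link_down"]
bgp_recovery = t["bgp_established"] - t["link_up"]

print("probes failing:", fmt(probe_outage))
print("link down:", fmt(link_outage))
print("BGP re-established after link up:", fmt(bgp_recovery))
```

By these timestamps the probes failed for 3m46s, which is longer than the 2m 48s quoted earlier in the same message, so one of the two figures is probably a typo. Either way, the probes recovered roughly eight minutes before the link itself came back.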
> 
> -andy
> 
> 
> On Thu, Mar 13, 2014 at 5:17 PM, Payam Chychi <pchychi at gmail.com> wrote:
> 
>> Are you sure? I've never seen an update to remove routes take 3min.
>> 
>> Andy, are you sure the 3min outage was after the link went hard down and not
>> just prior? Guessing this is easy to find from the timestamps. I've seen this
>> due to line protocol down, where everything blackholes until BGP fails, but
>> never a 3min wait for route withdrawal (unless this is a peering router with
>> dozens of peers and full routes on each).
>> 
>> Hope you find the root cause.
>> 
>> --
>> Payam Chychi
>> Network Engineer / Security Specialist
>> 
>> On Thursday, March 13, 2014 at 4:50 PM, Andy Litzinger wrote:
>> 
>> Hi Chris,
>> yes, I am taking full routes from this neighbor.
>> 
>> Is there any way to reduce the time it takes to handle the updates? If I
>> wanted to test this behavior in my lab, what would I want to watch
>> (logs/traceoptions)?
>> 
>> I don't think I've seen this behavior during scheduled maintenance, for
>> example when I've deactivated the neighbor config. Am I
>> correct in thinking this is because, in that scenario, even though the RE is
>> taking a while to remove the routes from the FIB, the actual next-hop router
>> is still available and thus the routes are still valid?
>> 
>> -andy
>> 
>> 
>> On Thu, Mar 13, 2014 at 3:54 PM, Chris Adams <cma at cmadams.net> wrote:
>> 
>> Once upon a time, Andy Litzinger <andy.litzinger.lists at gmail.com> said:
>> 
>> what surprised me is that it looks like routes toward that provider were
>> not immediately removed from my routing table. Instead i see evidence of
>> blackholing for almost 3 minutes.
>> 
>> 
>> Are you taking full routes from this neighbor? It takes a while for the
>> routing engine to send updates to remove/replace 200k+ prefixes to the
>> forwarding engine.
>> 
>> --
>> Chris Adams <cma at cmadams.net>
>> _______________________________________________
>> juniper-nsp mailing list juniper-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/juniper-nsp
>> 