[j-nsp] eBGP neighbor link failure detection

Andy Litzinger andy.litzinger.lists at gmail.com
Fri Mar 14 11:40:17 EDT 2014


Hi Payam,
  Yes, the logs clearly show that the failures start after the link goes
down.

We have external monitors set up via a 3rd party monitoring ASP that watch
our various services.  Some, but not all, of them reported connection
issues for about 4 minutes.  We also run 'rpm' probes from the MX80s toward
4 different destinations.  The MX80 that had the peer go down logged all 4
of these probes failing for 2m 48s.  The probes are set with a
test-interval of 60, probe-count of 10 and a total-loss threshold of 1.

Our 2nd MX80 did not have any rpm probes fail.  It is an eBGP peer with
another provider taking full routes and an iBGP peer with the other MX80.

Here is the sequence of events (fictional time starting at 0s):
0m0s - MX80 A logs interface toward neighbor is down - tfeb0 UPDN msg to
kernel for ifd:xe-x/x/x, flag:2, speed: 0, duplex:0
0m0s - MX80 A logs BGP neighbor state change - rpd[1344]:
RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
changed state from Established to Idle (event Stop)
0m2s - First rpm icmp probe logs failure - rmopd[1348]: PING_PROBE_FAILED:
pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
0m2s - 3m48s - all icmp probes fail
3m48s - First rpm icmp probe succeeds - rmopd[1348]: PING_TEST_COMPLETED:
pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
12m12s - MX80 A logs interface toward neighbor is up -  tfeb0 UPDN msg to
kernel for ifd:xe-x/x/x, flag:1, speed: 0, duplex:0
12m15s - MX80 A logs BGP neighbor state change to Established - rpd[1344]:
RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
changed state from OpenConfirm to Established (event RecvKeepAlive)
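
For anyone who wants to check the arithmetic on the timeline above, here is
a minimal Python sketch that turns the relative timestamps into durations.
The timestamps are copied from the sequence of events; the helper and
variable names (to_seconds, fmt, events) are made up for illustration and
are not part of any Junos tooling.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert a relative timestamp such as '3m48s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def fmt(seconds: int) -> str:
    """Format a number of seconds back into the 'XmYs' style used above."""
    return f"{seconds // 60}m{seconds % 60}s"


# Relative timestamps copied from the sequence of events in this message.
events = {
    "link_down": "0m0s",
    "first_probe_failure": "0m2s",
    "first_probe_success": "3m48s",
    "link_up": "12m12s",
    "bgp_established": "12m15s",
}
t = {name: to_seconds(stamp) for name, stamp in events.items()}

# How long the rpm probes were failing.
probe_failure_window = t["first_probe_success"] - t["first_probe_failure"]
# How long the physical link was down.
link_outage = t["link_up"] - t["link_down"]
# How long BGP took to reach Established once the link was back.
bgp_recovery = t["bgp_established"] - t["link_up"]

print("probes failing:", fmt(probe_failure_window))
print("link down:     ", fmt(link_outage))
print("BGP recovery:  ", fmt(bgp_recovery))
```

The point of splitting it out this way is that the probe failure window and
the link outage are separate intervals: the probes recovered while the link
was still down.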

-andy


On Thu, Mar 13, 2014 at 5:17 PM, Payam Chychi <pchychi at gmail.com> wrote:

>  Are you sure? I've never seen an update to remove routes take 3 min.
>
> Andy, are you sure the 3 min outage was after the link went hard down and
> not just prior? I'm guessing this is easy to find from the time stamps. I've
> seen this due to line protocol down, where everything blackholes until BGP
> fails, but never a 3 min wait for route withdrawal (unless this is a peering
> router with dozens of peers and full routes on each).
>
> Hope you find the root cause.
>
> --
> Payam Chychi
> Network Engineer / Security Specialist
>
> On Thursday, March 13, 2014 at 4:50 PM, Andy Litzinger wrote:
>
> Hi Chris,
> Yes, I am taking full routes from this neighbor.
>
> Is there any way to reduce the time it takes to handle the updates? If I
> wanted to test this behavior in my lab, what would I want to watch?
> (logs/traceoptions)
>
> I don't think I've seen this behavior during scheduled maintenance, for
> example during times when I've deactivated the neighbor config. Am I
> correct in thinking this is because in this scenario, even though the RE is
> taking a while to remove the routes from the FIB, the actual next hop router
> is still available and thus the routes are still valid?
>
> -andy
>
>
> On Thu, Mar 13, 2014 at 3:54 PM, Chris Adams <cma at cmadams.net> wrote:
>
> Once upon a time, Andy Litzinger <andy.litzinger.lists at gmail.com> said:
>
> What surprised me is that it looks like routes toward that provider were
> not immediately removed from my routing table. Instead I see evidence of
> blackholing for almost 3 minutes.
>
>
> Are you taking full routes from this neighbor? It takes a while for the
> routing engine to send updates to remove/replace 200k+ prefixes to the
> forwarding engine.
>
> --
> Chris Adams <cma at cmadams.net>
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp

