[j-nsp] eBGP neighbor link failure detection

Andy Litzinger andy.litzinger.lists at gmail.com
Thu Mar 20 15:48:54 EDT 2014


Hi Keegan,
  I think it would turn out to be a different bug from the one John
described (in his case GR wasn't even configured on one end of the neighbor
relationship), but still in the same feature.

As far as we can tell, the rest of the network mostly went along happily.
 These MX80s sit at the edge of our network.  Toward the internal part
of our network they host a VRRP link, which is used as the default route for
downstream devices.  MX80-A is generally the VRRP master, but there was a
VRRP switchover, as expected, when the link to ISP-A went down, since we
are tracking that interface.
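
For reference, a minimal sketch of that kind of VRRP interface tracking
(the interface names, addresses, group number, and priority values are made
up for illustration):

    interfaces {
        ge-1/0/0 {
            unit 0 {
                family inet {
                    address 10.0.0.2/24 {
                        vrrp-group 10 {
                            virtual-address 10.0.0.1;
                            priority 200;    # MX80-A normally wins mastership
                            preempt;
                            track {
                                interface xe-0/0/0 {
                                    # demote below MX80-B (assumed priority
                                    # 100) when the ISP-A link goes down
                                    priority-cost 150;
                                }
                            }
                        }
                    }
                }
            }
        }
    }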

We are using our ISPs in an active/active fashion.  For the most part we
don't influence the routes too much.  We do apply a higher local preference
to some routes received from ISP-A (the ISP that had the issue).  At least
one of the IPs we use for rpm probes relies on a route that we apply the
higher local preference to.  Or at least it does today; I have to presume
the same was true during the issue.
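
Roughly, the import policy looks like this (the policy and prefix-list names
and the 192.0.2.0/24 destination are placeholders; 200 beats the Junos
default local preference of 100):

    policy-options {
        prefix-list PREFER-VIA-ISP-A {
            192.0.2.0/24;    # covers at least one rpm probe target
        }
        policy-statement ISP-A-IN {
            term prefer-these {
                from {
                    prefix-list PREFER-VIA-ISP-A;
                }
                then {
                    local-preference 200;
                    accept;
                }
            }
        }
    }
    protocols {
        bgp {
            group ISP-A {
                import ISP-A-IN;
            }
        }
    }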

As I mentioned before, only the rpm probes on MX80-A, connected to ISP-A,
failed (and they all failed).  On MX80-B they all continued to succeed.
 This is a little odd, since if my current theory of stale routes is
correct, I would expect the probe that relied on the route with the higher
local preference via MX80-A to also fail from MX80-B.
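
If it happens again, one way I could try to confirm the stale-route theory
(assuming 192.0.2.1 is the probe target; the Stale flag only appears while
graceful restart is holding routes):

    show route 192.0.2.1 extensive    # GR-retained routes are flagged Stale
    show bgp neighbor x.x.x.x         # shows whether GR was negotiated and
                                      # how long stale routes are kept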

-andy


On Wed, Mar 19, 2014 at 10:00 PM, Keegan Holley <no.spam at comcast.net> wrote:

> That would be one hell of a coincidence, to have the same bug in
> different implementations of NSR/NSF across two different vendors.  That
> said, stranger things literally have happened.  There are a bunch of other
> possible causes, though.
>
> What happened in the rest of the network?  Was all traffic black-holed for
> 3 min even though the other border router was able to reach the internet?
> Juniper gear doesn't prefer eBGP over iBGP by default, so AS path will
> win unless you're using local preference internally.  Are your other
> routers using BGP to reach the edge or another protocol?  It could be an
> IGP or VRRP misconfiguration if you're not using iBGP, or even a bad
> static route.  Have you checked to make sure your return routes failed
> over?  I'm assuming you have an ASN and are advertising the same routes to
> both providers in an active/passive fashion (a big assumption).  If there
> are advertisement issues, one provider could have continued to draw
> traffic toward the down link.
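>
> A quick sanity check on the advertisement side would be something like the
> following (the peer addresses are placeholders):
>
>     show route advertising-protocol bgp <ISP-A-peer>   # what you send ISP-A
>     show route advertising-protocol bgp <ISP-B-peer>   # what you send ISP-B
>     show route receive-protocol bgp <ISP-B-peer>       # what ISP-B sends you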
>
> On Mar 14, 2014, at 11:40 AM, Andy Litzinger <
> andy.litzinger.lists at gmail.com> wrote:
>
> > Hi Payam,
> > Yes, the logs clearly show that the failures start after the link goes
> > down.
> >
> > We have external monitors set up via a 3rd-party monitoring ASP that
> > watch our various services.  Some, but not all, of them reported
> > connection issues for about 4 minutes.  We also run 'rpm' probes from the
> > MX80s toward 4 different destinations.  The MX80 that had the peer go
> > down logged all 4 of these probes failing for 2m 48s.  The probes are set
> > with a test-interval of 60, probe-count of 10, and a total-loss threshold
> > of 1.
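> >
> > For context, the probe config is along these lines (the owner and test
> > names match the log entries below; the target address is a placeholder):
> >
> >     services {
> >         rpm {
> >             probe ICMP-Probe {
> >                 test our-probe-name {
> >                     probe-type icmp-ping;
> >                     target address 192.0.2.1;
> >                     probe-count 10;
> >                     test-interval 60;
> >                     thresholds {
> >                         total-loss 1;    # any lost probe fails the test
> >                     }
> >                 }
> >             }
> >         }
> >     }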
> >
> > Our 2nd MX80 did not have any rpm probes fail.  It is an eBGP peer with
> > another provider taking full routes and an iBGP peer with the other MX80.
> >
> > Here is the sequence of events (fictional times, starting at 0s):
> > 0m0s - MX80 A logs interface toward neighbor is down - tfeb0 UPDN msg to
> > kernel for ifd:xe-x/x/x, flag:2, speed: 0, duplex:0
> > 0m0s - MX80 A logs BGP neighbor state change - rpd[1344]:
> > RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
> > changed state from Established to Idle (event Stop)
> > 0m2s - First rpm icmp probe logs failure - rmopd[1348]: PING_PROBE_FAILED:
> > pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
> > 0m2s - 3m48s - all icmp probes fail
> > 3m48s - First rpm icmp probe succeeds - rmopd[1348]: PING_TEST_COMPLETED:
> > pingCtlOwnerIndex = ICMP-Probe, pingCtlTestName = our-probe-name
> > 12m12s - MX80 A logs interface toward neighbor is up - tfeb0 UPDN msg to
> > kernel for ifd:xe-x/x/x, flag:1, speed: 0, duplex:0
> > 12m15s - MX80 A logs BGP neighbor state change to Established - rpd[1344]:
> > RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer x.x.x.x (External AS YYYYY)
> > changed state from OpenConfirm to Established (event RecvKeepAlive)
> >
> > -andy
> >
> >
> > On Thu, Mar 13, 2014 at 5:17 PM, Payam Chychi <pchychi at gmail.com> wrote:
> >
> >> Are you sure?  I've never seen an update to remove routes take 3 min.
> >>
> >> Andy, are you sure the 3-min outage was after the link went hard down
> >> and not just prior?  Guessing this is easy to find thanks to the
> >> timestamps.  I've seen this due to line protocol going down and
> >> everything blackholing until BGP fails, but never a 3-min wait for a
> >> route withdraw (unless this is a peering router with dozens of peers and
> >> full routes on each).
> >>
> >> Hope you find the root cause.
> >>
> >> --
> >> Payam Chychi
> >> Network Engineer / Security Specialist
> >>
> >> On Thursday, March 13, 2014 at 4:50 PM, Andy Litzinger wrote:
> >>
> >> Hi Chris,
> >> Yes, I am taking full routes from this neighbor.
> >>
> >> Is there any way to reduce the time it takes to handle the updates?  If
> >> I wanted to test this behavior in my lab, what would I want to watch
> >> (logs/traceoptions)?
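> >>
> >> (For the lab test, I'm guessing something like this would capture the
> >> interesting events; the trace file name is arbitrary:)
> >>
> >>     protocols {
> >>         bgp {
> >>             traceoptions {
> >>                 file bgp-trace size 10m files 5;
> >>                 flag state detail;     # neighbor FSM transitions
> >>                 flag update detail;    # advertise/withdraw churn
> >>             }
> >>         }
> >>     }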
> >>
> >> I don't think I've seen this behavior during scheduled maintenance, for
> >> example during times when I've deactivated the neighbor config.  Am I
> >> correct in thinking this is because, in that scenario, even though the
> >> RE takes a while to remove the routes from the FIB, the actual next-hop
> >> router is still available and thus the routes are still valid?
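> >>
> >> (By "deactivated the neighbor config" I mean a soft takedown along these
> >> lines, as opposed to the link physically dropping; the group name and
> >> address are made up:)
> >>
> >>     deactivate protocols bgp group ISP-A neighbor 203.0.113.1
> >>     commit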
> >>
> >> -andy
> >>
> >>
> >> On Thu, Mar 13, 2014 at 3:54 PM, Chris Adams <cma at cmadams.net> wrote:
> >>
> >> Once upon a time, Andy Litzinger <andy.litzinger.lists at gmail.com> said:
> >>
> >> What surprised me is that it looks like routes toward that provider were
> >> not immediately removed from my routing table.  Instead I see evidence of
> >> blackholing for almost 3 minutes.
> >>
> >>
> >> Are you taking full routes from this neighbor? It takes a while for the
> >> routing engine to send updates to remove/replace 200k+ prefixes to the
> >> forwarding engine.
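> >>
> >> If you want to watch that happen, something like the following gives a
> >> rough view (note that "show krt queue" is a hidden command on at least
> >> some releases, so treat it accordingly):
> >>
> >>     show route summary                   # RIB prefix counts per table
> >>     show route forwarding-table summary  # what the PFE actually holds
> >>     show krt queue                       # pending RE-to-kernel updates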
> >>
> >> --
> >> Chris Adams <cma at cmadams.net>