[j-nsp] route BGP stall bug

Mon Jul 23 03:06:45 EDT 2012

Hi Tim,

Do you happen to have port mirroring/sampling enabled on the router?
We encountered a similar issue. JTAC found that the sampled process was
causing this behavior, and that it is fixed in 11.4R4. (We have not upgraded
yet to test this in our environment, and the fix does not appear in the
release notes, but the JTAC engineer said it is solved.)
The relevant PR is PR726841. It contains the details of our specific test
case, but the issue is (according to JTAC) "Sampled being the slow daemon
lead to the slow operation of route updation from KRT --- PFE and KRT was
stuck for some time."

Regards,
Ido

-----Original Message-----
From: juniper-nsp-bounces at puck.nether.net
[mailto:juniper-nsp-bounces at puck.nether.net] On Behalf Of Tim Vollebregt
Sent: Wednesday, July 18, 2012 1:04 AM
To: Juniper-NSP
Subject: [j-nsp] route BGP stall bug

Hi All,

This morning, during a maintenance window, I experienced the route stall bug
Richard has mentioned a few times already on j-nsp.

Hardware kit:
-MX480 with SCB (non-e)
-2 x RE-S-1800x4
-4 x MPC 3D 16x 10GE
Software version: 10.4R8.5

During this maintenance I was installing 2 new routing engines in the router,
replacing the 'old' RE-S-2000. This router pushes a lot of traffic and
receives 14 full BGP tables from eBGP peers, has 1 RR session to its 'mate',
and has several iBGP peers with partial tables.

After the REs were replaced, the FPCs initialized and the BGP sessions
started to establish. It took quite some time before the RIB was completely
filled. After checking some hosts I concluded that some destinations were
unreachable, even though the RIB looked fine.

When I checked the FIB with the command show route forwarding-table summary,
I saw that only 11K prefixes had been pushed to the FIB and that it was
hanging.
As I was aware of the bug I waited for some time, and it eventually took
about 30 minutes to fill the FIB with 414K prefixes. During these 30 minutes
a lot of destinations were unreachable and traffic was being blackholed,
because routes were still being exchanged with peers normally while the FIB
was incomplete.
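
A minimal sketch of the check described above, in Python: it compares the
number of prefixes installed in the FIB with the number expected and reports
how far the install has progressed. The function names and the 99% threshold
are illustrative assumptions, not anything from JTAC or Junos. How the counts
are collected from the router (for example by reading the output of
show route forwarding-table summary) is deliberately left out.

```python
def fib_fill_ratio(fib_prefixes: int, expected_prefixes: int) -> float:
    """Return the fraction of the expected prefixes present in the FIB."""
    if expected_prefixes <= 0:
        raise ValueError("expected_prefixes must be positive")
    return fib_prefixes / expected_prefixes


def fib_is_incomplete(fib_prefixes: int, expected_prefixes: int,
                      threshold: float = 0.99) -> bool:
    """True while the FIB holds less than `threshold` of the expected prefixes.

    The 0.99 default is an arbitrary illustrative value.
    """
    return fib_fill_ratio(fib_prefixes, expected_prefixes) < threshold


if __name__ == "__main__":
    # Figures from this post: 11K prefixes in the FIB, 414K once it was full.
    ratio = fib_fill_ratio(11_000, 414_000)
    print(f"FIB fill: {ratio:.1%}")  # prints "FIB fill: 2.7%"
    print("incomplete:", fib_is_incomplete(11_000, 414_000))
```

Polling a check like this during a maintenance window would show whether the
FIB is still catching up before traffic is moved back onto the router.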

As there was still some time left in the maintenance window and I really
wanted to have some workaround for dealing with this bug I did the
following.
I deactivated all eBGP peer groups and did a switchover to the other routing
engine. When the FPCs were initialized the router started building its
iBGP sessions towards the core routers, and its RR session (full table).

This worked out quite well: the FIB was filled with the full table
within 5 minutes. Afterwards I activated all eBGP peer groups again and
monitored the FIB. It again took about 30 minutes to fill the FIB with
the correct next-hops, but this time the blackholing lasted only a limited
amount of time.
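
The order of the workaround can be sketched as a small Python helper that
builds the CLI steps for a list of eBGP group names. The exact commands used
are not given in this post, so the command strings below are an assumption
about how the steps would typically be entered on Junos, and the group names
are hypothetical.

```python
def workaround_commands(ebgp_groups):
    """Build the workaround steps, in order, for the given eBGP group names.

    The command strings are an assumed rendering of the steps described in
    the post; check them against your own configuration before use.
    """
    deactivate = [f"deactivate protocols bgp group {g}" for g in ebgp_groups]
    activate = [f"activate protocols bgp group {g}" for g in ebgp_groups]
    return {
        # 1. Take the eBGP groups out of service on both REs.
        "before_switchover": ["configure", *deactivate,
                              "commit synchronize and-quit"],
        # 2. Switch mastership to the other routing engine.
        "switchover": ["request chassis routing-engine master switch"],
        # 3. Once the FIB holds the full table learned over iBGP/RR,
        #    bring the eBGP groups back.
        "after_fib_filled": ["configure", *activate,
                             "commit synchronize and-quit"],
    }


if __name__ == "__main__":
    # "TRANSIT" and "PEERING" are hypothetical group names.
    for phase, commands in workaround_commands(["TRANSIT", "PEERING"]).items():
        print(f"# {phase}")
        for command in commands:
            print(command)
```

The point of the ordering is that the FIB is first filled from the iBGP and
RR sessions, so the later eBGP convergence only changes next-hops instead of
leaving destinations without a forwarding entry.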

It seems this bug has been present since release 10.0 (MPC), and there
doesn't seem to be a fix yet. Does anyone have more information about it,
such as a PR number?

IMHO this is a really bad one, and can be a showstopper in some cases.

Thanks for your time.

BR, Tim

_______________________________________________
juniper-nsp mailing list juniper-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp