[j-nsp] route BGP stall bug

Tim Vollebregt tim at interworx.nl
Tue Jul 17 18:03:39 EDT 2012


Hi All,

This morning during a maintenance I experienced the route stall bug Richard mentioned a few times already on j-nsp.

Hardware kit:
-MX480 with SCB (non-e)
-2 x RE-S-1800x4
-4 x MPC 3D 16x 10GE
Software version: 10.4R8.5
During this maintenance I was placing 2 new routing engines into the router, replacing the 'old' RE-S-2000. This router is pushing a lot of traffic and receiving 14 x full BGP tables from eBGP peers/1 RR session to it's 'mate'/several iBGP peers with partial tables

After replacing the RE's the FPC's initialized and BGP sessions were being established it took quite some time before the RIB was completely filled. After checking some hosts I came to the conclusion that there were unreachable destinations however the RIB was looking fine.

When checking the FIB by issuing command: show route forwarding-table summary I saw that there were only 11K prefixes pushed to the FIB and it was hanging.
As I was aware of the bug I waited for some time. And it eventually took about 30 minutes to fill the FIB with 414K prefixes. During these 30 minutes a lot of destinations were unreachable and traffic was being blackholed as exchanging RIB with peers was fine.

As there was still some time left in the maintenance window and I really wanted to have some workaround for dealing with this bug I did the following.
I deactivated all eBGP peer groups and did a switchover to the other routing engine. When the PFC's were initialized the router started building it's iBGP sessions towards the core routers, and it's RR session (full table).

This worked out quite well, the FIB was being filled with the full table within 5 minutes. Afterwards I activated all eBGP peergroups again and monitored the FIB, eventually it took about 30 minutes to fill the FIB with the correct next-hops. But this time the blackholing was just for a limited amount of time.

It seems this bug is there since release 10.0 (MPC), and there doesn't seem to be a fix yet. Does anyone have more information about it, PR number etc?

IMHO this is a really bad one, and can be a showstopper in some cases.

Thanks for your time.

BR, Tim








More information about the juniper-nsp mailing list