[j-nsp] route BGP stall bug

Wed Jul 18 05:35:49 EDT 2012

Hi,

Were any suspicious messages logged at that moment?

There are some PRs related to a stuck KRT queue, so you probably want to 
upgrade to 10.4R10 or investigate this issue with JTAC.

https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR722890
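
If it happens again, it may be worth capturing the KRT state while the FIB
is lagging. A rough sketch of the operational commands I would look at
(the prompt and hostname are placeholders, and this is not taken from the
PR text):

```
user@mx480> show krt queue
user@mx480> show krt state
user@mx480> show route forwarding-table summary
```

Queues in "show krt queue" that stay non-empty while the prefix count in
the forwarding-table summary barely moves would point to a stuck KRT
queue rather than a BGP problem, and that output is useful to attach to a
JTAC case.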

On 18.07.2012 2:03, Tim Vollebregt wrote:
> Hi All,
>
> This morning, during a maintenance window, I experienced the route stall bug that Richard has mentioned a few times already on j-nsp.
>
> Hardware kit:
> -MX480 with SCB (non-e)
> -2 x RE-S-1800x4
> -4 x MPC 3D 16x 10GE
> Software version: 10.4R8.5
> During this maintenance I was installing 2 new routing engines in the router, replacing the 'old' RE-S-2000. This router is pushing a lot of traffic and receives 14 full BGP tables from eBGP peers, has 1 RR session to its 'mate', and has several iBGP peers with partial tables.
>
> After replacing the REs, the FPCs initialized and the BGP sessions were established, but it took quite some time before the RIB was completely filled. After checking some hosts I concluded that some destinations were unreachable, even though the RIB looked fine.
>
> When checking the FIB with the command "show route forwarding-table summary", I saw that only 11K prefixes had been pushed to the FIB and that it was hanging.
> As I was aware of the bug, I waited for some time. It eventually took about 30 minutes to fill the FIB with 414K prefixes. During these 30 minutes a lot of destinations were unreachable and traffic was being blackholed, because routes were still being exchanged with peers as normal.
>
> As there was still some time left in the maintenance window and I really wanted a workaround for this bug, I did the following.
> I deactivated all eBGP peer groups and did a switchover to the other routing engine. Once the FPCs were initialized, the router started building its iBGP sessions towards the core routers, and its RR session (full table).
>
> This worked out quite well: the FIB was filled with the full table within 5 minutes. Afterwards I activated all eBGP peer groups again and monitored the FIB. It eventually took about 30 minutes to fill the FIB with the correct next-hops, but this time the blackholing lasted only a limited amount of time.
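>
> In Junos CLI terms the sequence looks roughly like this. It is a sketch, not a captured session: EBGP-PEERS is a placeholder group name (repeat the line for each eBGP group), the hostname is made up, and "commit synchronize" assumes both REs should carry the same configuration.
>
> ```
> user@mx480> configure
> user@mx480# deactivate protocols bgp group EBGP-PEERS
> user@mx480# commit synchronize
> user@mx480# exit
> user@mx480> request chassis routing-engine master switch
> ```
>
> Then, on the new master, once the forwarding table holds the full table learned via iBGP/RR:
>
> ```
> user@mx480> show route forwarding-table summary
> user@mx480> configure
> user@mx480# activate protocols bgp group EBGP-PEERS
> user@mx480# commit synchronize
> ```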
>
> It seems this bug has been present since release 10.0 (MPC), and there doesn't seem to be a fix yet. Does anyone have more information about it, such as a PR number?
>
> IMHO this is a really bad one, and it can be a showstopper in some cases.
>
> Thanks for your time.