[j-nsp] RPD Crash on M320
Niall Donaghy
Niall.Donaghy at geant.org
Mon Jan 4 11:01:26 EST 2016
From your comments I understand there was no CPU spike, and traceoptions aren’t the cause either.
By this point* I would have raised a JTAC case for analysis of the core dump, and taken their lead.
* Assuming you’ve checked all sources of information and found no clues as to the cause, i.e. logfile analysis, resource-exhaustion checks, and analysis of the config (e.g. are you using suspected buggy features, or anything non-standard/complex/advanced?).
We are running 14.1R5.5 on the MX series with lots of features turned on and several workarounds in place. We have found a few bugs for JNPR...
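A minimal check-list along those lines, in operational mode (a sketch, not an exhaustive list; the match strings are just examples):

  show system core-dumps                        # any rpd cores in /var/crash or /var/tmp?
  show log messages | match rpd                 # recent syslog entries from/about rpd
  show system processes extensive | match rpd   # rpd CPU and memory at a glance
  show task memory detail                       # rpd's own memory accounting
  show system virtual-memory                    # RE/kernel memory exhaustion check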
Kind regards,
Niall
From: Alireza Soltanian [mailto:soltanian at gmail.com]
Sent: 04 January 2016 15:18
To: Niall Donaghy
Cc: juniper-nsp at puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320
Just asking. Anyway, any idea about my comments? Also, is there any mechanism or approach for dealing with these kinds of situations?
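One damping mechanism worth knowing about, assuming the flaps are physical-port transitions (the interface name below is hypothetical): an interface hold-time delays reporting link-up, so a bouncing port doesn’t churn LDP/IGP on every transition.

  set interfaces ge-1/0/0 hold-time up 5000 down 0   # require 5s of stable link before declaring up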
On Jan 4, 2016 6:45 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:
Reading the core dump is beyond my expertise, I’m afraid.
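That said, even without decoding it you can confirm the core exists and gather the standard bundle any analyst would want; a sketch, with illustrative file names:

  show system core-dumps                               # lists cores, e.g. /var/crash/rpd.core.0.gz
  file list /var/crash detail                          # timestamps and sizes of the core files
  request support information | save /var/tmp/rsi.txt  # the diagnostic bundle JTAC usually requests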
Br,
Niall
From: Alireza Soltanian [mailto:soltanian at gmail.com]
Sent: 04 January 2016 15:14
To: Niall Donaghy
Cc: juniper-nsp at puck.nether.net
Subject: RE: [j-nsp] RPD Crash on M320
Hi
Yes, I checked the CPU graph and there was no spike in CPU load.
The link was flapping for 20 minutes before the crash, and it remained flappy for two hours afterwards. During this time we could see LDP sessions go up and down over and over. But the crash happened only that one time, and there was no spike in CPU.
I must mention we had another issue with another M320: whenever a link flapped, RPD CPU went high and all OSPF sessions reset. I found the root cause for that: it was traceoptions for LDP. On this box we don’t use traceoptions.
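For reference, verifying that takes one command, and if tracing were configured, deactivating is safer than deleting (a sketch):

  show configuration protocols ldp traceoptions   # empty output = no LDP tracing configured

  # in configuration mode, only if the stanza exists:
  deactivate protocols ldp traceoptions
  commit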
Is there any way to read the dump?
Thank you
On Jan 4, 2016 6:34 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:
Hi Alireza,
It seemed to me this event could be related to the core dump:

Jan 3 00:31:28 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK message on non-active socket w/handle 0x10046fa0000004e

However, upon further investigation (http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see these messages are normal/harmless.
Do you have Cacti graphs of CPU utilisation for both REs from before the rpd crash? Link flapping may be giving rise to CPU hogging, leading to instability and a subsequent rpd crash.
Was the link particularly flappy just before the crash?
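If the Cacti data is patchy, the same numbers are available live from the CLI, or by polling jnxOperatingCPU from the standard JUNIPER-MIB (the community string and hostname below are placeholders):

  show chassis routing-engine                     # 5-sec/1-min/15-min CPU for both REs
  show system processes extensive | match rpd     # rpd's share of RE CPU

  snmpwalk -v2c -c public router.example.net 1.3.6.1.4.1.2636.3.1.13.1.8   # jnxOperatingCPU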
Kind regards,
Niall
> -----Original Message-----
> From: juniper-nsp [mailto:juniper-nsp-bounces at puck.nether.net] On Behalf Of
> Alireza Soltanian
> Sent: 04 January 2016 11:04
> To: juniper-nsp at puck.nether.net
> Subject: [j-nsp] RPD Crash on M320
>
> Hi everybody
>
> Recently, we had continuous link flaps between our M320 and remote sites. We
> have a lot of L2Circuits between these sites on our M320. At one point we had
> a crash of the RPD process, which led to the following log. I must mention the
> link flap started at 12:10AM and continued until 2:30AM, but the crash
> occurred at 12:30AM.
>
>
>
> Jan 3 00:31:04 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> 10.237.253.168 is down, reason: received notification from peer
>
> Jan 3 00:31:05 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> 10.237.254.1 is down, reason: received notification from peer
>
> Jan 3 00:31:05 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> 10.237.253.120 is down, reason: received notification from peer
>
> Jan 3 00:31:05 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK
> message on non-active socket w/handle 0x1008af8000001c6
>
> Jan 3 00:31:06 apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> 10.237.253.192 is down, reason: received notification from peer
>
> Jan 3 00:31:28 apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK
> message on non-active socket w/handle 0x10046fa0000004e
>
>
>
> Jan 3 00:32:18 apa-rtr-028 init: routing (PID 42128) terminated by signal
> number 6. Core dumped!
>
> Jan 3 00:32:18 apa-rtr-028 init: routing (PID 18307) started
>
> Jan 3 00:32:18 apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for primary
>
> Jan 3 00:32:18 apa-rtr-028 rpd[18307]: L2VPN acquiring mastership for primary
>
> Jan 3 00:32:20 apa-rtr-028 rpd[18307]: RPD_KRT_KERNEL_BAD_ROUTE: KRT:
> lost ifl 0 for route (null)
>
> Jan 3 00:32:20 apa-rtr-028 last message repeated 65 times
>
> Jan 3 00:32:20 apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for primary
>
> Jan 3 00:32:20 apa-rtr-028 rpd[18307]: Primary starts deleting all L2circuit
> IFL Repository
>
> Jan 3 00:32:20 apa-rtr-028 rpd[18307]: RPD_TASK_BEGIN: Commencing routing
> updates, version 11.2R2.4, built 2011-09-01 06:53:31 UTC by builder
>
>
>
> Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> 1329, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1041
>
> Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> 1311, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1039
>
> Jan 3 00:32:21 apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> 1312, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1038
>
>
>
> The case is, we always have this kind of log (except the crash) on the
> device. Is there any clue why the RPD process crashed? I don't have access
> to JTAC, so I cannot analyze the dump.
>
> The JunOS version is: 11.2R2.4
>
>
>
> Thank you for your help and support
>
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp