[j-nsp] RPD Crash on M320

Alireza Soltanian soltanian at gmail.com
Mon Jan 4 10:17:49 EST 2016


Just asking. Anyway, any idea about my comments? Also, is there any mechanism
or approach for dealing with these kinds of situations?
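One common approach to limiting the impact of a flapping link is interface hold-time damping, so that brief transitions are suppressed before rpd and the IGP/LDP have to react. A minimal sketch (the interface name and timer values are placeholders to be tuned per environment):

```
interfaces {
    ge-0/0/0 {
        /* Values are in milliseconds: delay reporting "up" by 2s
           and "down" by 5s so short flaps never reach rpd. */
        hold-time up 2000 down 5000;
    }
}
```

The trade-off is slower convergence on a genuine failure, so the down timer should stay well below any application-level timeout.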
On Jan 4, 2016 6:45 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:

> Reading the core dump is beyond my expertise I’m afraid.
>
>
>
> Br,
>
> Niall
>
>
>
> *From:* Alireza Soltanian [mailto:soltanian at gmail.com]
> *Sent:* 04 January 2016 15:14
> *To:* Niall Donaghy
> *Cc:* juniper-nsp at puck.nether.net
> *Subject:* RE: [j-nsp] RPD Crash on M320
>
>
>
> Hi
> Yes, I checked the CPU graph and there was a spike in CPU load.
> The link was flapping for 20 minutes before the crash, and it kept flapping
> for two hours afterwards. During this time we can see LDP sessions going up
> and down over and over, yet the only crash happened at this one moment, and
> there was no CPU spike at that time.
> I must mention we had another issue with another M320: whenever a link
> flapped, rpd CPU usage went high and all OSPF sessions reset. I found the
> root cause for that: it was LDP traceoptions. On this box we don't use
> traceoptions.
> Is there any way to read the dump?
>
> Thank you
>
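Given that LDP traceoptions were the culprit on the other M320, it may be worth double-checking they are truly absent (not merely assumed absent) on this box. A sketch of the relevant commands; file names shown in any existing config would be site-specific:

```
show configuration protocols ldp traceoptions   # operational mode: empty output means none configured

configure                                       # if traceoptions are present:
deactivate protocols ldp traceoptions           # keep the stanza but disable it, or
delete protocols ldp traceoptions               # remove it entirely
commit
```

Deactivating rather than deleting preserves the stanza for quick re-enabling during a maintenance window.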
> On Jan 4, 2016 6:34 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:
>
> Hi Alireza,
>
> It seemed to me this event could be related to the core dump:
>
> Jan  3 00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL
> ACK message on non-active socket w/handle 0x10046fa0000004e
>
> However, upon further investigation
> (http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see these
> messages are normal/harmless.
>
> Do you have Cacti graphs of CPU utilisation for both REs from before the rpd
> crash? Link flapping may be giving rise to CPU hogging, leading to
> instability and a subsequent rpd crash.
> Was the link particularly flappy just before the crash?
>
> Kind regards,
> Niall
>
>
>
>
> > -----Original Message-----
> > From: juniper-nsp [mailto:juniper-nsp-bounces at puck.nether.net] On Behalf
> > Of Alireza Soltanian
> > Sent: 04 January 2016 11:04
> > To: juniper-nsp at puck.nether.net
> > Subject: [j-nsp] RPD Crash on M320
> >
> > Hi everybody
> >
> > Recently, we had continuous link flaps between our M320 and remote sites.
> > We have a lot of L2Circuits between these sites on our M320. At one point
> > the rpd process crashed, which led to the following log. I must mention the
> > link flaps started at 12:10AM and continued until 2:30AM, but the crash
> > occurred at 12:30AM.
> >
> >
> >
> > Jan  3 00:31:04  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.168 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.254.1 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.120 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL
> > ACK message on non-active socket w/handle 0x1008af8000001c6
> >
> > Jan  3 00:31:06  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.192 is down, reason: received notification from peer
> >
> > Jan  3 00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL
> > ACK message on non-active socket w/handle 0x10046fa0000004e
> >
> >
> >
> > Jan  3 00:32:18  apa-rtr-028 init: routing (PID 42128) terminated by
> > signal number 6. Core dumped!
> >
> > Jan  3 00:32:18  apa-rtr-028 init: routing (PID 18307) started
> >
> > Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
> > primary
> >
> > Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2VPN acquiring mastership for
> > primary
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_KRT_KERNEL_BAD_ROUTE: KRT:
> > lost ifl 0 for route (null)
> >
> > Jan  3 00:32:20  apa-rtr-028 last message repeated 65 times
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
> > primary
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: Primary starts deleting all
> > L2circuit IFL Repository
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_TASK_BEGIN: Commencing routing
> > updates, version 11.2R2.4, built 2011-09-01 06:53:31 UTC by builder
> >
> >
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1329, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1041
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1311, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1039
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1312, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1038
> >
> >
> >
> > The thing is, we always see this kind of log (apart from the crash) on the
> > device. Is there any clue as to why the rpd process crashed? I don't have
> > access to JTAC, so I cannot have the dump analyzed.
> >
> > The Junos version is 11.2R2.4.
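Without JTAC, a first-pass look at the core is still possible if gdb is shipped on the RE (it was on many FreeBSD-based Junos builds of this era, but not all). A sketch; the core file name and path are the usual defaults and may differ:

```
> show system core-dumps                  # operational mode: locate the core, typically under /var/tmp
> start shell                             # drop to the FreeBSD shell
% gunzip /var/tmp/rpd.core.0.gz          # cores are usually stored compressed
% gdb /usr/sbin/rpd /var/tmp/rpd.core.0  # open the core against the rpd binary
(gdb) bt                                 # print the backtrace of the crashing thread
```

Even without rpd debug symbols, the function names in the backtrace are often enough to match the crash against a known PR in Juniper's public knowledge base.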
> >
> >
> >
> > Thank you for your help and support
> >
> > _______________________________________________
> > juniper-nsp mailing list juniper-nsp at puck.nether.net
> > https://puck.nether.net/mailman/listinfo/juniper-nsp
>
>

