[j-nsp] RPD Crash on M320

Wojciech Janiszewski wojciech.janiszewski at gmail.com
Mon Jan 4 12:31:31 EST 2016


Hi,

11.2 is end of support, so my guess is there's no point in raising a
case. As a first step I'd try upgrading to a supported release and then
check whether that helps.
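
For example (the package name below is a placeholder, not a recommendation;
pick whatever supported image Juniper lists for the M320 and check the
upgrade path from 11.2 first):

    user@router> show version
    user@router> request system software add /var/tmp/<jinstall-package>.tgz validate
    user@router> request system reboot

Untested sketch, but that is the general shape of it.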

Regards,
Wojciech
On 4 Jan 2016 17:02, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:

>
>
> From your comments I understand there was no CPU spike, and traceoptions
> aren’t the cause either.
>
> By this point* I would have raised a JTAC case for analysis of the core
> dump, and taken their lead.
>
>
>
> * assuming you’ve checked all sources of information and found no clues as
> to the cause, i.e. logfile analysis, resource-exhaustion checks, analysis
> of config (e.g. are you using suspected buggy features, or anything
> non-standard/complex/advanced)?
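>
> To locate the dump and gather the data JTAC would ask for, something like
> this should do (standard commands, but verify on your release):
>
>     user@router> show system core-dumps
>     user@router> file list /var/crash detail
>     user@router> request support information | save /var/tmp/rsi.txt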
>
>
>
> We are running 14.1R5.5 on MX series and have lots of features turned on,
> and several workarounds in place. We have found a few bugs for JNPR...
>
>
>
> Kind regards,
>
> Niall
>
>
>
> From: Alireza Soltanian [mailto:soltanian at gmail.com]
> Sent: 04 January 2016 15:18
> To: Niall Donaghy
> Cc: juniper-nsp at puck.nether.net
> Subject: RE: [j-nsp] RPD Crash on M320
>
>
>
> Just asking. Anyway, any idea about my comments? Also, is there any
> mechanism or approach for dealing with this kind of situation?
>
> On Jan 4, 2016 6:45 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:
>
> Reading the core dump is beyond my expertise, I’m afraid.
>
>
>
> Br,
>
> Niall
>
>
>
> From: Alireza Soltanian [mailto:soltanian at gmail.com]
> Sent: 04 January 2016 15:14
> To: Niall Donaghy
> Cc: juniper-nsp at puck.nether.net
> Subject: RE: [j-nsp] RPD Crash on M320
>
>
>
> Hi
> Yes, I checked the CPU graph; there was no spike in CPU load at the time
> of the crash.
> The link was flapping for 20 minutes before the crash, and it remained
> flappy for two hours afterwards. During this time we could see LDP
> sessions go up and down over and over, but the only crash was at that one
> point, and there was no CPU spike then.
> I must mention we had another issue with another M320: whenever a link
> flapped, RPD CPU went high and all OSPF sessions reset. I found the root
> cause for that: it was LDP traceoptions. On this box we don’t use
> traceoptions.
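>
> (For anyone hitting the same thing, checking for it and backing it out is
> just the standard config commands:
>
>     user@router> show configuration protocols ldp traceoptions
>     user@router# deactivate protocols ldp traceoptions
>     user@router# commit
>
> )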
> Is there any way to read the dump?
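> Would something like the following work from the shell? I have not tried
> it, and I am not sure gdb even ships on the box in 11.2, so copying the
> core to a FreeBSD host with gdb may be needed:
>
>     % cd /var/crash          (or wherever "show system core-dumps" points)
>     % gunzip rpd.core.0.gz
>     % gdb /usr/sbin/rpd rpd.core.0
>     (gdb) bt
>
> Without Juniper's symbol files I suppose the backtrace may not say much.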
>
> Thank you
>
> On Jan 4, 2016 6:34 PM, "Niall Donaghy" <Niall.Donaghy at geant.org> wrote:
>
> Hi Alireza,
>
> It seemed to me this event could be related to the core dump: Jan  3
> 00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received PRL ACK
> message on non-active socket w/handle 0x10046fa0000004e
> However, upon further investigation
> (http://kb.juniper.net/InfoCenter/index?page=content&id=KB18195) I see
> these messages are normal/harmless.
>
> Do you have Cacti graphs of CPU utilisation for both REs, before the rpd
> crash? Link flapping may be giving rise to CPU hogging, leading to
> instability and subsequent rpd crash.
> Was the link particularly flappy just before the crash?
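>
> If you do not have graphs, spot checks like these would at least show
> where rpd sits now:
>
>     user@router> show chassis routing-engine
>     user@router> show system processes extensive | match rpd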
>
> Kind regards,
> Niall
>
>
>
>
> > -----Original Message-----
> > From: juniper-nsp [mailto:juniper-nsp-bounces at puck.nether.net] On
> > Behalf Of
> > Alireza Soltanian
> > Sent: 04 January 2016 11:04
> > To: juniper-nsp at puck.nether.net
> > Subject: [j-nsp] RPD Crash on M320
> >
> > Hi everybody
> >
> > Recently, we had continuous link flaps between our M320 and remote
> > sites. We have a lot of L2Circuits between these sites on our M320. At
> > one point the RPD process crashed, which led to the following log. I
> > must mention that the link flaps started at 12:10 AM and continued until
> > 2:30 AM, but the crash occurred at 12:30 AM.
> >
> >
> >
> > Jan  3 00:31:04  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.168 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.254.1 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.120 is down, reason: received notification from peer
> >
> > Jan  3 00:31:05  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received
> > PRL ACK message on non-active socket w/handle 0x1008af8000001c6
> >
> > Jan  3 00:31:06  apa-rtr-028 rpd[42128]: RPD_LDP_SESSIONDOWN: LDP session
> > 10.237.253.192 is down, reason: received notification from peer
> >
> > Jan  3 00:31:28  apa-rtr-028 /kernel: jsr_prl_recv_ack_msg(): received
> > PRL ACK message on non-active socket w/handle 0x10046fa0000004e
> >
> >
> >
> > Jan  3 00:32:18  apa-rtr-028 init: routing (PID 42128) terminated by
> > signal number 6. Core dumped!
> >
> > Jan  3 00:32:18  apa-rtr-028 init: routing (PID 18307) started
> >
> > Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
> > primary
> >
> > Jan  3 00:32:18  apa-rtr-028 rpd[18307]: L2VPN acquiring mastership for
> > primary
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_KRT_KERNEL_BAD_ROUTE: KRT:
> > lost ifl 0 for route (null)
> >
> > Jan  3 00:32:20  apa-rtr-028 last message repeated 65 times
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: L2CKT acquiring mastership for
> > primary
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: Primary starts deleting all
> > L2circuit IFL Repository
> >
> > Jan  3 00:32:20  apa-rtr-028 rpd[18307]: RPD_TASK_BEGIN: Commencing
> > routing updates, version 11.2R2.4, built 2011-09-01 06:53:31 UTC by builder
> >
> >
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1329, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1041
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1311, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1039
> >
> > Jan  3 00:32:21  apa-rtr-028 mib2d[33413]: SNMP_TRAP_LINK_DOWN: ifIndex
> > 1312, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1.1038
> >
> >
> >
> > The thing is, we always see this kind of log (apart from the crash) on
> > the device. Is there any clue as to why the RPD process crashed? I don't
> > have access to JTAC, so I cannot have the dump analyzed.
> >
> > The Junos version is 11.2R2.4
> >
> >
> >
> > Thank you for your help and support
> >
> > _______________________________________________
> > juniper-nsp mailing list juniper-nsp at puck.nether.net
> > https://puck.nether.net/mailman/listinfo/juniper-nsp
>
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp

