[j-nsp] MPC4D-32*GE Major Alarms

Tue Feb 16 04:31:54 EST 2016

Hello Steven,

Thank you for answering.

To begin with, we already replaced the MPC. The alarm still gets raised at
random intervals. Besides the alarm, there are no preceding or trailing
errors. There's another MX960 configured exactly alike, serving the second
half of our end users, which has never had such errors. The alarm, however,
seems to cut the cumulative traffic passing through that MPC by half.
Traffic levels are restored by restarting that MPC. So I seriously doubt
that's a s/w issue.

As I said, I am neither against nor for adding another case for JTAC to deal
with. Obviously, there might be something new and complicated going on out
there.

As for the documentation, let's begin with a knowledge base article
outlining the initial steps for troubleshooting alarms on MX. I'd like
to read that one, to begin with.

Thank you.
On 16 Feb 2016 10:15 a.m., "Steven Wong" <swong at juniper.net> wrote:

> Hi Alex,
>
> When there is an alarm raised on the FPC, there must be an issue. If the
> issue is an obvious one, for example, sensor failure, you can always see
> that from the messages log and that should be good enough to determine if
> the board should be replaced or not. Unfortunately, there are some non
> obvious cases. Taking the MX as an example, we do have some monitor logic
> in the box to monitor different parts of the FPC. If something goes wrong,
> for example, an ASIC status doesn’t look good, we will raise an alarm to
> alert the customer as that might have impact on the traffic. However, there
> is no way for us to tell the customer what to do as the “abnormal state” is
> just a result of something bad - either sw bug or a hw failure. That’s why
> we need to get a case opened for JTAC to analyze what’s wrong. If you
> replace the board, most of the time, you can fix the problem for sure.
> However, as you could expect, if the issue was indeed caused by a sw bug,
> it might just come again. That’s why Diogo suggested a JTAC case to check
> that up.
>
> If you see that our documentation is not good enough to tell you what to
> do when you see an alarm, please do point it out and let us know. We will
> try to improve that.
>
> Thanks,
> Steven
>
>
>
> On 16/2/16 15:59, "juniper-nsp on behalf of Alex K." <
> juniper-nsp-bounces at puck.nether.net on behalf of nsp.lists at gmail.com>
> wrote:
>
> >Hello Diogo,
> >
> >Thank you for answering. Unfortunately, in my humble opinion, Juniper has
> >no clear procedure for us to follow.
> >
> >The cumulative effect of all the tests we ran, and those you and others
> >courteously pointed out, is basically none. This, in my opinion, is due to
> >the very fact that Juniper has no clear published procedure for dealing
> >with these kinds of errors. Yes, there are some procedures out there for
> >dealing with clearly defined hardware errors, but those are few. Recently,
> >I was dealing with some hardware-related issues on some Cisco gear, and I
> >can now clearly see the abundance of documented hardware show commands and
> >such on Cisco's side and the lack of them on Juniper's.
> >
> >I may be wrong, but it seems to me that Juniper would like me to add a
> >case to their JTAC pile for every issue. Never mind the fact that we all
> >have replacement stock available and could replace every part ourselves,
> >given a way to recognize the faulty part.
> >
> >I have nothing for or against opening a case with JTAC, except that it has
> >proven to be a relic of the past. Many other vendors recognized a long time
> >ago that there's a professional workforce out there, and it works quite
> >well. In fact, I can hardly remember a case I was forced to open with Cisco
> >because I wasn't sure which hardware part needed replacement. And this is
> >despite the fact that we have much more Cisco gear.
> >
> >Anyhow, I'll welcome any additional ideas from everyone.
> >
> >Thank you.
> >On 14 Feb 2016 11:04 a.m., "Diogo Montagner" <diogo.montagner at gmail.com>
> >wrote:
> >
> >> That should give you some indication of which subsystem is having a
> >> problem.
> >>
> >> Also, check whether any core dumps were generated for the FPC.
> >>
> >> Without additional information, it will be very hard to pinpoint where
> >> to look.
> >>
> >> On Sunday, 14 February 2016, Alex K. <nsp.lists at gmail.com> wrote:
> >>
> >>> Hello Diogo,
> >>>
> >>> I'm currently not on site, so I'll definitely try it when I get there.
> >>> For now I'm considering a plan of action. What should I look for in
> >>> that command's output?
> >>>
> >>> Thank you.
> >>> On 14 Feb 2016 10:00, "Diogo Montagner" <diogo.montagner at gmail.com>
> >>> wrote:
> >>>
> >>>> Alex,
> >>>>
> >>>> What do you see in the output of show nvram at the FPC shell?
> >>>>
> >>>> Do you have a case open with JTAC?
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Sunday, 14 February 2016, Alex K. <nsp.lists at gmail.com> wrote:
> >>>>
> >>>>> Hello everyone,
> >>>>>
> >>>>> For some time now, one of my customers has been getting "major
> >>>>> alarms" from the MPC mentioned above on one of their MX960s.
> >>>>>
> >>>>> The issue is that nothing more than that message (+alarm) seems to be
> >>>>> present. Nothing precedes that error, neither in "log messages" nor
> >>>>> in "chassisd". There seems to be an output rate drop from the time of
> >>>>> those incidents until the MPC gets restarted (by the appropriate
> >>>>> network team), and then everything gets back to normal.
> >>>>>
> >>>>> It's worth mentioning that they have a second MX960 serving the other
> >>>>> half of their end users, configured exactly the same, which has never
> >>>>> had that issue (therefore it's probably not traffic related).
> >>>>>
> >>>>> They are running 12.3R6.6. The linecard was already replaced. There
> >>>>> seem to be no trace options available for monitoring MPCs and their
> >>>>> internal status, and the Juniper web site lacks potential
> >>>>> explanations and leads; therefore I'm addressing the community. Any
> >>>>> advice for getting to the bottom of this will be welcomed!
> >>>>> Additionally, any experience with troubleshooting similar hardware
> >>>>> issues might be as helpful as any advice.
> >>>>>
> >>>>> Thank you.
> >>>>> _______________________________________________
> >>>>> juniper-nsp mailing list juniper-nsp at puck.nether.net
> >>>>> https://puck.nether.net/mailman/listinfo/juniper-nsp
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> ./diogo -montagner
> >>>> JNCIE-SP 0x41A
> >>>>
> >>>
> >>
> >> --
> >> ./diogo -montagner
> >> JNCIE-SP 0x41A
> >>
>