[j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press

Wed Feb 8 23:37:11 EST 2017

Hi Dave,

We had a similar issue on the PTX. It turned out the buttons were of poor quality, so the normal vibration from the fan trays could register as a button press. You will need to open a JTAC case for further investigation.
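
Before opening the case, it may help to tabulate which offline reason each trap actually carries. Here is a minimal sketch in Python, assuming the chassisd entries look like the CHASSISD_SNMP_TRAP10 line quoted below. The helper names (parse_trap, describe_reason) are my own, and the reason table contains only the buttonPress(8) value cited from the Junos docs; any other codes would have to be filled in from the MIB.

```python
import re

# Only buttonPress(8) comes from the Junos docs quoted in this thread.
# Extend this table from the jnxFruOfflineReason definition in the MIB.
OFFLINE_REASONS = {8: "buttonPress"}

# Matches numeric jnxFru* fields such as "jnxFruSlot 3".
FIELD_RE = re.compile(r"\b(jnxFru[A-Za-z0-9]+)\s+(\d+)\b")


def parse_trap(text):
    """Return the numeric jnxFru* fields of one trap entry as a dict."""
    return {name: int(value) for name, value in FIELD_RE.findall(text)}


def describe_reason(fields):
    """Map jnxFruOfflineReason to a name where known, else show the code."""
    code = fields.get("jnxFruOfflineReason")
    if code is None:
        return "missing"
    return OFFLINE_REASONS.get(code, f"code {code}")


# The trap entry from the original post, joined onto one line.
SAMPLE = (
    "CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on "
    "(jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, "
    "jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, "
    "jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, "
    "jnxFruLastPowerOn 1052213068)"
)

if __name__ == "__main__":
    fields = parse_trap(SAMPLE)
    print(fields["jnxFruSlot"], describe_reason(fields))
```

For the entry quoted below, this reports jnxFruOfflineReason 2, not 8, so it is worth confirming which log entries actually carry reason 8 before attributing the events to the button.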

-- 
Sebastian Becker
sb at lab.dtag.de

> On 08.02.2017 at 22:31, Michael Gehrmann <mgehrmann at atlassian.com> wrote:
> 
> 
> Hi David,
> 
> Might be worth checking for core dumps. I'd also do a PR search and check
> the release notes for later releases. I have found that, on rare
> occasions, MS cards can get into weird corner cases that normally need
> JTAC to resolve.
> 
> Regards
> Mike
> 
> On 9 February 2017 at 14:14, David B Funk <dbfunk at engineering.uiowa.edu>
> wrote:
> 
>> We have an MX480 with a pair of MS-MPC-128G service boards that are tied
>> together as an 'ams' (mams-2 & mams-3) service aggregation for reliability.
>> 
>> Occasionally one of them, for no apparent reason, will go offline and then
>> come back online, logging the following in the 'chassisd' log:
>> 
>> CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on
>> (jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, jnxFruL3Index 0,
>> jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, jnxFruSlot 3,
>> jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, jnxFruLastPowerOn
>> 1052213068)
>> (as well as a bunch of other stuff).
>> 
>> According to Junos docs, "jnxFruOfflineReason 8" -> "buttonPress(8), --
>> offlined by button press"
>> But I know that nobody was in the room at the time of those incidents, so
>> the button couldn't have been pressed.
>> 
>> I hadn't paid too much attention to this, as it was only happening
>> occasionally and affected either one board or the other. But today there
>> was a whole spate of such incidents (20 in less than 45 minutes), and at one
>> point it took both MPCs offline at the same time (thus a noticeable
>> service-interruptus).
>> 
>> In the 'messages' log there are lines that correspond:
>> 
>>  /kernel: peer_input_pending_internal:[4506] VKS0 for peer type 22 indx
>> 12 reported a sb_state 32 = SBS_CANTRCVMORE
>>  /kernel: peer_inputs:4766 VKS0 closing connection peer type 22 indx 12
>> err 5
>>  /kernel: pfe_listener_disconnect: conn dropped: listener idx=7,
>> tnpaddr=0x13010080, reason: generic peer error
>>  datapath-traced[3960]: datapath_traced_connection_event_handler:
>> Disconnected from MSPMAND
>>  mspd[3958]: Removed PIC connection state for fpc=3 pic=0
>> session=0x827a180
>>  (FPC Slot 3, PIC Slot 0)  ms30 kernel: svcs_ms2_app_sigcore_exit:
>> sending UKERN_ST_DOWN (pid=190, td=0xc00000000291f960, sig=6)
>>  (FPC Slot 3, PIC Slot 0)  ms30 mspsmd[178]: mspsmd_connection_shutdown:
>> Unexpected shutdown of connection, try reconnecting.
>>  /kernel: if_pfe_services_health_status: Generating Health status (down)
>> msg for ifd : ms-3/0/0
>>  /kernel: if_pfe_services_health_status: Generating health status (down)
>> for AMS member mams-3/0/0
>>  /kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev =
>> AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE,
>> member_present_count = 2
>>  /kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer
>>  /kernel: aggr_link_op: link mams-3/0/0.1 (lidx=1) detached from bundle
>> ams0.1
>>  /kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev =
>> AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD,
>> member_present_count = 2
>>  /kernel: if_pfe_services_send_lb_options: PEER_BUILD_IPC_SLOT return
>> NULL
>>  last message repeated 4 times
>>  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1),
>> ifOperStatus down(2), ifName ms-3/0/0.0
>>  mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1),
>> ifOperStatus down(2), ifName mams-3/0/0.1
>>  (FPC Slot 3, PIC Slot 0)  ms30 kernel: msgring_drain_process: bind
>> thread to hwtid (5) cpuid(5)
>>  (FPC Slot 3, PIC Slot 0)  ms30 kernel: Kmernel thread "msgdrainthr5"
>> (pid 21832) exited prematurely.
>> 
>> Usually it runs for days at a time without a single one of these
>> incidents.
>> So I cannot tell if I've got flaky hardware or a software bug that is
>> being triggered by some external event.
>> 
>> Any suggestions (other than opening a JTAC case)?
>> 
>> --
>> Dave Funk                                  University of Iowa
>> <dbfunk (at) engineering.uiowa.edu>        College of Engineering
>> 319/335-5751   FAX: 319/384-0549           1256 Seamans Center
>> Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
>> #include <std_disclaimer.h>
>> Better is not better, 'standard' is better. B{
>> _______________________________________________
>> juniper-nsp mailing list juniper-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/juniper-nsp
> 
> 
> 
> 
> -- 
> Michael Gehrmann
> Senior Network Engineer - Atlassian
> m: +61 407 570 658