[j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press

Wed Feb 8 22:14:54 EST 2017

We have an MX480 with a pair of MS-MPC-128G service boards that are tied together 
as an 'ams' (mams-2 & mams-3) service aggregation for reliability.

Occasionally one of them, for no apparent reason, will go offline and then come 
back online, while logging this in the 'chassisd' log:

CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, jnxFruLastPowerOn 1052213068)
(as well as a bunch of other stuff).

According to Junos docs, "jnxFruOfflineReason 8" -> "buttonPress(8), -- offlined by button press"
But I know that nobody was in the room at the time of those incidents, so the 
button couldn't have been pressed.
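
To check which numeric field is which in a trap line like the one above, a minimal parsing sketch follows. It assumes the fields inside the parentheses are separated by ", " and that each is a name followed by a space and a value. The reason table is deliberately partial: buttonPress(8) is the value quoted from the Junos docs above, while none(2) is recalled from the Juniper MIB and should be verified against the MIB on the box. The outage calculation assumes jnxFruLastPowerOff and jnxFruLastPowerOn are sysUpTime stamps in hundredths of a second.

```python
# Partial table only. 8 is quoted in this post; 2 is an assumption to verify.
OFFLINE_REASONS = {2: "none", 8: "buttonPress"}

TRAP_LINE = (
    "CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on "
    "(jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, "
    "jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, "
    "jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, "
    "jnxFruLastPowerOn 1052213068)"
)


def parse_trap(line):
    """Return the name/value pairs found inside the trap's parentheses."""
    body = line[line.index("(") + 1:line.rindex(")")]
    fields = {}
    for part in body.split(", "):
        # The first space separates the field name from its value.
        name, _, value = part.partition(" ")
        fields[name] = value
    return fields


def describe(fields):
    """Summarize the offline reason and the off/on gap for one trap."""
    code = int(fields["jnxFruOfflineReason"])
    reason = OFFLINE_REASONS.get(code, "not in this partial table")
    # Assumption: both stamps are TimeTicks (1/100 s), so divide by 100.
    ticks = int(fields["jnxFruLastPowerOn"]) - int(fields["jnxFruLastPowerOff"])
    return code, reason, ticks / 100


fields = parse_trap(TRAP_LINE)
code, reason, outage_seconds = describe(fields)
print(fields["jnxFruName"], code, reason, outage_seconds)
```

For the "FRU power on" line quoted above, the parsed jnxFruOfflineReason is 2; the 8 in that line belongs to jnxFruContentsIndex. The matching "FRU power off" trap is not quoted here, so its reason code is not checked by this sketch.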

I hadn't paid too much attention to this as it was only happening occasionally 
and was either one board or the other. But today there was a whole spate of such 
incidents (20 in less than 45 minutes) and at one point it took both MPCs 
offline at the same time (thus a noticeable service-interruptus).

In the 'messages' log there are lines that correspond:

   /kernel: peer_input_pending_internal:[4506] VKS0 for peer type 22 indx 12 reported a sb_state 32 = SBS_CANTRCVMORE
   /kernel: peer_inputs:4766 VKS0 closing connection peer type 22 indx 12 err 5
   /kernel: pfe_listener_disconnect: conn dropped: listener idx=7, tnpaddr=0x13010080, reason: generic peer error
   datapath-traced[3960]: datapath_traced_connection_event_handler: Disconnected from MSPMAND
   mspd[3958]: Removed PIC connection state for fpc=3 pic=0 session=0x827a180
   (FPC Slot 3, PIC Slot 0)  ms30 kernel: svcs_ms2_app_sigcore_exit: sending UKERN_ST_DOWN (pid=190, td=0xc00000000291f960, sig=6)
   (FPC Slot 3, PIC Slot 0)  ms30 mspsmd[178]: mspsmd_connection_shutdown: Unexpected shutdown of connection, try reconnecting.
   /kernel: if_pfe_services_health_status: Generating Health status (down) msg for ifd : ms-3/0/0
   /kernel: if_pfe_services_health_status: Generating health status (down) for AMS member mams-3/0/0
   /kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev = AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE, member_present_count = 2
   /kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer
   /kernel: aggr_link_op: link mams-3/0/0.1 (lidx=1) detached from bundle ams0.1
   /kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev = AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD, member_present_count = 2
   /kernel: if_pfe_services_send_lb_options: PEER_BUILD_IPC_SLOT return NULL
   last message repeated 4 times
   mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1), ifOperStatus down(2), ifName ms-3/0/0.0
   mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1), ifOperStatus down(2), ifName mams-3/0/0.1
   (FPC Slot 3, PIC Slot 0)  ms30 kernel: msgring_drain_process: bind thread to hwtid (5) cpuid(5)
   (FPC Slot 3, PIC Slot 0)  ms30 kernel: Kernel thread "msgdrainthr5" (pid 21832) exited prematurely.
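
When there are many of these incidents, it may help to pull the AMS member state changes out of 'messages' mechanically. The sketch below is only an illustration built from the two if_pfe_ams_process_single_event lines quoted above; the regular expression, the function name, and the assumption that every such line follows one of those two shapes are mine, not anything documented by Juniper.

```python
import re

# Matches both shapes seen above:
#   "...single_event: ifd:mams-3/0/0, ev = ..."   (event received)
#   "...single_event:Done:mams-3/0/0, ev = ..."   (event processed)
AMS_EVENT = re.compile(
    r"if_pfe_ams_process_single_event:\s*(?P<done>Done:)?\s*(?:ifd:)?"
    r"(?P<ifd>[\w/-]+),\s*ev = (?P<ev>\w+) agg_state (?P<agg>\w+),\s*"
    r"member_state: (?P<member>\w+),\s*member_present_count = (?P<count>\d+)"
)


def ams_events(lines):
    """Return one dict per AMS event line; other lines are skipped."""
    events = []
    for line in lines:
        match = AMS_EVENT.search(line)
        if match is None:
            continue
        events.append({
            "ifd": match.group("ifd"),
            "event": match.group("ev"),
            "phase": "done" if match.group("done") else "start",
            "agg_state": match.group("agg"),
            "member_state": match.group("member"),
            "present": int(match.group("count")),
        })
    return events


sample = [
    "/kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev = "
    "AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE, "
    "member_present_count = 2",
    "/kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer",
    "/kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev = "
    "AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD, "
    "member_present_count = 2",
]

events = ams_events(sample)
for e in events:
    print(e["ifd"], e["phase"], e["event"], e["member_state"])
```

Run over the full log, the "start" and "done" pairs show each member going from ACTIVE to DISCARD, which makes it easier to see when both members were down together.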

Usually it runs for days at a time without a single one of these incidents.
So I cannot tell if I've got flaky hardware or a software bug that is being triggered 
by some external events.

Any suggestions? (Other than opening a JTAC case.)

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{