[j-nsp] MX480 MS-MPC-128G CHASSISD_SNMP_TRAP10 jnxFruOfflineReason 8 but no button press
David B Funk
dbfunk at engineering.uiowa.edu
Wed Feb 8 22:14:54 EST 2017
We have an MX480 with a pair of MS-MPC-128G service boards that are tied together
as an 'ams' (mams-2 & mams-3) service aggregation for reliability.
Occasionally one of them, for no apparent reason, will go offline and then back
online while logging in the 'chassisd' log:
CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, jnxFruLastPowerOn 1052213068)
(as well as a bunch of other stuff).
According to Junos docs, "jnxFruOfflineReason 8" -> "buttonPress(8), -- offlined by button press"
But I know that nobody was in the room at the time of those incidents, so the
button couldn't have been pressed.
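For anyone wanting to tally these events per slot and offline reason, here is a minimal Python sketch. It assumes the trap lines keep the exact "key value, key value" layout of the chassisd line quoted above. The function names (parse_trap, describe_reason, tally) are hypothetical, and only the buttonPress(8) mapping quoted in this post is included; any other reason code is left as a bare number rather than guessed.

```python
import re
from collections import Counter

# Only the code quoted in the post is mapped; any other value stays numeric.
OFFLINE_REASONS = {8: "buttonPress"}

# Matches the event name and the parenthesised field list of a trap line.
TRAP_RE = re.compile(
    r"CHASSISD_SNMP_TRAP10: SNMP trap generated: (?P<event>[^(]+)\((?P<fields>.*)\)"
)


def parse_trap(line):
    """Return the trap fields as a dict of strings, or None if not a trap line."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    fields = {"event": match.group("event").strip()}
    # Fields are separated by ", "; the key is the first word of each item.
    for item in match.group("fields").split(", "):
        key, _, value = item.partition(" ")
        fields[key] = value
    return fields


def describe_reason(code):
    """Render a reason code as name(code), or the bare code if unmapped."""
    name = OFFLINE_REASONS.get(int(code))
    return f"{name}({code})" if name else str(code)


def tally(lines):
    """Count traps per (event, slot, offline reason)."""
    counts = Counter()
    for line in lines:
        fields = parse_trap(line)
        if fields is None:
            continue
        key = (
            fields["event"],
            fields.get("jnxFruSlot"),
            describe_reason(fields.get("jnxFruOfflineReason", "0")),
        )
        counts[key] += 1
    return counts


# The trap line quoted in this post, used as sample input.
SAMPLE = (
    "CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on "
    "(jnxFruContentsIndex 8, jnxFruL1Index 4, jnxFruL2Index 1, "
    "jnxFruL3Index 0, jnxFruName PIC: MS-MPC-PIC @ 3/0/*, jnxFruType 11, "
    "jnxFruSlot 3, jnxFruOfflineReason 2, jnxFruLastPowerOff 1052212977, "
    "jnxFruLastPowerOn 1052213068)"
)

if __name__ == "__main__":
    for key, count in tally([SAMPLE]).items():
        print(key, count)
```

Feeding the whole chassisd log through tally() (one line per list element) gives a count per event, slot and reason, which makes it easier to see whether one board accounts for most of the incidents. Note that the field list is split on ", ", so a field value that itself contained a comma followed by a space would be mis-parsed.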
I hadn't paid too much attention to this as it was only happening occasionally
and was either one board or the other. But today there was a whole spate of such
incidents (20 in less than 45 minutes) and at one point it took both MPCs
offline at the same time (thus noticeable service-interruptus).
In the 'messages' log there are lines that correspond:
/kernel: peer_input_pending_internal:[4506] VKS0 for peer type 22 indx 12 reported a sb_state 32 = SBS_CANTRCVMORE
/kernel: peer_inputs:4766 VKS0 closing connection peer type 22 indx 12 err 5
/kernel: pfe_listener_disconnect: conn dropped: listener idx=7, tnpaddr=0x13010080, reason: generic peer error
datapath-traced[3960]: datapath_traced_connection_event_handler: Disconnected from MSPMAND
mspd[3958]: Removed PIC connection state for fpc=3 pic=0 session=0x827a180
(FPC Slot 3, PIC Slot 0) ms30 kernel: svcs_ms2_app_sigcore_exit: sending UKERN_ST_DOWN (pid=190, td=0xc00000000291f960, sig=6)
(FPC Slot 3, PIC Slot 0) ms30 mspsmd[178]: mspsmd_connection_shutdown: Unexpected shutdown of connection, try reconnecting.
/kernel: if_pfe_services_health_status: Generating Health status (down) msg for ifd : ms-3/0/0
/kernel: if_pfe_services_health_status: Generating health status (down) for AMS member mams-3/0/0
/kernel: if_pfe_ams_process_single_event: ifd:mams-3/0/0, ev = AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: ACTIVE, member_present_count = 2
/kernel: if_pfe_ams_process_member_down_event:Starting Discard Timer
/kernel: aggr_link_op: link mams-3/0/0.1 (lidx=1) detached from bundle ams0.1
/kernel: if_pfe_ams_process_single_event:Done:mams-3/0/0, ev = AMS_EV_MEMBER_HSTATUS_DOWN agg_state UP, member_state: DISCARD, member_present_count = 2
/kernel: if_pfe_services_send_lb_options: PEER_BUILD_IPC_SLOT return NULL
last message repeated 4 times
mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 641, ifAdminStatus up(1), ifOperStatus down(2), ifName ms-3/0/0.0
mib2d[3969]: SNMP_TRAP_LINK_DOWN: ifIndex 734, ifAdminStatus up(1), ifOperStatus down(2), ifName mams-3/0/0.1
(FPC Slot 3, PIC Slot 0) ms30 kernel: msgring_drain_process: bind thread to hwtid (5) cpuid(5)
(FPC Slot 3, PIC Slot 0) ms30 kernel: Kernel thread "msgdrainthr5" (pid 21832) exited prematurely.
Usually it runs for days at a time without a single one of these incidents.
So I cannot tell if I've got flaky hardware or a software bug that is being triggered
by some external events.
Any suggestions? (other than opening a JTAC case).
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{