[j-nsp] m10i Nastiness Friday night

Nilesh Khambal nkhambal at juniper.net
Mon Aug 17 16:49:20 EDT 2009


Multibit errors usually point to a bad memory component. The usual way 
forward is a precautionary RMA of the board. There is nothing that a 
JUNOS upgrade can fix in this case. JUNOS already has ECC protection 
against single-bit errors: it can automatically detect and correct them. 
However, not much can be done inside JUNOS for multibit ECC errors. Any 
troubleshooting you do is unlikely to give conclusive results, as the 
coredump might not point to the error condition that triggered the 
reset. It could simply be that the CFEB was performing its normal 
operation, accessed a bad location in memory, hit a multibit error 
there, and that triggered an exception.
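
To illustrate why single-bit errors can be corrected while multibit 
errors can only be detected, here is a minimal sketch of a generic 
SECDED (single error correct, double error detect) code in Python. It 
is a toy Hamming(8,4) example, not the ECC scheme actually used on the 
CFEB, which is not described in this thread; the function names 
encode() and decode() are made up for the illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits (1..7).
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, at position `syndrome`
        # (0 means the overall parity bit itself). Flip it back.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(word))
    print(decode(one_flip))
    print(decode(two_flips))
```

With one flipped bit the decoder can locate and repair the error; with 
two it can only report that the word is bad, which is the situation 
the "ECC multibit" message describes.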

As you mentioned, contact JTAC and they should be able to sort it out 
for you.
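
If you want to keep an eye on the box in the meantime, a small script 
can flag any recurrence of the message Dan pointed out. The following 
is only a sketch: the regular expression is modelled on the single log 
line quoted in this thread, and the function name 
find_multibit_errors() is hypothetical. It assumes each syslog entry is 
on one line, as in the original messages file rather than the wrapped 
copy in this mail.

```python
import re

# Pattern modelled on the one CFEB log line quoted in this thread;
# other JUNOS releases may format the message differently.
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+cfeb\b.*ECC multibit"
)


def find_multibit_errors(lines):
    """Return (timestamp, hostname) for every CFEB 'ECC multibit' line."""
    hits = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            hits.append((match.group("ts"), match.group("host")))
    return hits


if __name__ == "__main__":
    sample = [
        "Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error ack count = 0",
        "Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error detection "
        "reg2: ECC multibit",
    ]
    for ts, host in find_multibit_errors(sample):
        print(f"{host}: ECC multibit error at {ts}")
```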

Thanks,
Nilesh.


Clue Store wrote:
> Thanks all for the replies. I'll get with JTAC and get it sorted out. As Dan mentioned, the ECC multibit error kinda scares me, as I do not wish to have to drive 200+ miles and change out the memory. So let's hope for a Junos fix  :)
> 
> Thanks,
> Clue
> 
> On Mon, Aug 17, 2009 at 12:19 PM, Dan Rautio <drautio at juniper.net> wrote:
> This message stands out:
> 
>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error detection reg2: ECC multibit
> 
> 
> 
>> -----Original Message-----
>> From: juniper-nsp-bounces at puck.nether.net [mailto:juniper-nsp-bounces at puck.nether.net]
>> On Behalf Of Nilesh Khambal
>> Sent: Monday, August 17, 2009 10:57 AM
>> To: Clue Store
>> Cc: juniper-nsp at puck.nether.net
>> Subject: Re: [j-nsp] m10i Nastiness Friday night
>>
>> It looks like the CFEB dumped core and restarted. Please open a JTAC case
>> and let them figure out what went wrong with the CFEB. Please gather all
>> logs from around the time of the problem. Usually the following logs are a
>> good start.
>>
>> - show log messages[.(0-9).gz] (From RE)
>> - show syslog messages (from CFEB)
>> - show nvram (from CFEB).
>> - CFEB coredump file generated under "/var/tmp"
>> - Any other surrounding information such as temperature, memory, and CPU
>> information about the RE and CFEB around the time of the problem.
>>
>> Given the old version of code you are running on the box, this may be a
>> known issue fixed in a later release such as 8.5, which you are running on
>> the other box. Let JTAC analyze that.
>>
>> Thanks,
>> Nilesh.
>>
>> Clue Store wrote:
>>> Hi All,
>>>
>>> Last Friday we had some nastiness on one of our m10i's. As I am not a
>>> Juniper expert, I was wondering if someone could decipher the log messages
>>> and determine if it is possibly a CFEB issue, or just a fluke Junos issue, and
>>> whether I should do anything or let it be and see if it does it again. I
>>> have another m10i running 8.5, so I am thinking of just upgrading this box
>>> to the same as my other, but I'd like to hear what some of you on the list
>>> think.
>>>
>>> TIA,
>>> Clue
>>>
>>> Hostname: JuniperM10i-HMNDLAMA
>>> Model: m10i
>>> JUNOS Base OS boot [8.0R2.8]
>>> JUNOS Base OS Software Suite [8.0R2.8]
>>> JUNOS Kernel Software Suite [8.0R2.8]
>>> JUNOS Packet Forwarding Engine Support (M7i/M10i) [8.0R2.8]
>>> JUNOS Routing Software Suite [8.0R2.8]
>>> JUNOS Online Documentation [8.0R2.8]
>>>
>>>
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 machine check caused
>> by
>>> error on the Processor Bus
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 PCI status register:
>>> 0x0020, error detect register 1: 0x00, 2: 0x08
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error ack count = 0
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error address:
>> 0x0f3827f8
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 Processor bus error
>> status
>>> register: 0x52
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb transfer type 0b01010,
>> transfer
>>> size 2
>>> Aug 14 23:38:51  JuniperM10i-HMNDLAMA cfeb mpc106 error detection reg2:
>> ECC
>>> multibit
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb ^B
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb last message repeated 6 times
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Context: Interrupt Level (0)
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Registers:
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R00: 0x00000446 R01:
>> 0x00799450
>>> R02: 0x00000000 R03: 0x4f3827fc
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R04: 0x00000552 R05:
>> 0x00000000
>>> R06: 0x007994a0 R07: 0x00000004
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R08: 0x00000548 R09:
>> 0x0017f48b
>>> R10: 0x00000002 R11: 0xb0c7d8ec
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R12: 0x28002044 R13:
>> 0x02420020
>>> R14: 0xf1ae2100 R15: 0x82600020
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R16: 0x442104c2 R17:
>> 0x2248000b
>>> R18: 0x00670000 R19: 0x00670000
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R20: 0x00670000 R21:
>> 0x006ce5a0
>>> R22: 0x007902d0 R23: 0x00670000
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R24: 0x00000002 R25:
>> 0x00000004
>>> R26: 0x0080bd40 R27: 0x0000ffff
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb R28: 0x00000001 R29:
>> 0x00000001
>>> R30: 0x4f38271c R31: 0x4f382714
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb MSR: 0x00089030 CTR:
>> 0x00000239
>>> Link:0x002e34c8 SP:  0x00799450
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb CCR: 0x48002028 XER:
>> 0x20000000
>>> PC:  0x00460320
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb DSISR: 0x00000000 DAR:
>> 0x00000000
>>> K_MSR: 0x00000030
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Stack Traceback:
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 01: sp = 0x00799450, pc
>> =
>>> 0x0000c001
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 02: sp = 0x00799468, pc
>> =
>>> 0x002e4d74
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 03: sp = 0x00799498, pc
>> =
>>> 0x002e35e0
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 04: sp = 0x007994b8, pc
>> =
>>> 0x002e3bb0
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 05: sp = 0x007994c0, pc
>> =
>>> 0x00058818
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 06: sp = 0x007994d8, pc
>> =
>>> 0x0003df34
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 07: sp = 0x00799500, pc
>> =
>>> 0x003b4488
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 08: sp = 0x00799530, pc
>> =
>>> 0x003b4660
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 09: sp = 0x00799548, pc
>> =
>>> 0x003b3ed0
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 10: sp = 0x007995c8, pc
>> =
>>> 0x003b3d30
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 11: sp = 0x007995e8, pc
>> =
>>> 0x000b9f6c
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 12: sp = 0x00799610, pc
>> =
>>> 0x000b8928
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 13: sp = 0x00799628, pc
>> =
>>> 0x00448518
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 14: sp = 0x00799678, pc
>> =
>>> 0x00442d00
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 15: sp = 0x00799698, pc
>> =
>>> 0x0003a500
>>> Aug 14 23:38:52  JuniperM10i-HMNDLAMA cfeb Frame 16: sp = 0x007996b0, pc
>> =
>>> 0x0003b268
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA /kernel: rdp keepalive expired,
>>> connection dropped - src 1:1021 dest 2:15360
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]:  Major alarm set,
>> CFEB
>>> not online, the box is not forwarding
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm set: CFEB
>>> color=RED, class=CHASSIS, reason=CFEB not online, the box is not
>> forwarding
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 43
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_SHUTDOWN_NOTICE: Shutdown reason: CFEB connection lost
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(0)
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA mib2d[3111]: SNMP_TRAP_LINK_DOWN:
>>> ifIndex 77, ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/0
>>>
>>> (Lots of BGP notifications due to interface down issues)
>>>
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA snmpd[3132]: SNMPD_SEND_FAILURE:
>>> trap_io_send_trap_now: send to (207.29.223.55) failure: Network is down
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: shutting down
>> chassisd
>>> connection: chassisd ipc pipe read error
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]:
>>> craftd_user_conn_shutdown: socket 5, errno = 0
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]: chassisd connection
>>> succeeded after 0 retries
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: chassisd connection
>>> succeeded after 0 retries
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA mib2d[3111]: SNMP_TRAP_LINK_DOWN:
>>> ifIndex 80, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-
>> 1/0/0.462
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: resending alarm
>> state
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 43
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm set: CFEB
>>> color=RED, class=CHASSIS, reason=CFEB not online, the box is not
>> forwarding
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm set: RE
>> color=RED,
>>> class=CHASSIS, reason=Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 43
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm set: RE
>> color=RED,
>>> class=CHASSIS, reason=Host 1 fxp0: Ethernet Link Down
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 43
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA /kernel: rdp keepalive expired,
>>> connection dropped - src 1:1020 dest 2:15361
>>> Aug 14 23:38:56  JuniperM10i-HMNDLAMA /kernel: pfe_listener_disconnect:
>> conn
>>> dropped: listener idx=0, tnpaddr=0x2, reason: socket error
>>> Aug 14 23:39:41  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED_FULL: Fans and impellers being set to full speed
>>> [system warm]
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_SNMP_TRAP10:
>>> SNMP trap generated: FRU power on (jnxFruContentsIndex 6, jnxFruL1Index
>> 1,
>>> jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CFEB 0, jnxFruType 4,
>>> jnxFruSlot 1, jnxFruOfflineReason 2, jnxFruLastPowerOff 0,
>> jnxFruLastPowerOn
>>> 0)
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(0)
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(1)
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_DETACH_ALL_PSEUDO: ifdev_detach(pseudo devices: all)
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA craftd[2999]: Major alarm cleared,
>>> Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm cleared: RE
>>> color=RED, class=CHASSIS, reason=Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 44
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA cfeb CM: ALARM SET: (Major) Slot
>> 0:
>>> CFEB not online, the box is not forwarding
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA cfeb CM: ALARM SET: (Major) Slot
>> 0:
>>> Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:40:09  JuniperM10i-HMNDLAMA cfeb CM: ALARM SET: (Major) Slot
>> 1:
>>> Host 1 fxp0: Ethernet Link Down
>>> Aug 14 23:40:10  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_FRU_EVENT:
>>> fpc_m40_recv_restart: restarted FPC 0
>>> Aug 14 23:40:10  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_FRU_EVENT:
>>> fpc_m40_recv_restart: restarted FPC 1
>>> Aug 14 23:40:12  JuniperM10i-HMNDLAMA craftd[2999]:  Major alarm set,
>> Host 0
>>> fxp0: Ethernet Link Down
>>> Aug 14 23:40:12  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm set: RE
>> color=RED,
>>> class=CHASSIS, reason=Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:40:12  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 43
>>> Aug 14 23:40:12  JuniperM10i-HMNDLAMA cfeb CM: ALARM CLEAR: Slot 0: Host
>> 0
>>> fxp0: Ethernet Link Down
>>> Aug 14 23:40:17  JuniperM10i-HMNDLAMA craftd[2999]: Major alarm cleared,
>>> CFEB not online, the box is not forwarding
>>> Aug 14 23:40:17  JuniperM10i-HMNDLAMA alarmd[2998]: Alarm cleared: CFEB
>>> color=RED, class=CHASSIS, reason=CFEB not online, the box is not
>> forwarding
>>> Aug 14 23:40:17  JuniperM10i-HMNDLAMA craftd[2999]: forwarding display
>>> request to chassisd: type = 4, subtype = 44
>>> Aug 14 23:40:17  JuniperM10i-HMNDLAMA cfeb CM: ALARM SET: (Major) Slot
>> 0:
>>> Host 0 fxp0: Ethernet Link Down
>>> Aug 14 23:40:32  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED: Fans and impellers are now running at normal
>> speed
>>> Aug 14 23:40:33  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_FRU_EVENT:
>>> scb_recv_slot_attach: attached FPC 0
>>> Aug 14 23:40:55  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_FRU_EVENT:
>>> scb_recv_slot_attach: attached FPC 1
>>> Aug 14 23:40:57  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_SNMP_TRAP10:
>>> SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index
>> 1,
>>> jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: 1x G/E, 1000 BASE-SX @
>>> 0/0/*, jnxFruType 11, jnxFruSlot 1, jnxFruOfflineReason 2,
>>> jnxFruLastPowerOff 0, jnxFruLastPowerOn 0)
>>> Aug 14 23:40:57  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_SNMP_TRAP10:
>>> SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index
>> 2,
>>> jnxFruL2Index 1, jnxFruL3Index 0, jnxFruName PIC: 1x G/E, 1000 BASE-SX @
>>> 1/0/*, jnxFruType 11, jnxFruSlot 2, jnxFruOfflineReason 2,
>>> jnxFruLastPowerOff 0, jnxFruLastPowerOn 0)
>>> Aug 14 23:40:57  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for
>>> ge-0/0/0
>>> Aug 14 23:40:58  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for
>>> ge-1/0/0
>>> Aug 14 23:40:58  JuniperM10i-HMNDLAMA chassisd[2997]:
>> CHASSISD_SNMP_TRAP10:
>>> SNMP trap generated: FRU power on (jnxFruContentsIndex 7, jnxFruL1Index
>> 1,
>>> jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC:  @ 0/*/*, jnxFruType
>> 3,
>>> jnxFruSlot 1, jnxFruOfflineReason 2, jnxFruLastPowerOff 0,
>> jnxFruLastPowerOn
>>> 0)
>>>
>>> (BGP notifications that peers are responding)
>>>
>>>
>>> Aug 14 23:42:22  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED_FULL: Fans and impellers being set to full speed
>>> [system warm]
>>> Aug 14 23:43:22  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED: Fans and impellers are now running at normal
>> speed
>>> Aug 14 23:44:02  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED_FULL: Fans and impellers being set to full speed
>>> [system warm]
>>> Aug 14 23:44:37  JuniperM10i-HMNDLAMA chassisd[2997]:
>>> CHASSISD_BLOWERS_SPEED: Fans and impellers are now running at normal
>> speed
>>> _______________________________________________
>>> juniper-nsp mailing list juniper-nsp at puck.nether.net
>>> https://puck.nether.net/mailman/listinfo/juniper-nsp
>>> .
>>>
>>
>> _______________________________________________
>> juniper-nsp mailing list juniper-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/juniper-nsp
> 
> 



