[c-nsp] ASR9k A9K-8T-L LC crash and reload
Wyatt Mattias Gyllenvarg
wyatt.eliasson at gmail.com
Thu Jun 16 08:25:58 EDT 2011
Hi All
We are having an issue with a ring of three ASR9010s and one 7606
(Sup720-3BXL with 6704-DFC3BXL).
The LCs facing the 7606 crash and reload at random; on one occasion
both reloaded at the same time.
Both cards are in slot 0/2 and have interface Te0/2/0/0 facing the 7606.
All the ASR machines have the same physical configuration.
0/0 A9K-40GE-L
0/1 A9K-8T-L LC
0/2 A9K-8T-L LC
Dual RSPs (4G), running 4.1.0; all FPDs are updated.
Very little traffic is being forwarded as we have not yet migrated
fully to this new setup.
Running protocols are:
OSPF
MPLS LDP
PIM
BGP
IPv6 PE
CDP
All interfaces are routed.
Log shows:
LC/0/2/CPU0:Jun 16 11:27:54.502 : pfm_node_lc[267]:
%PLATFORM-DIAGS-0-LC_NP_LOOPBACK_FAILED :
Set|online_diag_lc[163921]|Line card NPU loopback Test(0x2000006)|
LC/0/2/CPU0:Jun 16 11:27:54.509 : pfm_node_lc[267]:
prm_fast_reset_subset fast reset api succeeded for chan 4
LC/0/2/CPU0:Jun 16 11:27:54.510 : pfm_node_lc[267]: NP loopback
recovery action: Succeded (NP bitmask:0x10)
LC/0/2/CPU0:Jun 16 11:27:57.975 : prm_server[278]:
%PLATFORM-NP-0-INIT_ERR : *** Error 0xA0003F03 : prm_np_fast_reset :
Channel 4 Config Start Fast Reset failed, line
LC/0/2/CPU0:Jun 16 11:27:57.976 : prm_server[278]: Line card needs to
be reloaded, a reboot is being requested
RP/0/RSP0/CPU0:Jun 16 11:27:58.031 : shelfmgr[352]:
%PLATFORM-SHELFMGR-3-NODE_CPU_RESET : Node 0/2/CPU0 CPU reset
detected.
RP/0/RSP0/CPU0:Jun 16 11:27:58.032 : shelfmgr[352]:
%PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L
state:BRINGDOWN
RP/0/RSP0/CPU0:Jun 16 11:27:58.075 : invmgr[234]:
%PLATFORM-INV-6-NODE_STATE_CHANGE : Node: 0/2/CPU0, state: BRINGDOWN
RP/0/RSP0/CPU0:Jun 16 11:28:04.026 : shelfmgr[352]:
%PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L
state:ROMMON
RP/0/RSP0/CPU0:Jun 16 11:28:26.636 : shelfmgr[352]:
%PLATFORM-SHELFMGR_HAL-6-BOOT_REQ_RECEIVED : Boot Request from
0/2/CPU0, RomMon Version: 1.3
RP/0/RSP0/CPU0:Jun 16 11:28:26.639 : shelfmgr[352]:
%PLATFORM-MBIMGR-7-IMAGE_VALIDATED : Remote location 0/2/CPU0: : MBI
tftp:/disk0/asr9k-os-mbi-4.1.0/lc/mbiasr9k-lc
RP/0/RSP0/CPU0:Jun 16 11:28:26.639 : shelfmgr[352]:
%PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L
state:MBI-BOOTING
RP/0/RSP0/CPU0:Jun 16 11:29:26.295 : shelfmgr[352]:
%PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L
state:MBI-RUNNING
LC/0/2/CPU0:16: init[65540]: %OS-INIT-7-MBI_STARTED : total time 10.058 seconds
LC/0/2/CPU0:Jun 16 11:29:29.619 : insthelper[61]:
%INSTALL-INSTHELPER-7-PKG_DOWNLOAD : MBI running; starting software
download
LC/0/2/CPU0:Jun 16 11:29:47.569 : sysmgr[89]: %OS-SYSMGR-5-NOTICE :
Card is COLD started
LC/0/2/CPU0:Jun 16 11:29:47.833 : init[65540]:
%OS-INIT-7-INSTALL_READY : total time 32.328 seconds
LC/0/2/CPU0:Jun 16 11:29:49.240 : sysmgr[320]: %OS-SYSMGR-6-INFO :
Backup system manager is ready
LC/0/2/CPU0:Jun 16 11:29:50.345 : syslog_dev[87]: dumper_config[148]:
LC/0/2/CPU0:Jun 16 11:29:50.356 : syslog_dev[87]: dumper_config[148]:
The node id is 2081
After that the LC completes a normal reload and everything goes back to normal.
A TAC case has been opened, but there has been no answer so far.
We have not found any relevant SMUs or known bugs.
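
A minimal sketch, in Python, of how the log sequence above could be turned into a timeline. It assumes the wrapped lines from the mail have first been rejoined into one message per line, that the year is 2011 (taken from the post date, since the log lines carry none), and that bit i of the "NP bitmask" refers to NP i, which is consistent with "chan 4" in the log but is not confirmed anywhere in the output. The function names are illustrative only.

```python
import re
from datetime import datetime

# Lines copied from the log above, with mail-wrapped lines rejoined.
LOG = """\
LC/0/2/CPU0:Jun 16 11:27:54.502 : pfm_node_lc[267]: %PLATFORM-DIAGS-0-LC_NP_LOOPBACK_FAILED : Set|online_diag_lc[163921]|Line card NPU loopback Test(0x2000006)|
LC/0/2/CPU0:Jun 16 11:27:54.510 : pfm_node_lc[267]: NP loopback recovery action: Succeded (NP bitmask:0x10)
LC/0/2/CPU0:Jun 16 11:27:57.975 : prm_server[278]: %PLATFORM-NP-0-INIT_ERR : *** Error 0xA0003F03 : prm_np_fast_reset : Channel 4 Config Start Fast Reset failed, line
RP/0/RSP0/CPU0:Jun 16 11:27:58.032 : shelfmgr[352]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L state:BRINGDOWN
RP/0/RSP0/CPU0:Jun 16 11:28:04.026 : shelfmgr[352]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L state:ROMMON
RP/0/RSP0/CPU0:Jun 16 11:28:26.639 : shelfmgr[352]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L state:MBI-BOOTING
RP/0/RSP0/CPU0:Jun 16 11:29:26.295 : shelfmgr[352]: %PLATFORM-SHELFMGR-6-NODE_STATE_CHANGE : 0/2/CPU0 A9K-8T-L state:MBI-RUNNING
"""

# node:timestamp : process[pid]: message
LINE_RE = re.compile(
    r"^(?P<node>\S+?):(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d\.\d{3}) : "
    r"(?P<proc>\w+)\[\d+\]: (?P<msg>.*)$"
)


def parse(log, year=2011):
    """Return (timestamp, node, process, message) tuples for matching lines."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not a complete syslog line
        # The log has no year, so one is supplied (assumption: 2011).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        events.append((ts, m["node"], m["proc"], m["msg"]))
    return events


def np_indices(bitmask):
    """Decode a bitmask into set bit positions (assumed: bit i = NP i)."""
    return [i for i in range(bitmask.bit_length()) if (bitmask >> i) & 1]


def state_timeline(events):
    """Extract the shelfmgr 'state:' transitions in log order."""
    timeline = []
    for ts, _node, proc, msg in events:
        m = re.search(r"state:(\S+)", msg)
        if proc == "shelfmgr" and m:
            timeline.append((ts, m.group(1)))
    return timeline


if __name__ == "__main__":
    events = parse(LOG)
    start = events[0][0]  # first NP loopback failure
    print("NPs in bitmask 0x10:", np_indices(0x10))
    for ts, state in state_timeline(events):
        print(f"+{(ts - start).total_seconds():7.3f}s  {state}")
```

On the excerpt above, the bitmask 0x10 decodes to index 4, matching "chan 4", and the card reaches MBI-RUNNING roughly 92 seconds after the first loopback failure.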
I found the following in one of the ASRs.
RP/0/RSP0/CPU0:core-foo-bar-1#sh asic-errors fia 0 all location 0/RSP0/CPU0
************************************************************
* Generic Errors *
************************************************************
Name : OC_INTERNAL_LOG_RF_UNEXP_SEG-GENERIC
Node Key : 0x1050015
Thresh/period(s): 10/2 Alarm state: OFF
Error count : 2
Last clearing : Sat Jun 11 08:04:44 2011
Last N errors : 2
--------------------------------------------------------------
First N errors.
@Time, Error-Data
------------------------------------------
Jun 11 08:04:44.498: RF unexp seg log
oc 0, addr 0x0, src 2
fa000000 fafafafa 0ffafafa 0f020f02 - 020e0f02 020e020e 0f020f0e 0f020f02
00020202
Jun 16 11:27:58.019: RF unexp seg log
oc 0, addr 0x0, src 2
e15b5b5b e1e1e1e1 0fe1e1e1 0f020f02 - 020e0f02 020e020e 0e020e0e 0e020e02
00020202
--------------------------------------------------------------
Name : OC_RF1_INT_LO_UNEXP_SEG-GENERIC
Node Key : 0x10501c7
Thresh/period(s): 10/2 Alarm state: OFF
Error count : 2
Last clearing : Sat Jun 11 08:04:44 2011
Last N errors : 2
--------------------------------------------------------------
First N errors.
@Time, Error-Data
------------------------------------------
Jun 11 08:04:44.498: OC_RF1_INT_MSK
Jun 16 11:27:58.019: OC_RF1_INT_MSK
--------------------------------------------------------------
************************************************************
* ASIC Reset Errors *
************************************************************
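
One way to check whether the FIA errors on the RSP line up with the LC reset is to compare timestamps. The sketch below, in Python, uses only timestamps copied from the output in this mail; the year (2011), the one-second matching window, and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y %b %d %H:%M:%S.%f"


def ts(text, year=2011):
    """Parse an IOS XR style timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {text}", FMT)


# "RF unexp seg log" entries from the asic-errors output on 0/RSP0/CPU0.
fia_errors = [ts("Jun 11 08:04:44.498"), ts("Jun 16 11:27:58.019")]

# Events from the LC 0/2 reload sequence in the syslog.
lc_events = {
    "prm_server reboot request": ts("Jun 16 11:27:57.976"),
    "shelfmgr NODE_CPU_RESET": ts("Jun 16 11:27:58.031"),
}


def correlate(err, events, window=timedelta(seconds=1)):
    """Return (event name, seconds from event to error) for the nearest
    event within the window, or None if nothing is close enough."""
    name, when = min(events.items(), key=lambda kv: abs(kv[1] - err))
    if abs(err - when) > window:
        return None
    return name, (err - when).total_seconds()


if __name__ == "__main__":
    for err in fia_errors:
        print(err, "->", correlate(err, lc_events))
```

With these values the Jun 16 FIA entry falls 43 ms after the LC's reboot request and 12 ms before the shelfmgr reset message, so it may be a side effect of the LC going down and not its cause. The Jun 11 entry matches nothing in the log excerpt, although it does coincide with the "Last clearing" time shown in the output.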
Any opinions or comments appreciated!
Best Regards
Mattias Gyllenvarg
Bredband2
Sweden