[c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted at once!!
Aaron
aaron1 at gvtc.com
Fri Mar 15 10:06:10 EDT 2013
I opened 2 TAC cases: one with the IOS team for the ME3600's and one with
the IOS XR team.
Cisco IOS TAC is still investigating (they want more crashinfo files and
running configs from me), but thus far I have been told that the 2/20
linecard in my ASR9010 reloaded due to a double-bit error (double ECC; I
believe ECC is error-correcting code). Syslogs and CLI output are below.
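
For anyone curious why one flipped bit is survivable and two are not, here
is a toy sketch of a SECDED (single error correct, double error detect)
Hamming code over 4 data bits. This is only an illustration of the general
idea; I am not claiming it is the scheme the NP memory on this linecard
actually uses, and the function names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = nibble
    # Hamming parity bits sit at positions 1, 2 and 4.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Extra overall parity bit at position 0 (this is what adds "DED").
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Return (status, data) where status is 'ok', 'corrected' or
    'uncorrectable'; data is None when the word cannot be trusted."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 for any valid codeword
    if overall == 1:
        # Odd number of flips: assume one, and the syndrome names the bit
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = 'corrected'
    elif syndrome != 0:
        # Parity looks even but the syndrome is non-zero: two bits flipped.
        return 'uncorrectable', None
    else:
        status = 'ok'
    return status, [code[3], code[5], code[6], code[7]]


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode([1, 0, 1, 1])
print(check(word))              # ('ok', [1, 0, 1, 1])
print(check(flip(word, 6)))     # ('corrected', [1, 0, 1, 1])
print(check(flip(word, 2, 6)))  # ('uncorrectable', None)
```

With two flipped bits the code can tell something is wrong but cannot tell
which bits, so hardware has little choice but to raise a fatal fault, which
matches the card reset described here.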
The IOS XR TAC engineer recommends replacing the linecard if/when it
happens a second time.
The IOS TAC engineer said that when one bit changes in memory it's
correctable, but when two bits change it's uncorrectable and that linecard
reloads. The IOS TAC engineer also said that the linecard in the ASR9k seems
to have crashed prior to the ME3600's reloading. The syslogs seem to agree:
the BGP down messages for those ME3600's started a few minutes after
14:22:38 (when the ASR9k linecard crashed). I think BGP keepalives default
to 60 seconds and a BGP neighbor session doesn't time out until 180 seconds
(I think 3 * keepalive).
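
As a rough sanity check on that timing, here is a small sketch of the hold
timer arithmetic. It assumes the default 60 s keepalive and 180 s hold time,
no BFD or fast fallover, and that the peer simply stops hearing from the RR
at the crash time; the function name is just for illustration.

```python
from datetime import datetime, timedelta

# Assumed defaults: keepalive 60 s, hold time 180 s (3 x keepalive).
KEEPALIVE = timedelta(seconds=60)
HOLD_TIME = 3 * KEEPALIVE


def hold_expiry_window(failure_time, keepalive=KEEPALIVE, hold_time=HOLD_TIME):
    """Return (earliest, latest) time a peer would declare the session down.

    The peer restarts its hold timer on every KEEPALIVE or UPDATE it
    receives, so the last message it got arrived somewhere between zero and
    one keepalive interval before the failure.
    """
    earliest = failure_time + hold_time - keepalive
    latest = failure_time + hold_time
    return earliest, latest


# Syslog-server timestamp of the linecard crash.
crash = datetime(2013, 3, 14, 14, 22, 38)
earliest, latest = hold_expiry_window(crash)
print(earliest.time(), latest.time())  # 14:24:38 14:25:38
```

So BGP down messages showing up two to three minutes after the crash would
be consistent with plain hold timer expiry.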
Here is the cli output for that card ... Last Reset :
pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
155724 (prm_server) : Thu Mar 14 19:24:00 2013
Did you see that process id number ? 155724.....you will also see that pid
in the syslog messages.
That's when the ASR9k linecard reloaded, which seems to have caused (13) of
my ME3600's to reboot! About those 13 ME3600's:
All run 15.3(1)S. They are scattered throughout my network, sparsely
located here and there, with no real physical commonality among them.
All 13 of these ME3600's run MP-iBGP with dual route reflectors; one of the
RRs is on that ASR9010. This MP-iBGP is for MPLS L3VPNs. The PE-CE on the
ME3600's is directly connected routing, that's it. The PE-CE in my core that
connects to my legacy IP network is OSPF from dual PE-CE feeds for
redundancy. The dual PE-CE links are between dual ASR9k/7609-S pairs; the
ASR9k's are in fact also the dual RRs. One of them is the ASR9010 that had
the linecard crash. Speculation I heard from IOS TAC yesterday regarding the
ME3600 crash was that it may be related to a CEF route change bug in
15.3(1)S. It seems that perhaps when the ASR9010 linecard crashed, the
several hundred routes learned via that PE-CE connection to the legacy 7609
propagated over the L3VPN and into the ME3600's, causing them to do CEF/FIB
convergence and possibly converge over to the other ASR9k/7609 location. BUT
this is only speculation about the cause of the ME3600 reloads for now.
Hopefully more on that will come later from IOS TAC once I give them more
crashinfo files and running configs.
Bear in mind, I have (4) more ME3600's configured the same way as the
crashed ones and they DID NOT reboot. Those (4) run 15.2.2S or 15.2.4.S1.
Syslog messages...
2013-03-14 14:22:38 Local7.Emerg 9k 16328: LC/0/1/CPU0:Mar 14
14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR :
Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC
ERROR, NP=1, memId=18, subMemId=0x1
2013-03-14 14:22:38 Local7.Emerg 9k 16329: LC/0/1/CPU0:Mar 14
14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ :
pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f,
Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR,
NP=1, memId=18, subMemId=0x1
2013-03-14 14:22:38 Local7.Critical 9k 16330: LC/0/1/CPU0:Mar 14
14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required, process
(pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset
requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node:
0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault
Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
2013-03-14 14:22:38 Local7.Error 9k 16331: LC/0/1/CPU0:Mar 14
14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse error:
unmatched "
2013-03-14 14:22:38 Local7.Error 9k 16333: LC/0/1/CPU0:Mar 14
14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out! Node
will reset
2013-03-14 14:22:38 Local7.Error 9k 16335: LC/0/1/CPU0:Mar 14
14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc) did not
complete in 45 seconds, shutting down
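
If you want to pull the requesting process and PID out of messages like
these programmatically, something along these lines might do. It is a
sketch only: the regular expressions are my own guess at the layout based
on the lines above (with the mail client's line wraps removed), not an
official IOS XR log format definition, and the names are hypothetical.

```python
import re

# Two of the messages above, rejoined into single lines.
LINES = [
    'LC/0/1/CPU0:Mar 14 14:24:00.733 : pfm_node_lc[267]: '
    '%PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : Set|prm_server[155724]|'
    'Network Processor Unit(0x1007001)|NP DOUBLE ECC ERROR, NP=1, '
    'memId=18, subMemId=0x1',
    'LC/0/1/CPU0:Mar 14 14:24:00.736 : pfm_node_lc[267]: '
    '%PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, '
    'Card reset requested by: Process ID: 155724 (prm_server), '
    'Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, '
    'Device Handle: 0x1007001, CondID: 1001, '
    'Fault Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1',
]

# node:timestamp : process[job]: %MESSAGE-CODE : free text
MSG_RE = re.compile(
    r'^(?P<node>\S+?):(?P<ts>\w{3} +\d+ [\d:.]+) : '
    r'(?P<proc>\w+)\[(?P<job>\d+)\]: %(?P<code>[A-Z0-9_-]+) : (?P<text>.*)$'
)
# The PID shows up in two different shapes in the free text.
PID_RE = re.compile(r'prm_server\[(\d+)\]|Process ID: (\d+) \(prm_server\)')


def parse(line):
    """Return node, message code and prm_server PID, or None if no match."""
    m = MSG_RE.match(line)
    if m is None:
        return None
    pid = PID_RE.search(m.group('text'))
    return {
        'node': m.group('node'),
        'code': m.group('code'),
        'pid': int(pid.group(1) or pid.group(2)) if pid else None,
    }


for line in LINES:
    print(parse(line))

# Both messages point at the same requesting process.
pids = {parse(line)['pid'] for line in LINES}
print(pids)  # {155724}
```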
RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
Fri Mar 15 08:17:12.824 CDT
-------------------------------------------------------------------------------
Platform Node      : 0/1/CPU0 (slot 1)
PID                : A9K-2T20GE-L
Card Type          : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
VID/SN             : V03 / FOC15078GST
Oper State         : IOS XR RUN
Last Reset         : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server)
                   : Thu Mar 14 19:24:00 2013
Configuration      : Power is enabled
                     Bootup enabled.
                     Monitoring enabled
Rommon Ver         : Version 1.03(20100212:011148)
IOS SW Ver         : 4.1.2
Main Power         : Power state Enabled. Estimate power 350 Watts of power required.
Faults             : N/A
-------------------------------------------------------------------------------
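
A note on the timestamps, since they look inconsistent at first glance: the
Last Reset field says 19:24:00 while the linecard's own syslog stamp says
14:24:00. The sketch below assumes Last Reset is shown in UTC and that the
box and the syslog server are both on CDT (UTC-5); neither is confirmed by
the output itself.

```python
from datetime import datetime, timedelta, timezone

# Assumption: router and syslog server both use US Central Daylight Time.
CDT = timezone(timedelta(hours=-5), "CDT")

# "Last Reset" from the platform summary, assumed to be UTC.
last_reset_utc = datetime(2013, 3, 14, 19, 24, 0, tzinfo=timezone.utc)
last_reset_local = last_reset_utc.astimezone(CDT)
print(last_reset_local.isoformat())  # 2013-03-14T14:24:00-05:00

# Receive time stamped by the syslog server on the same messages.
server_rx = datetime(2013, 3, 14, 14, 22, 38, tzinfo=CDT)
skew = last_reset_local - server_rx
print(skew.total_seconds())  # 82.0
```

Under those assumptions the reset time matches the linecard's syslog stamp,
and the router and syslog server clocks disagree by roughly 82 seconds,
which is worth keeping in mind when lining up events across devices.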
RP/0/RSP0/CPU0:9k#sh instal summ
Fri Mar 15 08:17:44.055 CDT
Active Packages:
disk0:asr9k-mini-p-4.1.2
disk0:asr9k-doc-p-4.1.2
disk0:asr9k-k9sec-p-4.1.2
disk0:asr9k-mpls-p-4.1.2
disk0:asr9k-mgbl-p-4.1.2
disk0:asr9k-mcast-p-4.1.2
aaron
-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Jason Lixfeld
Sent: Thursday, March 14, 2013 5:09 PM
To: cisco-nsp at puck.nether.net NSP
Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
at once!!
What XR version are you running?
Trident or Typhoon cards?
ME3600s all rebooted at the exact moment the LC crashed?
ME3600 crashes with errors/crashinfo?
OSPF is your IGP or IGP is something else and OSPF was inside a VRF facing
the CE?
Is BFD for IGP and/or BFD for BGP enabled?
BGP is straight BGP or MP-BGP to the ME3600s?
LDP between ASR and ME3600s?
I don't have an answer for you, but there are some common elements on my
network based on the description you have provided here about your network,
so I'm asking probing questions to determine any other similarities.
--
Sent from my mobile device
On 2013-03-14, at 5:35 PM, "Aaron" <aaron1 at gvtc.com> wrote:
> Y'all know anything about this?
>
> Something bad just happened in my network
>
> I have an asr9010 that just showed a 2/20 module fail and come back
> up. the pe-ce link on that card also showed ospf neighbor state bounce
> at that moment.AND that asr9010 is a route reflector for several of my
> pe's throughout my network.. Of those pe's (13) ME3600's running
> 15.3(1)S ALL REBOOTED!!!
>
> ..i have another me3600 running 15.3(1)S that is not running bgp that
> did not reboot
>
> ..i have several other me3600's running pre 15.3 (so 15.2.something)
> that are running similar config as the rebooted me's, which did NOT
> reboot
>
> Aaron
>
> _______________________________________________
> cisco-nsp mailing list cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/