[c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted at once!!

Aaron aaron1 at gvtc.com
Fri Mar 15 13:03:29 EDT 2013


Another commonality TAC pointed out to me among my ME3600s that crashed
is that they are all running the l2vpn vpls address family.

What's a 16T?  ...16 ten-gig ports?

Aaron


-----Original Message-----
From: Jason Lixfeld [mailto:jason at lixfeld.ca] 
Sent: Friday, March 15, 2013 10:01 AM
To: Aaron
Cc: cisco-nsp at puck.nether.net
Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
at once!!

Interesting.  I just checked my archives and I have had two instances where
LCs have rebooted due to that same error.  XR versions spanned 4.2.0 -
4.2.3.  You are running older code than I am.  Both instances of my LCs
f**king off were on separate ASR9Ks: the first time was a 2/20 (on 4.2.0),
the second time was a 16T (on 4.2.3) on Jan. 1 (Happy New Year to me! :|)

SRs 622594207 and 624325505.  Cards were RMAd both times.

15.3(1)S has been out since November and at the time of the LC crash on
January 1, I only had 1 ME3600 deployed running 15.3(1)S.  It has been up
for 100 days, so it lasted beyond the LC crash.

At this point, I'm more interested in the "theory" TAC has about the
15.3(1)S bug that they think might have triggered the reboots.  If you can
pass me the SR or drop me a note when you find out one way or the other, I'd
be grateful.  Also, if 15.3(1)S1 fixes that bug, that would be good
information as well.

On 2013-03-15, at 10:06 AM, "Aaron" <aaron1 at gvtc.com> wrote:

> 2 TAC cases opened: one with the IOS team for the me3600's and one with
> the IOS XR team.
> 
> IOS TAC is still investigating (they want more crashinfos and running
> configs from me), but thus far I have been told that my 2/20 linecard in
> my asr9010 reloaded due to a double-bit error (double ECC; I believe ECC
> stands for error-correcting code).  Syslogs and cli output below.
> 
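
For readers unfamiliar with ECC, the difference between a single-bit and a double-bit error can be illustrated with a toy SECDED (single-error-correct, double-error-detect) code. The sketch below is an extended Hamming(8,4) code in Python. It is not how the ASR9K network processor memory is actually protected (the NP's scheme is not stated in this thread), and the function names are made up for illustration. It only shows why one flipped bit can be repaired while two flipped bits can be detected but not repaired.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word.

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_bad = sum(code) % 2 == 1  # overall parity check over all 8 bits

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but consistent overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1

print(decode(word))       # (11, 'ok')
print(decode(one_flip))   # (11, 'corrected')
print(decode(two_flips))  # (None, 'uncorrectable')
```

In this toy model the "uncorrectable" result is the point at which real hardware has no safe option left except to raise a fault, which is consistent with the card reset described in the message.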
> The IOS XR TAC engineer recommends replacing the linecard if/when it
> happens a second time.
> 
> The IOS TAC eng said that when one bit changes in memory, it's correctable,
> but when two bits change it's uncorrectable and that linecard reloads.
> The eng also said that the linecard in the asr9k seems to have crashed
> before the me3600's reloaded.  The syslogs seem to agree: the bgp down
> messages for those me3600's started a few minutes after 14:22:38 (when
> the asr9k linecard crashed).  I think bgp keepalives default to 60
> seconds and a bgp neighbor session doesn't time out until the
> 180-second hold time expires (I think 3*keepalive).
> 
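
The timer arithmetic above can be made concrete. This is a minimal sketch, assuming the 60-second keepalive and 180-second hold time mentioned in the message, that the peer simply goes silent (no TCP reset or NOTIFICATION), and that nothing faster (BFD, fast fallover) tears the session down first. Keepalive jitter is ignored, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

KEEPALIVE = timedelta(seconds=60)   # interval between keepalives (assumed default)
HOLD_TIME = timedelta(seconds=180)  # hold time, 3 x keepalive (assumed default)


def hold_expiry_window(peer_silent_at, keepalive=KEEPALIVE, hold=HOLD_TIME):
    """Return (earliest, latest) time at which the hold timer can expire.

    The hold timer restarts on every KEEPALIVE or UPDATE received. The last
    message from the peer arrived somewhere within one keepalive interval
    before it went silent, so expiry falls between (silent - keepalive + hold)
    and (silent + hold).
    """
    earliest = peer_silent_at - keepalive + hold
    latest = peer_silent_at + hold
    return earliest, latest


# Syslog-server timestamp of the linecard crash, taken from the message.
crash = datetime(2013, 3, 14, 14, 22, 38)
earliest, latest = hold_expiry_window(crash)
print(earliest.time(), latest.time())  # 14:24:38 14:25:38
```

Under those assumptions, BGP down messages two to three minutes after the crash are what plain hold-timer expiry would produce, which matches "a few minutes after 14:22:38".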
> Here is the cli output for that card ...
> 
>        Last Reset : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server)
>                   : Thu Mar 14 19:24:00 2013
> 
> Did you see that process id number ?  155724.....you will also see 
> that pid in the syslog messages.
> 
> That's when the asr9k linecard reloaded and seems to have caused (13) 
> of my me3600's to reboot!  These 13 me3600's are as follows....
> 
> All run 15.3(1)S.  They are scattered throughout my network, sparsely
> located here and there, with no real physical commonality among them.
> All of these 13 me3600's run MP-iBGP with dual route reflectors; one of
> the RR's is on that asr9010.  This MP-iBGP is for mpls l3vpn's.  The
> pe-ce on the me3600's is directly connected routing...that's it.  The
> pe-ce in my core to connect to my legacy ip net is ospf from dual pe-ce
> feeds for redundancy.
> 
> The pe-ce dual links are between dual asr9k/7609-s pairs; the asr9k's
> are in fact the dual RR's also.  One of them is that asr9010 that had a
> linecard crash.  Speculation I heard from IOS TAC yesterday regarding the
> me3600 crash was that it may be related to a cef route change bug in
> 15.3(1)S.  It seems that perhaps when the asr9010 linecard crashed, the
> several hundred routes learned via that pe-ce connection to the legacy
> 7609 changed, and that change propagated over the l3vpn and into the
> me3600's, causing them to do cef/fib convergence and possibly converge
> over to the other asr9k/7609 location.  BUT for now this is only
> speculation about the cause of the me3600 reloads.
> 
> More on that to come later, hopefully, from IOS TAC once I give them
> more crashinfos and running configs.
> 
> Bear in mind, I have (4) more me3600's config'd the same way as the
> crashed ones and they DID NOT reboot....those (4) run 15.2.2S or
> 15.2.4.S1
> 
> Syslog messages...
> 
> 2013-03-14 14:22:38	Local7.Emerg	9k	16328: LC/0/1/CPU0:Mar 14 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1
> 2013-03-14 14:22:38	Local7.Emerg	9k	16329: LC/0/1/CPU0:Mar 14 14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1
> 2013-03-14 14:22:38	Local7.Critical	9k	16330: LC/0/1/CPU0:Mar 14 14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required, process (pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
> 2013-03-14 14:22:38	Local7.Error	9k	16331: LC/0/1/CPU0:Mar 14 14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse error: unmatched "
> 2013-03-14 14:22:38	Local7.Error	9k	16333: LC/0/1/CPU0:Mar 14 14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR : sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out! Node will reset
> 2013-03-14 14:22:38	Local7.Error	9k	16335: LC/0/1/CPU0:Mar 14 14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR : sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc) did not complete in 45 seconds, shutting down
> 
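
When collecting this kind of evidence for a TAC case, it can help to pull the fields out of the syslog body mechanically. The sketch below parses the first message above (with the mail client's line wraps and quote markers removed) using a regular expression. The pattern and the group names (category, group, severity, mnemonic) are my own labels, written only against the messages in this thread; they are not taken from any Cisco specification and may not match other IOS XR messages.

```python
import re

# First syslog body from the message, unwrapped onto one logical line.
LINE = (
    "LC/0/1/CPU0:Mar 14 14:24:00.733 : pfm_node_lc[267]: "
    "%PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : Set|prm_server[155724]|"
    "Network Processor Unit(0x1007001)|NP DOUBLE ECC ERROR, "
    "NP=1, memId=18, subMemId=0x1"
)

PATTERN = re.compile(
    r"^(?P<node>\S+?):(?P<time>\w{3} +\d+ [\d:.]+) : "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"%(?P<category>[A-Z_]+)-(?P<group>[A-Z_]+)-"
    r"(?P<severity>\d)-(?P<mnemonic>[A-Z_]+) : "
    r"(?P<text>.*)$"
)

match = PATTERN.match(LINE)
fields = match.groupdict()

# key=value pairs at the end of the text, e.g. NP=1, memId=18
details = dict(re.findall(r"(\w+)=(\w+)", fields["text"]))

# The process that raised the fault appears as |name[pid]| inside the text.
reporter = re.search(r"\|(\w+)\[(\d+)\]\|", fields["text"]).groups()

print(fields["node"], fields["severity"], fields["mnemonic"])
print(details)
print(reporter)
```

The `reporter` tuple carries the same process ID (155724, prm_server) that shows up in the "Last Reset" line of the platform summary, which is the correlation pointed out earlier in the message.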
> 
> RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
> Fri Mar 15 08:17:12.824 CDT
> -------------------------------------------------------------------------------
>     Platform Node : 0/1/CPU0 (slot 1)
>               PID : A9K-2T20GE-L
>         Card Type : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
>            VID/SN : V03 / FOC15078GST
>        Oper State : IOS XR RUN
>        Last Reset : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server)
>                   : Thu Mar 14 19:24:00 2013
>     Configuration : Power is enabled
>                     Bootup enabled.
>                     Monitoring enabled
>        Rommon Ver : Version 1.03(20100212:011148)
>        IOS SW Ver : 4.1.2
>        Main Power : Power state Enabled. Estimate power 350 Watts of power required.
>            Faults : N/A
> -------------------------------------------------------------------------------
> 
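
Three different timestamps appear for the same event: 14:22:38 (stamped by the syslog server), 14:24:00.733 (stamped by the router inside the message), and 19:24:00 (the "Last Reset" field). The sketch below shows one way they could be reconciled. It assumes, without confirmation from the thread, that "Last Reset" is reported in UTC and that the other two are local CDT (UTC-5, as the CLI prompt timestamps suggest).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for US Central Daylight Time (in effect on 2013-03-14).
CDT = timezone(timedelta(hours=-5), "CDT")

# "Last Reset" from the platform summary; ASSUMED to be UTC.
last_reset = datetime(2013, 3, 14, 19, 24, 0, tzinfo=timezone.utc)
# Router-side timestamp from the syslog body; ASSUMED to be local CDT.
router_log = datetime(2013, 3, 14, 14, 24, 0, 733000, tzinfo=CDT)
# Timestamp added by the syslog server; ASSUMED to be local CDT.
server_log = datetime(2013, 3, 14, 14, 22, 38, tzinfo=CDT)

print(last_reset.astimezone(CDT))  # 2013-03-14 14:24:00-05:00
print(router_log - last_reset)     # 0:00:00.733000
print(router_log - server_log)     # 0:01:22.733000
```

If those assumptions hold, the router's own log and the "Last Reset" field agree to within a second, and the router clock runs roughly 83 seconds ahead of the syslog server. That offset is worth keeping in mind when lining up the asr9k crash against the me3600 reboot times.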
> RP/0/RSP0/CPU0:9k#sh instal summ
> Fri Mar 15 08:17:44.055 CDT
>  Active Packages:
>    disk0:asr9k-mini-p-4.1.2
>    disk0:asr9k-doc-p-4.1.2
>    disk0:asr9k-k9sec-p-4.1.2
>    disk0:asr9k-mpls-p-4.1.2
>    disk0:asr9k-mgbl-p-4.1.2
>    disk0:asr9k-mcast-p-4.1.2
> 
> 
> 
> aaron
> 
> 
> 
> 
> -----Original Message-----
> From: cisco-nsp-bounces at puck.nether.net 
> [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Jason Lixfeld
> Sent: Thursday, March 14, 2013 5:09 PM
> To: cisco-nsp at puck.nether.net NSP
> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all 
> rebooted at once!!
> 
> What XR version are you running?
> Trident or Typhoon cards?
> ME3600s all rebooted at the exact moment the LC crashed?
> ME3600 crashes with errors/crashinfo?
> OSPF is your IGP or IGP is something else and OSPF was inside a VRF 
> facing the CE?
> Is BFD for IGP and/or BFD for BGP enabled?
> BGP is straight BGP or MP-BGP to the ME3600s?
> LDP between ASR and ME3600s?
> 
> I don't have an answer for you, but there are some common elements on 
> my network based on the description you have provided here about your 
> network, so I'm asking probing questions to determine any other
> similarities.
> 
> 
> 
> On 2013-03-14, at 5:35 PM, "Aaron" <aaron1 at gvtc.com> wrote:
> 
>> Y'all know anything about this?
>> 
>> 
>> 
>> Something bad just happened in my network
>> 
>> 
>> 
>> I have an asr9010 that just showed a 2/20 module fail and come back 
>> up. The pe-ce link on that card also showed an ospf neighbor state 
>> bounce at that moment. AND that asr9010 is a route reflector for 
>> several of my pe's throughout my network. Of those pe's, (13) 
>> ME3600's running 15.3(1)S ALL REBOOTED!!!
>> 
>> 
>> 
>> I have another me3600 running 15.3(1)S that is not running bgp, and
>> it did not reboot.
>> 
>> 
>> 
>> I have several other me3600's running pre-15.3 code (so 15.2.something)
>> with a config similar to the rebooted me's, and they did NOT reboot.
>> 
>> 
>> 
>> Aaron
>> 
>> 
>> 
>> _______________________________________________
>> cisco-nsp mailing list  cisco-nsp at puck.nether.net 
>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>> archive at http://puck.nether.net/pipermail/cisco-nsp/
> 
> 


