[c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted at once!!

Jason Lixfeld jason at lixfeld.ca
Fri Mar 15 11:00:50 EDT 2013


Interesting.  I just checked my archives and I have had two instances where LCs rebooted due to that same error.  XR versions spanned 4.2.0 - 4.2.3, so you are running older code than I am.  The two LC reboots were on two separate ASR9Ks: the first was a 2/20 (on 4.2.0), the second was a 16T (on 4.2.3) on Jan. 1 (Happy New Year to me! :|).

SRs 622594207 and 624325505.  Cards were RMAd both times.

15.3(1)S has been out since November and at the time of the LC crash on January 1, I only had 1 ME3600 deployed running 15.3(1)S.  It has been up for 100 days, so it lasted beyond the LC crash.

At this point, I'm more interested in the "theory" TAC has about the 15.3(1)S bug that they think might have triggered the reboots.  If you can pass me the SR or drop me a note when you find out one way or the other, I'd be grateful.  Also, if 15.3(1)S1 fixes that bug, that would be good information as well.

On 2013-03-15, at 10:06 AM, "Aaron" <aaron1 at gvtc.com> wrote:

> 2 tac cases opened...one with ios team for me3600's and one opened with ios
> xr team....
> 
> Ios Cisco tac is still investigating (they want more crashinfo's and running
> configs from me).... but thus far I have been told that my 2/20 linecard in
> my asr9010 reloaded due to a double bit error (double ecc (I believe is
> error correcting code)).  Syslogs and cli output below.
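To illustrate the single-bit vs double-bit distinction, here is a minimal Python sketch of a SECDED (single error correct, double error detect) code built from Hamming(7,4) plus one overall parity bit. This is a generic textbook construction picked only for illustration; nothing in this thread says which ECC scheme the NP memory on the linecard actually uses, and the function names are made up.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d   # data bits sit at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2              # makes the total number of 1s even
    return code


def check(code):
    """Classify a codeword as 'ok', 'corrected' or 'uncorrectable'.

    A single flipped bit is repaired in place; two flipped bits are only detected.
    """
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    odd_parity = sum(code) % 2          # 1 if an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        return "ok"
    if odd_parity == 1:
        code[syndrome] ^= 1             # syndrome 0 means the parity bit itself flipped
        return "corrected"
    return "uncorrectable"              # non-zero syndrome with even parity: two flips


word = encode(0b1011)

single = list(word)
single[5] ^= 1                          # one bit flips
print(check(single), single == word)    # corrected True

double = list(word)
double[3] ^= 1                          # two bits flip
double[6] ^= 1
print(check(double))                    # uncorrectable
```

With two flipped bits the decoder can tell something is wrong but cannot tell which bits, which is why hardware typically has to escalate (here, a card reset) instead of repairing the data.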
> 
> Ios xr cisco tac team says that he recommends replacing linecard if/when it
> happens a second time
> 
> Ios Tac eng said that when a bit changes in memory, it's correctable, but
> when two bits change then it's uncorrectable and a reload on that linecard
> occurs.  Ios Tac eng said that the linecard in the asr9k seems to have
> crashed prior to the me3600's reloading.  This seems to be seen also in that
> the syslog messages regarding the bgp down messages with those me3600's
> started happening a few minutes after 14:22:38 (when the asr9k linecard
> crashed)....i think bgp keepalives default to 60 seconds and a bgp neighbor
> session doesn't time out until 180 seconds ( I think 3*keepalives)
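The timing argument can be checked with a little arithmetic. The sketch below assumes the 60 second keepalive and 180 second hold time mentioned above, assumes nothing like BFD or fast fallover tears the session down sooner, and uses the syslog-server timestamp (14:22:38) as the failure time. The function name is made up for illustration.

```python
from datetime import datetime, timedelta

KEEPALIVE = timedelta(seconds=60)    # assumed keepalive interval
HOLD_TIME = timedelta(seconds=180)   # assumed hold time (3 x keepalive)


def hold_expiry_window(failure_time, keepalive=KEEPALIVE, hold_time=HOLD_TIME):
    """Return (earliest, latest) time a peer would declare the session down.

    The hold timer restarts on every keepalive received. The last keepalive
    arrived somewhere between (failure - keepalive) and the failure itself,
    so the timer expires between those two instants plus the hold time.
    """
    earliest = failure_time - keepalive + hold_time
    latest = failure_time + hold_time
    return earliest, latest


crash = datetime(2013, 3, 14, 14, 22, 38)   # syslog-server timestamp of the LC reset
earliest, latest = hold_expiry_window(crash)
print(earliest.time(), latest.time())       # 14:24:38 14:25:38
```

Under those assumptions the BGP down messages would be expected two to three minutes after the linecard reset, which fits "a few minutes after".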
> 
> Here is the cli output for that card ...        Last Reset :
> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
> 155724 (prm_server)                   : Thu Mar 14 19:24:00 2013
> 
> Did you see that process id number ?  155724.....you will also see that pid
> in the syslog messages.
> 
> That's when the asr9k linecard reloaded and seems to have caused (13) of my
> me3600's to reboot!  These 13 me3600's are as follows....
> 
> All run 15.3(1)S.  they are scattered throughout my network...sparsely
> located here and there....no real physical commonality among them.
> All of these 13 me3600's run Mp-iBGP with dual RouteReflectors....one of the
> RR's is on that asr9010.  This mpibgp is for mpls l3vpn's.  the pe-ce on the
> me3600's is directly connected routing...that's it.  The pe-ce in my core to
> connect to my legacy ip net is ospf from dual pe-ce feeds for redundancy.
> The pe-ce dual links are between dual asr9k/7609-s pairs.....the asr9k's are
> in fact the dual rr's also.  One of them is that asr9010 that had a
> linecard crash.  Speculation I heard from ios tac yesterday regarding the
> me3600 crash was maybe related to a cef route change bug in 15.3(1)S.  seems
> that perhaps when the asr9010 linecard crashed, the several hundred routes
> learned via that pe-ce connection to the legacy 7609 propagated over the
> l3vpn and into the me3600's, thus causing them to do cef/fib convergence and
> possibly converge over to the other asr9k/7609 location....BUT this is only
> speculation about that being the cause of the me3600 reloads for now....
> more on that to come later hopefully from ios tac when I give them more
> crashinfo's and running configs...
> 
> Bear in mind, I have (4) more me3600's config'd same way as the crashed ones
> and they DID NOT reboot....those (4) run 15.2.2S or 15.2.4.S1
> 
> Syslog messages...
> 
> 2013-03-14 14:22:38	Local7.Emerg	9k	16328: LC/0/1/CPU0:Mar 14
> 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR :
> Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC
> ERROR, NP=1, memId=18, subMemId=0x1
> 2013-03-14 14:22:38	Local7.Emerg	9k	16329: LC/0/1/CPU0:Mar 14
> 14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ :
> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
> 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f,
> Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR,
> NP=1, memId=18, subMemId=0x1
> 2013-03-14 14:22:38	Local7.Critical	9k	16330: LC/0/1/CPU0:Mar 14
> 14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required, process
> (pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset
> requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node:
> 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault
> Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
> 2013-03-14 14:22:38	Local7.Error	9k	16331: LC/0/1/CPU0:Mar 14
> 14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse error:
> unmatched "
> 2013-03-14 14:22:38	Local7.Error	9k	16333: LC/0/1/CPU0:Mar 14
> 14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
> sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out! Node
> will reset
> 2013-03-14 14:22:38	Local7.Error	9k	16335: LC/0/1/CPU0:Mar 14
> 14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
> sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc) did not
> complete in 45 seconds, shutting down
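For anyone wanting to pull the PID and fault details out of messages like these, here is a rough Python sketch. It strips the mail quoting, rejoins the wrapped lines, and picks fields out of the first message only. The regex and helper names are my own, and reading the digit in the message code as a severity is an assumption about the layout, not something confirmed in this thread.

```python
import re

# First message from the excerpt, still quoted and wrapped as it appears in the mail.
RAW = """\
> 2013-03-14 14:22:38\tLocal7.Emerg\t9k\t16328: LC/0/1/CPU0:Mar 14
> 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR :
> Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC
> ERROR, NP=1, memId=18, subMemId=0x1
"""

LINE_RE = re.compile(
    r"(?P<node>\S+?):(?P<ts>\w{3} +\d+ [\d:.]+) : "
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]: "
    r"%(?P<mnemonic>[A-Z0-9_-]+) : (?P<text>.*)"
)


def unquote_and_join(raw):
    """Drop leading '>' quote markers and rejoin wrapped lines with spaces."""
    parts = [re.sub(r"^>+ ?", "", line).strip() for line in raw.splitlines()]
    return " ".join(p for p in parts if p)


def parse(message):
    """Return a dict of fields from one rejoined message, or None if no match."""
    m = LINE_RE.search(message)
    if m is None:
        return None
    info = m.groupdict()
    sev = re.search(r"-(\d)-", info["mnemonic"])      # assumed severity digit
    info["severity"] = int(sev.group(1)) if sev else None
    requester = re.search(r"(\w+)\[(\d+)\]", info["text"])
    if requester:
        info["requester"] = requester.group(1)
        info["requester_pid"] = int(requester.group(2))
    info["fields"] = dict(re.findall(r"(\w+)=(\w+)", info["text"]))
    return info


info = parse(unquote_and_join(RAW))
print(info["node"], info["requester"], info["requester_pid"], info["fields"])
```

The later messages name the requester as "Process ID: 155724 (prm_server)" rather than "prm_server[155724]", so the requester pattern would need a second alternative to cover them.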
> 
> 
> RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
> Fri Mar 15 08:17:12.824 CDT
> -------------------------------------------------------------------------------
>     Platform Node : 0/1/CPU0 (slot 1)
>               PID : A9K-2T20GE-L
>         Card Type : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
>            VID/SN : V03 / FOC15078GST
>        Oper State : IOS XR RUN
>        Last Reset : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server)
>                   : Thu Mar 14 19:24:00 2013
>     Configuration : Power is enabled
>                     Bootup enabled.
>                     Monitoring enabled
>        Rommon Ver : Version 1.03(20100212:011148)
>        IOS SW Ver : 4.1.2
>        Main Power : Power state Enabled. Estimate power 350 Watts of power required.
>            Faults : N/A
> -------------------------------------------------------------------------------
> 
> RP/0/RSP0/CPU0:9k#sh instal summ
> Fri Mar 15 08:17:44.055 CDT
>  Active Packages:
>    disk0:asr9k-mini-p-4.1.2
>    disk0:asr9k-doc-p-4.1.2
>    disk0:asr9k-k9sec-p-4.1.2
>    disk0:asr9k-mpls-p-4.1.2
>    disk0:asr9k-mgbl-p-4.1.2
>    disk0:asr9k-mcast-p-4.1.2
> 
> 
> 
> aaron
> 
> 
> 
> 
> -----Original Message-----
> From: cisco-nsp-bounces at puck.nether.net
> [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Jason Lixfeld
> Sent: Thursday, March 14, 2013 5:09 PM
> To: cisco-nsp at puck.nether.net NSP
> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
> at once!!
> 
> What XR version are you running?
> Trident or Typhoon cards?
> ME3600s all rebooted at the exact moment the LC crashed?
> ME3600 crashes with errors/crashinfo?
> OSPF is your IGP or IGP is something else and OSPF was inside a VRF facing
> the CE?
> Is BFD for IGP and/or BFD for BGP enabled?
> BGP is straight BGP or MP-BGP to the ME3600s?
> LDP between ASR and ME3600s?
> 
> I don't have an answer for you, but there are some common elements on my
> network based on the description you have provided here about your network,
> so I'm asking probing questions to determine any other similarities.
> 
> --
> 
> Sent from my mobile device
> 
> 
> On 2013-03-14, at 5:35 PM, "Aaron" <aaron1 at gvtc.com> wrote:
> 
>> Y'all know anything about this?
>> 
>> 
>> 
>> Something bad just happened in my network
>> 
>> 
>> 
>> I have an asr9010 that just showed a 2/20 module fail and come back 
>> up. the pe-ce link on that card also showed ospf neighbor state bounce 
>> at that moment. AND that asr9010 is a route reflector for several of my 
>> pe's throughout my network.. Of those pe's (13) ME3600's running 
>> 15.3(1)S ALL REBOOTED!!!
>> 
>> 
>> 
>> ..i have another me3600 running 15.3(1)S that is not running bgp that 
>> did not reboot
>> 
>> 
>> 
>> ..i have several other me3600's running pre 15.3 (so 15.2.something) 
>> that are running similar config as the rebooted me's, which did NOT 
>> reboot
>> 
>> 
>> 
>> Aaron
>> 
>> 
>> 
>> _______________________________________________
>> cisco-nsp mailing list  cisco-nsp at puck.nether.net 
>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>> archive at http://puck.nether.net/pipermail/cisco-nsp/
> 
> 



