[c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted at once!!

Jason Lixfeld jason at lixfeld.ca
Fri Mar 15 13:34:26 EDT 2013


I have no VPLS in my network... Yet.

Yes, 16T is a 16-port 10GE blade.

--

Sent from my mobile device


On 2013-03-15, at 1:03 PM, "Aaron" <aaron1 at gvtc.com> wrote:

> Another commonality TAC pointed out to me among my ME3600s that crashed
> is that they are all running the l2vpn vpls address family.
> 
> What's 16T?  A 16-port 10GE card?
> 
> Aaron
> 
> 
> -----Original Message-----
> From: Jason Lixfeld [mailto:jason at lixfeld.ca] 
> Sent: Friday, March 15, 2013 10:01 AM
> To: Aaron
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
> at once!!
> 
> Interesting.  I just checked my archives and I have had two instances where
> LCs have rebooted due to that same error.  XR versions spanned 4.2.0 -
> 4.2.3, so you are running older code than I am.  The two instances of my
> LCs f**king off were on separate ASR9Ks: the first was a 2/20 (on 4.2.0)
> and the second was a 16T (on 4.2.3) on Jan. 1 (Happy New Year to me! :|)
> 
> SRs 622594207 and 624325505.  Cards were RMAd both times.
> 
> 15.3(1)S has been out since November and at the time of the LC crash on
> January 1, I only had 1 ME3600 deployed running 15.3(1)S.  It has been up
> for 100 days, so it lasted beyond the LC crash.
> 
> At this point, I'm more interested in the "theory" TAC has about the
> 15.3(1)S bug that they think might have triggered the reboots.  If you can
> pass me the SR or drop me a note when you find out one way or the other, I'd
> be grateful.  Also, if 15.3(1)S1 fixes that bug, that would be good
> information as well.
> 
> On 2013-03-15, at 10:06 AM, "Aaron" <aaron1 at gvtc.com> wrote:
> 
>> Two TAC cases opened: one with the IOS team for the ME3600s and one 
>> with the IOS XR team.
>> 
>> IOS TAC is still investigating (they want more crashinfos and running 
>> configs from me), but thus far I have been told that the 2/20 linecard 
>> in my ASR9010 reloaded due to a double-bit ECC error (ECC, I believe, 
>> is error-correcting code).  Syslogs and CLI output below.
>> 
>> The IOS XR TAC engineer recommends replacing the linecard if/when it 
>> happens a second time.
>> 
>> The IOS TAC engineer said that when one bit changes in memory, it's 
>> correctable, but when two bits change it's uncorrectable and that 
>> linecard reloads.  The engineer also said that the linecard in the 
>> ASR9K seems to have crashed before the ME3600s reloaded.  The syslogs 
>> seem to bear this out: the BGP down messages for those ME3600s started 
>> a few minutes after 14:22:38 (when the ASR9K linecard crashed).  I 
>> think BGP keepalives default to 60 seconds and a BGP neighbor session 
>> doesn't time out until the hold time of 180 seconds expires (3 x 
>> keepalive, I think).
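
For reference, the correctable versus uncorrectable distinction TAC describes above matches how SECDED (single-error-correct, double-error-detect) codes behave.  Below is a minimal Python sketch using an extended Hamming(8,4) code.  It is a toy illustration only: the function names are made up for this example, and the actual ECC scheme and word size used by the ASR9K network processor memory are not stated anywhere in this thread.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indexes 1..7 hold a Hamming(7,4)
    word with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 while the total parity is intact
    if syndrome == 0 and overall == 0:
        return "ok", list(code)
    if overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Parity intact but syndrome non-zero: two bits flipped, cannot repair.
    return "uncorrectable", list(code)


if __name__ == "__main__":
    word = encode(0b1011)
    single = list(word)
    single[5] ^= 1
    double = list(single)
    double[2] ^= 1
    print(check(single)[0])   # single-bit flip
    print(check(double)[0])   # double-bit flip
```

With one flipped bit the decoder can locate and repair the error; with two it can only report that the word is bad, which is the case where hardware typically has to reset rather than continue with corrupt memory.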
>> 
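
The BGP timer arithmetic mentioned above can be sketched as follows.  This assumes the 60-second keepalive and 180-second hold time Aaron recalls, assumes the session is torn down purely by hold-timer expiry (no BFD or fast fallover), ignores keepalive jitter, and uses the 14:22:38 syslog collector timestamp as the failure time.  `hold_expiry_window` is a hypothetical helper, not anything from a Cisco tool.

```python
from datetime import datetime, timedelta

KEEPALIVE = timedelta(seconds=60)   # assumed default keepalive interval
HOLD_TIME = 3 * KEEPALIVE           # assumed default hold time (180 s)


def hold_expiry_window(failure_time):
    """Return (earliest, latest) time a peer would declare the session down.

    The hold timer restarts on every received keepalive, so when the far
    end dies, anywhere from zero to one keepalive interval has already
    elapsed since the timer was last reset.
    """
    earliest = failure_time + HOLD_TIME - KEEPALIVE
    latest = failure_time + HOLD_TIME
    return earliest, latest


if __name__ == "__main__":
    crash = datetime(2013, 3, 14, 14, 22, 38)
    earliest, latest = hold_expiry_window(crash)
    print(earliest.time(), latest.time())
```

Under those assumptions a hold-timer-driven session drop would land two to three minutes after the failure, which is in the same range as the "few minutes after" gap described above.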
>> Here is the CLI output for that card:
>> 
>>       Last Reset : pfm_dev_sm_perform_recovery_action, Card reset
>>                    requested by: Process ID: 155724 (prm_server)
>>                  : Thu Mar 14 19:24:00 2013
>> 
>> Did you see that process ID number?  155724.  You will also see that 
>> PID in the syslog messages.
>> 
>> That's when the ASR9K linecard reloaded and seems to have caused (13) 
>> of my ME3600s to reboot!  Details on these 13 ME3600s follow.
>> 
>> All run 15.3(1)S.  They are scattered throughout my network, sparsely 
>> located here and there, with no real physical commonality among them.
>> All 13 of these ME3600s run MP-iBGP with dual route reflectors; one of 
>> the RRs is on that ASR9010.  This MP-iBGP is for MPLS L3VPNs.  The 
>> PE-CE on the ME3600s is directly connected routing, that's it.  The 
>> PE-CE in my core to connect to my legacy IP net is OSPF from dual 
>> PE-CE feeds for redundancy.  The PE-CE dual links are between dual 
>> ASR9K/7609-S pairs; the ASR9Ks are in fact the dual RRs also.  One of 
>> them is that ASR9010 that had a linecard crash.
>> 
>> Speculation I heard from IOS TAC yesterday regarding the ME3600 crash 
>> was that it may be related to a CEF route change bug in 15.3(1)S.  It 
>> seems that perhaps when the ASR9010 linecard crashed, the several 
>> hundred routes learned via that PE-CE connection to the legacy 7609 
>> propagated over the L3VPN and into the ME3600s, causing them to do 
>> CEF/FIB convergence and possibly converge over to the other 
>> ASR9K/7609 location.  BUT for now this is only speculation about the 
>> cause of the ME3600 reloads.  More on that to come later, hopefully, 
>> from IOS TAC when I give them more crashinfos and running configs.
>> 
>> Bear in mind, I have (4) more ME3600s configured the same way as the 
>> crashed ones and they DID NOT reboot.  Those (4) run 15.2(2)S or 
>> 15.2(4)S1.
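
The process of elimination going on here can be written down as a small check.  The sketch below uses only the counts reported in this thread (13 rebooted boxes on 15.3(1)S with BGP, one 15.3(1)S box without BGP that stayed up, and four 15.2 boxes with the same config that stayed up).  The `Group` type and `explains` helper are invented for illustration, the sample is tiny, and a clean split like this shows correlation only; it does not confirm TAC's theory, and other commonalities (such as the VPLS address family) are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    ios: str          # IOS release as reported in the thread
    runs_bgp: bool    # whether the boxes run MP-iBGP to the RRs
    count: int        # number of ME3600s in this group
    rebooted: bool    # whether the group reloaded during the event


GROUPS = [
    Group("15.3(1)S", True, 13, True),
    Group("15.3(1)S", False, 1, False),
    Group("15.2(2)S or 15.2(4)S1", True, 4, False),
]


def explains(groups, key):
    """True if the values of `key` on rebooted groups never appear on
    groups that stayed up, i.e. the trait cleanly separates the two sets."""
    hit = {key(g) for g in groups if g.rebooted}
    miss = {key(g) for g in groups if not g.rebooted}
    return hit.isdisjoint(miss)


if __name__ == "__main__":
    print("release alone:", explains(GROUPS, lambda g: g.ios))
    print("BGP alone:", explains(GROUPS, lambda g: g.runs_bgp))
    print("release + BGP:", explains(GROUPS, lambda g: (g.ios, g.runs_bgp)))
```

Neither the release nor BGP on its own separates the rebooted boxes from the survivors; only the combination does, which is consistent with (but not proof of) a 15.3(1)S bug triggered by routing changes.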
>> 
>> Syslog messages...
>> 
>> 2013-03-14 14:22:38    Local7.Emerg    9k    16328: LC/0/1/CPU0:Mar 14 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38    Local7.Emerg    9k    16329: LC/0/1/CPU0:Mar 14 14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38    Local7.Critical    9k    16330: LC/0/1/CPU0:Mar 14 14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required, process (pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
>> 2013-03-14 14:22:38    Local7.Error    9k    16331: LC/0/1/CPU0:Mar 14 14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse error: unmatched "
>> 2013-03-14 14:22:38    Local7.Error    9k    16333: LC/0/1/CPU0:Mar 14 14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR : sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out! Node will reset
>> 2013-03-14 14:22:38    Local7.Error    9k    16335: LC/0/1/CPU0:Mar 14 14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR : sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc) did not complete in 45 seconds, shutting down
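
Each syslog entry above carries two timestamps (the collector's and the linecard's) plus the reporting process and the process that raised the fault.  The sketch below pulls those fields apart with a regular expression.  The pattern was written against the lines in this message only, not against any documented Cisco or syslog-server format, it assumes both timestamps are in the same time zone, and it assumes the wrapped entry has been rejoined onto one line.  `LINE` and `parse` are names made up for this example.

```python
import re
from datetime import datetime

# Pattern inferred from the lines in this thread; an assumption, not a spec.
LINE = re.compile(
    r"^(?P<collector>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<facility>\w+)\.(?P<severity>\w+)\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<seq>\d+): (?P<node>[^:]+):"
    r"(?P<device>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}) : "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"%(?P<mnemonic>[A-Z0-9_-]+) : (?P<text>.*)$"
)

# First syslog entry from the thread, rejoined onto a single line.
SAMPLE = (
    "2013-03-14 14:22:38    Local7.Emerg    9k    16328: LC/0/1/CPU0:Mar 14 "
    "14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : "
    "Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC "
    "ERROR, NP=1, memId=18, subMemId=0x1"
)


def parse(line):
    """Split one rejoined syslog line into fields; None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    collector = datetime.strptime(rec["collector"], "%Y-%m-%d %H:%M:%S")
    # The device stamp has no year, so borrow it from the collector stamp.
    device = datetime.strptime(
        f"{collector.year} {rec['device']}", "%Y %b %d %H:%M:%S.%f"
    )
    rec["skew_seconds"] = (device - collector).total_seconds()
    # First "name[pid]" inside the message body is the originating process.
    origin = re.search(r"(\w+)\[(\d+)\]", rec["text"])
    rec["origin"] = (origin.group(1), int(origin.group(2))) if origin else None
    return rec


if __name__ == "__main__":
    rec = parse(SAMPLE)
    print(rec["mnemonic"], rec["origin"], rec["skew_seconds"])
```

On the sample line the linecard clock reads roughly 82.7 seconds ahead of the collector timestamp (14:24:00.733 versus 14:22:38), which is worth keeping in mind when lining these entries up against BGP down messages logged elsewhere.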
>> 
>> 
>> RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
>> Fri Mar 15 08:17:12.824 CDT
>> -------------------------------------------------------------------------------
>>    Platform Node : 0/1/CPU0 (slot 1)
>>              PID : A9K-2T20GE-L
>>        Card Type : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
>>           VID/SN : V03 / FOC15078GST
>>       Oper State : IOS XR RUN
>>       Last Reset : pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID: 155724 (prm_server)
>>                  : Thu Mar 14 19:24:00 2013
>>    Configuration : Power is enabled
>>                    Bootup enabled.
>>                    Monitoring enabled
>>       Rommon Ver : Version 1.03(20100212:011148)
>>       IOS SW Ver : 4.1.2
>>       Main Power : Power state Enabled. Estimate power 350 Watts of power required.
>>           Faults : N/A
>> -------------------------------------------------------------------------------
>> 
>> RP/0/RSP0/CPU0:9k#sh instal summ
>> Fri Mar 15 08:17:44.055 CDT
>> Active Packages:
>>   disk0:asr9k-mini-p-4.1.2
>>   disk0:asr9k-doc-p-4.1.2
>>   disk0:asr9k-k9sec-p-4.1.2
>>   disk0:asr9k-mpls-p-4.1.2
>>   disk0:asr9k-mgbl-p-4.1.2
>>   disk0:asr9k-mcast-p-4.1.2
>> 
>> 
>> 
>> aaron
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: cisco-nsp-bounces at puck.nether.net 
>> [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Jason Lixfeld
>> Sent: Thursday, March 14, 2013 5:09 PM
>> To: cisco-nsp at puck.nether.net NSP
>> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all 
>> rebooted at once!!
>> 
>> What XR version are you running?
>> Trident or Typhoon cards?
>> ME3600s all rebooted at the exact moment the LC crashed?
>> ME3600 crashes with errors/crashinfo?
>> OSPF is your IGP or IGP is something else and OSPF was inside a VRF 
>> facing the CE?
>> Is BFD for IGP and/or BFD for BGP enabled?
>> BGP is straight BGP or MP-BGP to the ME3600s?
>> LDP between ASR and ME3600s?
>> 
>> I don't have an answer for you, but there are some common elements on 
>> my network based on the description you have provided here about your 
>> network, so I'm asking probing questions to determine any other 
>> similarities.
>> 
>> --
>> 
>> Sent from my mobile device
>> 
>> 
>> On 2013-03-14, at 5:35 PM, "Aaron" <aaron1 at gvtc.com> wrote:
>> 
>>> Y'all know anything about this?
>>> 
>>> Something bad just happened in my network.
>>> 
>>> I have an ASR9010 that just showed a 2/20 module fail and come back 
>>> up.  The PE-CE link on that card also showed an OSPF neighbor state 
>>> bounce at that moment.  AND that ASR9010 is a route reflector for 
>>> several of my PEs throughout my network.  Of those PEs, (13) 
>>> ME3600s running 15.3(1)S ALL REBOOTED!!!
>>> 
>>> I have another ME3600 running 15.3(1)S that is not running BGP, and 
>>> it did not reboot.
>>> 
>>> I have several other ME3600s running pre-15.3 code (so 
>>> 15.2.something) with a similar config to the rebooted MEs, and they 
>>> did NOT reboot.
>>> 
>>> Aaron
>>> 
>>> _______________________________________________
>>> cisco-nsp mailing list  cisco-nsp at puck.nether.net 
>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>> 


