[c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted at once!!
Jason Lixfeld
jason at lixfeld.ca
Fri Mar 15 13:34:26 EDT 2013
I have no VPLS in my network... Yet.
16T is a 16 port TenG blade, yes.
--
Sent from my mobile device
On 2013-03-15, at 1:03 PM, "Aaron" <aaron1 at gvtc.com> wrote:
> Another commonality the tac pointed out to me amongst my me's that crashed
> is that they are all running the l2vpn vpls address family.
>
> What's 16T? ...16 Ten gig ?
>
> Aaron
>
>
> -----Original Message-----
> From: Jason Lixfeld [mailto:jason at lixfeld.ca]
> Sent: Friday, March 15, 2013 10:01 AM
> To: Aaron
> Cc: cisco-nsp at puck.nether.net
> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all rebooted
> at once!!
>
> Interesting. I just checked my archives and I have had two instances where
> LCs have rebooted due to that same error. XR versions spanned 4.2.0 -
> 4.2.3. You are running older code than I am. Both instances of my LCs
> f**king off were on two separate ASR9Ks: the first time was a
> 2/20 (on 4.2.0), the second time was a 16T (on 4.2.3) on Jan. 1 (Happy New
> Year to me! :|)
>
> SRs 622594207 and 624325505. Cards were RMAd both times.
>
> 15.3(1)S has been out since November and at the time of the LC crash on
> January 1, I only had 1 ME3600 deployed running 15.3(1)S. It has been up
> for 100 days, so it lasted beyond the LC crash.
>
> At this point, I'm more interested in the "theory" TAC has about the
> 15.3(1)S bug that they think might have triggered the reboots. If you can
> pass me the SR or drop me a note when you find out one way or the other, I'd
> be grateful. Also, if 15.3(1)S1 fixes that bug, that would be good
> information as well.
>
> On 2013-03-15, at 10:06 AM, "Aaron" <aaron1 at gvtc.com> wrote:
>
>> 2 tac cases opened...one with ios team for me3600's and one opened
>> with ios xr team....
>>
>> Ios Cisco tac is still investigating (they want more crashinfo's and
>> running configs from me).... but thus far I have been told that the
>> 2/20 linecard in my asr9010 reloaded due to a double bit error (double
>> ecc; I believe ecc is error correcting code). Syslogs and cli output below.
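For readers unfamiliar with the term: a "double ECC" error is the case where an error-correcting code sees two flipped bits in one protected word, which it can detect but not repair. The sketch below is a toy single-error-correcting, double-error-detecting (SECDED) code built from an extended Hamming(8,4) layout. It only illustrates the general principle; the actual ECC scheme used by the ASR9K network processor memory is not described anywhere in this thread, and the function names here are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit  # overall parity makes the whole word even
    return code


def check(code):
    """Return 'ok', 'corrected' (word repaired in place) or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        # Odd parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        return "corrected"
    # Non-zero syndrome with even overall parity: two bits flipped.
    return "uncorrectable"


word = encode(0b1011)
word[5] ^= 1
print(check(word))   # single-bit flip
word[5] ^= 1
word[2] ^= 1
print(check(word))   # double-bit flip
```

With one flipped bit the word is repaired silently; with two, the only safe response is to report the fault, which is the situation the linecard reset in this thread reacts to.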
>>
>> The ios xr cisco tac engineer recommends replacing the linecard
>> if/when it happens a second time
>>
>> Ios Tac eng said that when one bit changes in memory, it's correctable,
>> but when two bits change it's uncorrectable and that linecard
>> reloads. Ios Tac eng also said that the linecard in the asr9k
>> seems to have crashed prior to the me3600's reloading. The syslogs
>> seem to agree: the bgp down messages for those me3600's started
>> a few minutes after 14:22:38 (when the asr9k linecard crashed)....
>> i think bgp keepalives default to 60 seconds and a bgp neighbor
>> session doesn't time out until the 180 second hold time expires
>> (I think 3*keepalive)
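The timer arithmetic above can be sketched as follows. This is a rough illustration, assuming the Cisco default timers quoted in the message (60 second keepalive, 180 second hold time), assuming the peers lost reachability at the 14:22:38 syslog-server timestamp, and assuming nothing faster (BFD, next-hop tracking and so on) tore the sessions down first. The function name is hypothetical.

```python
from datetime import datetime, timedelta

KEEPALIVE = timedelta(seconds=60)   # assumed default keepalive interval
HOLD_TIME = timedelta(seconds=180)  # assumed default hold time (3 x keepalive)


def hold_expiry_window(failure_time, keepalive=KEEPALIVE, hold=HOLD_TIME):
    """Return (earliest, latest) time a peer would declare the session down.

    The hold timer restarts whenever a KEEPALIVE or UPDATE arrives. The last
    message before the failure arrived somewhere between one keepalive
    interval before the failure and the failure itself, so the hold timer
    expires between (failure - keepalive + hold) and (failure + hold).
    """
    earliest = failure_time - keepalive + hold
    latest = failure_time + hold
    return earliest, latest


crash = datetime(2013, 3, 14, 14, 22, 38)
earliest, latest = hold_expiry_window(crash)
print(earliest.time(), latest.time())  # 14:24:38 14:25:38
```

Under those assumptions the sessions would drop two to three minutes after the linecard failed, which fits "a few minutes after" in the message above.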
>>
>> Here is the cli output for that card ... Last Reset :
>> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
>> 155724 (prm_server) : Thu Mar 14 19:24:00 2013
>>
>> Did you see that process id number ? 155724.....you will also see
>> that pid in the syslog messages.
>>
>> That's when the asr9k linecard reloaded and seems to have caused (13)
>> of my me3600's to reboot! These 13 me3600's are as follows....
>>
>> All run 15.3(1)S. They are scattered throughout my network...sparsely
>> located here and there....no real physical commonality among them.
>> All of these 13 me3600's run MP-iBGP with dual route reflectors....one
>> of the RR's is on that asr9010. This MP-iBGP is for mpls l3vpn's. The
>> pe-ce on the me3600's is directly connected routing...that's it. The
>> pe-ce in my core to connect to my legacy ip net is ospf from dual pe-ce
>> feeds for redundancy.
>> The pe-ce dual links are between dual asr9k/7609-s pairs.....the
>> asr9k's are in fact the dual rr's also. One of them is that asr9010
>> that had a linecard crash. Speculation I heard from ios tac
>> yesterday regarding the me3600 crash was that it may be related to a
>> cef route change bug in 15.3(1)S. It seems that perhaps when the
>> asr9010 linecard crashed, the several hundred routes learned via that
>> pe-ce connection to the legacy 7609 propagated over the l3vpn and into
>> the me3600's, thus causing them to do cef/fib convergence and possibly
>> converge over to the other asr9k/7609 location....BUT this is only
>> speculation about that being the cause of the me3600 reloads for now....
>> more on that to come later hopefully from ios tac when I give them
>> more crashinfo's and running configs...
>>
>> Bear in mind, I have (4) more me3600's config'd the same way as the
>> crashed ones and they DID NOT reboot....those (4) run 15.2.2S or
>> 15.2.4.S1
>>
>> Syslog messages...
>>
>> 2013-03-14 14:22:38 Local7.Emerg 9k 16328: LC/0/1/CPU0:Mar 14
>> 14:24:00.733 : pfm_node_lc[267]: %PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR :
>> Set|prm_server[155724]|Network Processor Unit(0x1007001)|NP DOUBLE ECC
>> ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38 Local7.Emerg 9k 16329: LC/0/1/CPU0:Mar 14
>> 14:24:00.736 : pfm_node_lc[267]: %PLATFORM-PFM-0-CARD_RESET_REQ :
>> pfm_dev_sm_perform_recovery_action, Card reset requested by: Process ID:
>> 155724 (prm_server), Fault Sev: 0, Target node: 0/1/CPU0, CompId:
>> 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault Reason: NP DOUBLE
>> ECC ERROR, NP=1, memId=18, subMemId=0x1
>> 2013-03-14 14:22:38 Local7.Critical 9k 16330: LC/0/1/CPU0:Mar 14
>> 14:24:00.737 : sysmgr[89]: %OS-SYSMGR-2-REBOOT : reboot required,
>> process
>> (pfm_node_lc) reason (pfm_dev_sm_perform_recovery_action, Card reset
>> requested by: Process ID: 155724 (prm_server), Fault Sev: 0, Target node:
>> 0/1/CPU0, CompId: 0x1f, Device Handle: 0x1007001, CondID: 1001, Fault
>> Reason: NP DOUBLE ECC ERROR, NP=1, memId=18, subMemId=0x1)
>> 2013-03-14 14:22:38 Local7.Error 9k 16331: LC/0/1/CPU0:Mar 14
>> 14:24:00.741 : sysmgr[89]: %OS-LIBSYSMGR-3-PARSE : parse_args: parse
>> error: unmatched "
>> 2013-03-14 14:22:38 Local7.Error 9k 16333: LC/0/1/CPU0:Mar 14
>> 14:24:00.742 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
>> sysmgr_shutdown_cleanup_handler: shutdown script execution timed-out!
>> Node will reset
>> 2013-03-14 14:22:38 Local7.Error 9k 16335: LC/0/1/CPU0:Mar 14
>> 14:24:00.743 : sysmgr[89]: %OS-SYSMGR-3-ERROR :
>> sysmgr_shutdown_cleanup_handler: shutdown triggered by (pfm_node_lc)
>> did not complete in 45 seconds, shutting down
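When collecting messages like these for a TAC case it can help to pull the fields out programmatically. The following is a minimal sketch that parses one of the messages above after the mail quoting and line wrapping have been removed. The regular expression is inferred only from the four message types shown in this thread and is not an official format definition; the `parse` function and its field names are made up for the example.

```python
import re

# The first message from the thread, unwrapped onto one logical line.
LINE = (
    "LC/0/1/CPU0:Mar 14 14:24:00.733 : pfm_node_lc[267]: "
    "%PLATFORM-NP-0-HW_DOUBLE_ECC_ERROR : Set|prm_server[155724]|"
    "Network Processor Unit(0x1007001)|NP DOUBLE ECC ERROR, NP=1, "
    "memId=18, subMemId=0x1"
)

PATTERN = re.compile(
    r"^(?P<node>[^:]+):(?P<ts>\w{3} +\d+ [\d:.]+) : "
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]: "
    r"%(?P<facility>[A-Z_]+)-(?P<subfac>[A-Z_]+)-(?P<sev>\d)-(?P<mnemonic>\w+) : "
    r"(?P<text>.*)$"
)


def parse(line):
    """Split a device-side message into header fields plus key=value pairs."""
    match = PATTERN.match(line)
    if match is None:
        return None
    result = match.groupdict()
    # Pick up trailing pairs such as NP=1, memId=18, subMemId=0x1.
    result["fields"] = dict(re.findall(r"(\w+)=(\w+)", result["text"]))
    return result


msg = parse(LINE)
print(msg["node"], msg["sev"], msg["mnemonic"], msg["fields"])
```

Note that the bracketed number after the process name (267 here) is the pid of the logging process, pfm_node_lc, while 155724 is the pid of prm_server that appears inside the message text.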
>>
>>
>> RP/0/RSP0/CPU0:9k#admin sh plat summ location 0/1/CPU0
>> Fri Mar 15 08:17:12.824 CDT
>> -------------------------------------------------------------------------------
>> Platform Node : 0/1/CPU0 (slot 1)
>> PID           : A9K-2T20GE-L
>> Card Type     : 2-Port 10GE, 20-Port GE Low Queue LC, Req. XFPs and SFPs
>> VID/SN        : V03 / FOC15078GST
>> Oper State    : IOS XR RUN
>> Last Reset    : pfm_dev_sm_perform_recovery_action, Card reset
>>                 requested by: Process ID: 155724 (prm_server)
>>               : Thu Mar 14 19:24:00 2013
>> Configuration : Power is enabled
>>                 Bootup enabled.
>>                 Monitoring enabled
>> Rommon Ver    : Version 1.03(20100212:011148)
>> IOS SW Ver    : 4.1.2
>> Main Power    : Power state Enabled. Estimate power 350 Watts of
>>                 power required.
>> Faults        : N/A
>> -------------------------------------------------------------------------------
>>
>> RP/0/RSP0/CPU0:9k#sh instal summ
>> Fri Mar 15 08:17:44.055 CDT
>> Active Packages:
>> disk0:asr9k-mini-p-4.1.2
>> disk0:asr9k-doc-p-4.1.2
>> disk0:asr9k-k9sec-p-4.1.2
>> disk0:asr9k-mpls-p-4.1.2
>> disk0:asr9k-mgbl-p-4.1.2
>> disk0:asr9k-mcast-p-4.1.2
>>
>>
>>
>> aaron
>>
>>
>>
>>
>> -----Original Message-----
>> From: cisco-nsp-bounces at puck.nether.net
>> [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Jason Lixfeld
>> Sent: Thursday, March 14, 2013 5:09 PM
>> To: cisco-nsp at puck.nether.net NSP
>> Subject: Re: [c-nsp] whoa - asr9k wierd message AND 13 me3600's all
>> rebooted at once!!
>>
>> What XR version are you running?
>> Trident or Typhoon cards?
>> ME3600s all rebooted at the exact moment the LC crashed?
>> ME3600 crashes with errors/crashinfo?
>> OSPF is your IGP or IGP is something else and OSPF was inside a VRF
>> facing the CE?
>> Is BFD for IGP and/or BFD for BGP enabled?
>> BGP is straight BGP or MP-BGP to the ME3600s?
>> LDP between ASR and ME3600s?
>>
>> I don't have an answer for you, but there are some common elements on
>> my network based on the description you have provided here about your
>> network, so I'm asking probing questions to determine any other
>> similarities.
>>
>> --
>>
>> Sent from my mobile device
>>
>>
>> On 2013-03-14, at 5:35 PM, "Aaron" <aaron1 at gvtc.com> wrote:
>>
>>> Y'all know anything about this?
>>>
>>>
>>>
>>> Something bad just happened in my network
>>>
>>>
>>>
>>> I have an asr9010 that just showed a 2/20 module fail and come back
>>> up. The pe-ce link on that card also showed ospf neighbor state
>>> bounce at that moment. AND that asr9010 is a route reflector for
>>> several of my pe's throughout my network. Of those pe's, (13)
>>> ME3600's running 15.3(1)S ALL REBOOTED!!!
>>>
>>>
>>>
>>> ..i have another me3600 running 15.3(1)S that is not running bgp that
>>> did not reboot
>>>
>>>
>>>
>>> ..i have several other me3600's running pre 15.3 (so 15.2.something)
>>> that are running similar config as the rebooted me's, which did NOT
>>> reboot
>>>
>>>
>>>
>>> Aaron
>>>
>>>
>>>
>>> _______________________________________________
>>> cisco-nsp mailing list cisco-nsp at puck.nether.net
>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>
>