[c-nsp] C6k diag failure in lab, need to worry?

Thu Apr 10 13:14:03 EDT 2008

Agreed. We will strive to do whatever we can. 

Just want to point out that this is not a "crash", but a second reset on
bootup.

As Peter pointed out, it extends the bootup time in the roughly 1% of
bootups where it happens.

sukumar
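
For a rough sense of what a 1-in-100 per-boot rate means, here is a minimal
Python sketch. It assumes every reboot independently has the same chance of
hitting the extra reset, which is a simplification: the thread only says
"once in 100+ reboots, on very few systems". The function names are made up
for illustration.

```python
def p_at_least_one(p_per_boot: float, boots: int) -> float:
    """Chance that at least one of `boots` reboots hits the extra reset.

    Assumes independent reboots with identical probability (an assumption,
    not something measured in this thread).
    """
    return 1.0 - (1.0 - p_per_boot) ** boots


def expected_extra_resets(p_per_boot: float, boots: int) -> float:
    """Expected number of diag-triggered second resets over `boots` reboots."""
    return p_per_boot * boots


if __name__ == "__main__":
    rate = 0.01  # the "1%" figure quoted in the thread
    for boots in (1, 10, 100):
        print(boots, round(p_at_least_one(rate, boots), 3))
    # The "100 million reboots" arithmetic from the quoted reply.
    print(expected_extra_resets(rate, 100_000_000))
```

Under that assumption a single box is unlikely to see the extra reset on any
given boot, but a large fleet that reboots often will see it regularly.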

> -----Original Message-----
> From: e ninja [mailto:eninja at gmail.com] 
> Sent: Wednesday, April 09, 2008 11:11 PM
> To: Sukumar Subburayan (sukumars)
> Cc: Peter Rathlev; cisco-nsp
> Subject: Re: [c-nsp] C6k diag failure in lab, need to worry?
> 
> Sukumar,
> 
> " You can ignore this one, as it _should_ not have any 
> impact, after the second reload." is not an acceptable answer. 
> 
> 1 crash in every 100 reboots = 1 million crashes out of every 
> 100 million reboots. In our quest for perfection, we should 
> strive to investigate and rectify every unexpected deviation 
> from the norm. 
> 
> Peter,
> 
> Open a TAC case and submit all the captures for Cisco BU to 
> investigate and rectify so that all other customers can 
> benefit from the solution.
> 
> /eninja
> 
> 
> 
> On Wed, Apr 9, 2008 at 10:16 AM, Sukumar Subburayan 
> (sukumars) <sukumars at cisco.com> wrote:
> 
> 
> 	Peter,
> 	
> 	You can ignore this one, as it should not have any impact, after the
> 	second reload.
> 	
> 	We have seen this very rarely (once in 100+ reboots, on very few
> 	systems), where an ASIC was not initialized properly, and
> 	diagnostics caught the condition and reset the supervisor.
> 	
> 	sukumar
> 	
> 
> 
> 
> 
> 	> -----Original Message-----
> 	> From: cisco-nsp-bounces at puck.nether.net
> 	> [mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Peter Rathlev
> 	> Sent: Wednesday, April 09, 2008 8:40 AM
> 	> To: cisco-nsp
> 	> Subject: [c-nsp] C6k diag failure in lab, need to worry?
> 	>
> 	> 'ello,
> 	>
> 	> We just had a "funny" experience with a C6k/720 in our lab.
> 	> We were testing SXF13 AIS, and during a reload we saw the following:
> 	>
> 	> 00:01:36: %SCHED-SP-7-WATCH: Attempt to monitor uninitialized watched bitfield (address 0).
> 	> -Process= "Shutdown", ipl= 0, pid= 256
> 	> -Traceback= 402C3A18 404ED840 4029C954 4029C940
> 	> 00:01:40: %DIAG-SP-3-MAJOR: Module 5: Online Diagnostics detected a Major Error.
> 	>  Please use 'show diagnostic result <target>' to see test results.
> 	> 00:01:40: %CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 5: TestAclDeny failed
> 	> 00:01:41: %OIR-SP-6-INSCARD: Card inserted in slot 5, interfaces are now online
> 	> Reload scheduled for 07:05:31 PST Wed Apr 9 2008 (in 13 seconds)
> 	>
> 	> Module 5 is the supervisor. Afterwards it reloaded and didn't
> 	> do it again, also across several reboots. It's a Sup720-3B
> 	> with a single WS-X6708-10GE and a WS-SVC-FWM-1. It never
> 	> reaches starting GOLD for the DFC.
> 	>
> 	> I didn't have the time to do the "show diagnostic result"
> 	> before the reboot, and afterwards it says it never got a failure
> 	> on TestAclDeny:
> 	>
> 	> fw1#sh diagnostic res mod 5 test 18 det
> 	> Current bootup diagnostic level: minimal
> 	>   Test results: (. = Pass, F = Fail, U = Untested)
> 	> ___________________________________________________________________________
> 	>    18) TestAclDeny ---------------------> .
> 	>           Error code ------------------> 3 (DIAG_SKIPPED)
> 	>           Total run count -------------> 1
> 	>           Last test execution time ----> Apr 09 2008 07:08:26
> 	>           First test failure time -----> n/a
> 	>           Last test failure time ------> n/a
> 	>           Last test pass time ---------> Apr 09 2008 07:08:26
> 	>           Total failure count ---------> 0
> 	>           Consecutive failure count ---> 0
> 	> ___________________________________________________________________________
> 	> fw1#
> 	>
> 	> None of the other tests show any failures either: "show
> 	> diagnostic result module 5 detail | incl failure" gives only
> 	> "0" and "n/a" stats. I can also run "diagnostic start module 5
> 	> test 18" as often as I want without failures; I just get
> 	> "%DIAG-SP-6-TEST_OK: Module 5: TestAclDeny{ID=18} has
> 	> completed successfully".
> 	>
> 	> Is this something we should try and dig into, reporting it to
> 	> TAC? Or should we just ignore this ~5 min delay in a lab
> 	> reboot? We can't seem to reproduce it. :'(
> 	>
> 	> The box had just been "upgraded" to SXF13 AES shortly before
> 	> (from SXF6 AIS) due to some miscommunications, and this was the
> 	> first boot on SXF13 AIS, but I can't imagine this can have any
> 	> impact.
> 	>
> 	> Regards,
> 	> Peter
> 	>
> 	>
> 	> _______________________________________________
> 	> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> 	> https://puck.nether.net/mailman/listinfo/cisco-nsp
> 	> archive at http://puck.nether.net/pipermail/cisco-nsp/
> 	>
> 
> 
>
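
The manual check Peter describes ("show diagnostic result module 5 detail |
incl failure", looking for anything other than "0" and "n/a") can be sketched
as a small script over saved CLI output. This is only an illustration: it
assumes the output has already been captured as plain text in the layout
shown in the thread, and the names parse_fields and has_recorded_failure are
hypothetical helpers, not part of any Cisco tooling.

```python
import re

# Matches "Some field name -------> value" lines from the detail output.
FIELD_RE = re.compile(r"^\s*(?P<name>.+?)\s-+>\s*(?P<value>.*\S)\s*$")

# Sample text copied from the output quoted in this thread.
SAMPLE = """\
   18) TestAclDeny ---------------------> .
          Error code ------------------> 3 (DIAG_SKIPPED)
          Total run count -------------> 1
          Last test execution time ----> Apr 09 2008 07:08:26
          First test failure time -----> n/a
          Last test failure time ------> n/a
          Last test pass time ---------> Apr 09 2008 07:08:26
          Total failure count ---------> 0
          Consecutive failure count ---> 0
"""


def parse_fields(text: str) -> dict:
    """Collect 'name ---> value' pairs from one test's detail block."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("name")] = match.group("value")
    return fields


def has_recorded_failure(fields: dict) -> bool:
    """True if any failure counter is non-zero or any failure time is set."""
    for name, value in fields.items():
        key = name.lower()
        if key.endswith("failure count") and int(value) > 0:
            return True
        if key.endswith("failure time") and value != "n/a":
            return True
    return False


if __name__ == "__main__":
    print(has_recorded_failure(parse_fields(SAMPLE)))
```

As in the thread, this only reflects the counters kept since the last reload,
so a failure that triggered a reset will not show up afterwards.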