[j-nsp] M20 SSB slot 0 failures
Chris Cappuccio
chris at nmedia.net
Tue Jan 18 11:28:40 EST 2011
Jonas Frey (Probe Networks) [jf at probe-networks.de] wrote:
> Hi Chris,
>
> I haven't seen an error like this where the same SSB works fine in slot 1
> but not slot 0.
>
> But my guess is that slot 0 gives back the true status of the card and
> the test report from slot 1 is inaccurate.
>
> We have seen memory failures of SSB-E(-16) boards a couple of times
> while running in production. It appears the memory of the boards wears
> over time and then starts spitting out errors. This works for some time
> since it's ECC memory but all things come to an end.
> Just go and grab new memory and try again. It's easy to replace and
> replacement memory (though unofficial) is pretty cheap.
> See
> http://juniper.cluepon.net/Unofficial_hardware_upgrades
>
After another day of testing, I figured out quite a few things. Well, some of this I had already figured out, but...
One chassis had bent pins in slot 1, one had a genuine problem with slot 0 (it appears that the backplane has a fried component on the back side, which you can only see by taking apart the case), and the other two chassis just show this mysterious Address Test error.
So, down to the two chassis that actually appear to "work" properly, minus the Address Test error.
The error isn't related to RAM in any particular SSB-E. All permutations of RAM in any SSB-E produce this error.
In fact, I figured out that the "cluepon" upgrade page's memory test doesn't test SSB-E memory at all. It tests the memory in the /FPC/ slot specified.
The "slot" argument in the SCB/SSB monitor can be 0-7, which corresponds to the M40 FPC slots. The monitor code is the same for the SCB and SSB according to Juniper's own notes, which would explain this "option".
When an FPC is in FPC slot 0, you can test slot 0. When it's not there, the SSB monitor complains that it can't contact the NIC for slot 0. Same for slots 1, 2 and 3 on the M20. I'm convinced that the advice on cluepon is total horseshit. Chris 1, cluepon 0.
Focusing on the FPCs instead of the SSB-Es, I get some more interesting results. I started out with FPC + 3xP-1GE-SX in slot 0 and FPC-E + P-1GE-SX + P-AS in slot 1.
Here's where things get weird. If I remove the slot 1 FPC, slot 0 tests fine every time.
SSB0( uart)# diag bchip 0 sdram
[Waiting for completion, a:abort, p:pause]
B SDRAM (Slot 0) test
phase 1, pass 1, B SDRAM (Slot 0) test: Address Test
phase 2, pass 1, B SDRAM (Slot 0) test: Pattern Test
phase 3, pass 1, B SDRAM (Slot 0) test: Walking 0 Test
phase 4, pass 1, B SDRAM (Slot 0) test: Walking 1 Test
phase 5, pass 1, B SDRAM (Slot 0) test: Mem Clear Test
B SDRAM (Slot 0) test completed, 1 pass, 0 errors
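When repeating this test across many FPC/PIC combinations, it can help to pull the pass/error counts out of a captured console log instead of eyeballing each run. The following is a minimal sketch, assuming the summary line always has the shape shown in the transcript above. The function name `parse_sdram_summary` is made up for illustration, and the optional plural forms in the pattern are an assumption, since only one summary line appears here.

```python
import re

# Pattern modelled on the one summary line shown in the transcript.
# Accepting "passes" / "error" (plural/singular variants) is an assumption.
SUMMARY_RE = re.compile(
    r"^B SDRAM \(Slot (?P<slot>\d+)\) test completed, "
    r"(?P<passes>\d+) pass(?:es)?, (?P<errors>\d+) errors?$"
)


def parse_sdram_summary(transcript):
    """Return (slot, passes, errors) from the first summary line, or None."""
    for line in transcript.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            return (
                int(match["slot"]),
                int(match["passes"]),
                int(match["errors"]),
            )
    return None


# The last lines of the run above, as captured from the SSB console.
SAMPLE = """\
phase 5, pass 1, B SDRAM (Slot 0) test: Mem Clear Test
B SDRAM (Slot 0) test completed, 1 pass, 0 errors
"""

if __name__ == "__main__":
    slot, passes, errors = parse_sdram_summary(SAMPLE)
    print(f"slot {slot}: {passes} pass(es), {errors} error(s)")
```

A run counts as clean when the returned error count is 0; a log with no summary line (for example an aborted test) yields `None` rather than a guess.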
As soon as I insert slot 1 FPC, slot 0 fails every time.
I then tried various configurations of FPCs in slot 0 and slot 1: FPC-E in both slots, 0 FPC/1 FPC-E, 1 FPC-E/0 FPC. I tried swapping entire FPC boards, both of them, for alternate FPC/FPC-E boards.
Finally, I tried removing PICs from FPCs.
Result? Failure when two FPCs are installed. But, only when P-AS is installed in one of them.
As soon as I remove the P-AS PIC, the test succeeds every time. As soon as I add it back, the test fails every time.
I have two P-AS to choose from and both seem to cause FPCs to fail the Address Test from SSB-E slot 0.
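The elimination above can be written down as a small table of observations and checked against the working hypothesis (failure needs two FPCs installed and a P-AS PIC present). This is only a sketch: the dictionary keys and function names are invented for illustration, and the three rows summarize the results described in this message rather than any additional test data.

```python
from itertools import product

# Results reported above, reduced to the two factors that seem to matter.
OBSERVATIONS = [
    {"fpcs": 1, "p_as": False, "passed": True},   # slot 1 FPC (with P-AS) removed
    {"fpcs": 2, "p_as": True, "passed": False},   # two FPCs, P-AS installed
    {"fpcs": 2, "p_as": False, "passed": True},   # two FPCs, P-AS removed
]


def predicted_failure(obs):
    """Working hypothesis: the Address Test fails with two FPCs plus a P-AS."""
    return obs["fpcs"] >= 2 and obs["p_as"]


def hypothesis_fits(observations):
    """True if the hypothesis predicts every observed pass and failure."""
    return all(predicted_failure(o) == (not o["passed"]) for o in observations)


def untested_combinations(observations):
    """List (fpc_count, p_as_present) pairs that have no observation yet."""
    seen = {(o["fpcs"], o["p_as"]) for o in observations}
    return [c for c in product((1, 2), (False, True)) if c not in seen]


if __name__ == "__main__":
    print("hypothesis fits:", hypothesis_fits(OBSERVATIONS))
    print("not yet tested:", untested_combinations(OBSERVATIONS))
```

Laid out this way, the one combination not covered by the results above is a single FPC carrying the P-AS, which would show whether the second FPC is really part of the trigger.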
--
Let food be thy medicine and medicine be thy food - Hippocrates