[j-nsp] 10.0 or 10.4?

Richard A Steenbergen ras at e-gerbil.net
Tue Mar 22 14:09:16 EDT 2011


On Tue, Mar 22, 2011 at 05:18:47PM +0100, bas wrote:
> Hi All,
> 
> Well, after this thread I still didn't know which version I should
> choose for our 960 with MPC's only.
> From what I read, it was: in the field (Ras, Raphael) we see 10.3R3 as
> the better choice, and people who talk to JTAC say 10.4R2 is the
> better choice.
> 
> (Of course it depends on configuration and config.)
> 
> But we chose to upgrade to 10.3r3, and installed the version this morning.
> The upgrade seemed to have gone smoothly, but after all BGP sessions had
> been re-established and prefixes re-learnt, the CPU stayed at 100%.
> 
> Dropping to shell I saw rpd consuming 99% CPU.
> Looking at task accounting and rtsockmon I saw no obvious causes.
> A failover to the backup RE had no effect; the new master RE hit
> 100% within a couple of minutes.
> 
> A colleague of mine traced the process and saw that the cycles are
> being consumed by "getrusage" system calls.
> 
> Tomorrow morning we'll try to restart routing, if that has no effect
> we will try 10.4r2.

Haven't seen that here. Both 10.3R3 and 10.4R2 picked up a very similar 
set of fixes (they have very close release dates) for major PRs which 
were giving us grief, but 10.3R3 seemed to introduce fewer new ones in 
the process. 10.4R3 just came out yesterday, so feel free to give it a 
shot and report back, but most of our 10.3R3 issues have been relatively 
"less bad" than normal.

Occasional logging of these junk messages:

Mar 22 11:47:12.629  re1.xx1.xxx1 /kernel: %KERN-3: Unlist request: unilist(nh index = 1049538) found on the rnhlist_deleted_root patnode, hence returning

Logging of these junk messages (note the incorrect timestamps too, these 
were actually logged AFTER the Mar 22 output above :P):

Mar  4 10:29:50.527  re1.cr1.xxx1 rpd[89570]: %DAEMON-4: , hold timer 1:07.860442
Mar  4 10:47:22.577  re1.cr1.xxx1 rpd[89570]: %DAEMON-4: , hold timer 1:22.570405
Mar  4 10:55:44.479  re1.cr1.xxx1 rpd[89570]: %DAEMON-4: , hold timer 1:21.765648
Mar  4 10:59:59.056  re1.cr1.xxx1 rpd[89570]: %DAEMON-4: , hold timer 1:19.371199
Mar  4 11:02:04.228  re1.cr1.xxx1 rpd[89570]: %DAEMON-4: , hold timer 1:18.704278

Weird stuff like that. Probably the worst issue we've seen so far is on 
MX with MPC cards (not sure if it's the software or the hardware): if you 
build an l2circuit to a vlan-ccc endpoint and configure vlan-maps to 
push/pop the VLAN tag before encapsulating the packet in the pseudowire 
(so you can mismatch VLAN IDs on the endpoints), ISIS packets don't seem 
to get transported. My guess is that it's blocking 802.3 LLC frames in 
general.

On EX there is a definite problem with the power supplies getting stuck 
in 110V "low power" mode as far as the chassis is concerned, which is an 
issue if you need 220V power to fully and redundantly power your FPCs. 
But at least this code hasn't crashed or blown up massively yet (or 
failed to do some really important operation, like, say, correctly counting 
packets in SNMP so you can bill your customers), which is definitely an 
improvement over a lot of other recent JUNOS code. :)

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

