[j-nsp] 10.0 or 10.4?

Richard A Steenbergen ras at e-gerbil.net
Thu Mar 17 11:23:38 EDT 2011


On Tue, Mar 15, 2011 at 10:57:56AM -0700, Steve Feldman wrote:
> 
> What sorts of bugs did you see in 10.4R2?

We were only testing 10.4 on MX, since EX features are being much more 
actively developed, which makes major version jumps on EX much riskier. For 
example, when we tried moving from 10.1 to 10.2 on EX8200 we discovered 
a major SNMP issue, caused by some internal changes, that broke polling 
and caused severe billing issues. There are a lot of firewall-specific 
changes for EX in 10.4 which we wanted more time to test properly before 
deploying, since this is an area which tends to cause major outages when 
bugs do pop up. I already outlined a few of the issues we found with 
10.4R2 on MX in a previous post, and since it seemed much less stable 
than 10.3R3 for nearly equivalent features and bug fixes, I didn't see 
the point of pursuing it further right now.

> JTAC is recommending 10.4R2 on our EX8200s to fix a bug (PR581625 in 
> 10.1R4) where some of our firewall filter rules were being silently 
> ignored.

Well, here is what I can tell you about the reasons behind our most 
recent move to 10.3R3 on EX8200 (where these are all fixed). A couple of 
these were actually fixed in 10.2R3, but a few other nasty issues were 
introduced at the same time, making 10.2R3 unusable in production. We're 
heading towards 11.1 in the near future for other reasons anyway, which 
is why we targeted 10.3R3 instead of 10.2R4 as our next major code 
version, but 10.4R2 was definitely not confidence-inspiring based on the 
issues we saw on MX. :)

PR588115 - Changing the forwarding-table export policy twice in quick 
succession (while the previous change is still being evaluated) will 
cause rpd to core dump.

PR581139 - Similar to the above, but also causes the FPC to crash. Wait 
several minutes before committing again following a forwarding-table 
export policy change.

PR523493 - Mysterious FPC crashes

PR509303 - Massive SNMP slowness and stalls, severely impacting polling 
of 10.2R3 boxes with a sizable number of interfaces (the more interfaces, 
the worse the situation).

PR566782 PR566717 PR540577 - Some more mysterious rpd and pfem crashes, 
with extra checks added to prevent them in the future.

PR559679 - Commit script transient change issue, which sometimes causes 
changes to not be picked up correctly unless you do a "commit full".

PR548166 - Sometimes most or all BGP sessions on a CPU-loaded box will 
drop to Idle following a commit and take 30+ minutes to come back up.

PR554456 - Sometimes netconf connections to EX8200s will result in junk 
error messages being logged to the XML stream, corrupting the netconf 
session.

PR550902 - On a CPU-loaded box, BGP policy-statement evaluation will 
sometimes simply stop working, requiring a hard clear of the neighbor (or, 
ironically enough, sometimes just renaming the term in the policy will 
fix it :P) to restore normal evaluation.

PR521993 - Ports on EX8200 FPCs will sometimes not initialize correctly, 
resulting in situations where, for example, ports 4 and 5 on every FPC 
will be able to receive packets but never transmit them. If you continue 
to try to transmit packets down a wedged port (as would happen if the 
port is configured for L2), it will cause the FPC to crash.

There are also significant BGP convergence performance improvements 
introduced in 10.3R3 and 10.4R2 if you have a lot of routes with 
communities on them. For us, this reduced convergence times on EX8200 
with a few transit and iBGP route-reflector feeds (~2 million paths) 
from 1 hour+ to 15-20 minutes. Still not "good" by any means, but 
significantly "less bad".

Of course there are a few other issues introduced in 10.3R3 too 
(shocker, that NEVER happens :P), but I already discussed them 
previously. Ultimately I think 10.4 will be more interesting in its R3 
or R4 timeframe, or with service releases after that, but I don't think 
we're going to end up using it much on EX8200 (as I said, we're being 
forced into 11.1 to support specific hardware anyway).

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

