[j-nsp] MX480 troubles.

Jared Mauch jared at puck.nether.net
Wed Apr 13 13:51:56 EDT 2011


On Apr 13, 2011, at 1:27 PM, Chris Evans wrote:

> Question to you all...
> 
> It seems like a lot of folks run bleeding-edge code, with some of these
> major bugs popping up. I also get the impression that a lot of shops don't
> test code before they deploy.
> 
> I'm just curious how this works for you. In my company we would get
> seriously reprimanded for deploying untested software, and any time we have
> outages we have to jump through big hoops to understand why, how to fix it,
> etc., so we do the best we can to deploy architectures/platforms/code that
> won't have issues.
> 
> I couldn't imagine being bleeding edge in a service provider environment;
> it's just a concept I can't fathom, being in the environment I'm in.
> 
> Looking for input...

This is something that requires a delicate balance.  One can spend millions of dollars testing every possible thing, but it's not practical to do that for each and every release.  The hardware required to replicate these environments gets quite expensive, as does the test gear and everything else necessary to "pull it off".

I've always believed in a "system test" versus a "unit test" (UUT, or unit under test, is what you'll see the vendors call it).  You need to prove out the entire system, rather than the vacuum the individual units normally operate in.  There are some bugs you just won't see unless, say, a link flaps in a 4x bundle while one side is doing LACP fast mode.
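As a rough illustration of that kind of system-level exercise, here is a minimal Python sketch that repeatedly flaps one member of a 4x bundle and checks whether LACP settles back into a distributing state.  Everything here is hypothetical: the hostname and interface names are made up, and chaining Junos configuration commands through a bare "ssh host command" is a simplification; real automation would drive this through NETCONF or an expect-style login script.

# Sketch only: hostname and interface names are hypothetical, and driving
# Junos configuration over a bare ssh exec is a simplification.
import subprocess
import time

DUT = "lab-router-a"      # hypothetical device under test
MEMBER = "xe-0/0/1"       # hypothetical member link of the ae0 bundle

def run_on(host, command):
    """Execute one remote CLI command and return its output."""
    result = subprocess.run(["ssh", host, command],
                            capture_output=True, text=True, check=True)
    return result.stdout

for i in range(50):
    # Administratively flap the member link, then give LACP fast mode
    # (1-second PDUs) a few intervals to converge.
    run_on(DUT, "configure; set interfaces %s disable; commit and-quit" % MEMBER)
    time.sleep(2)
    run_on(DUT, "configure; delete interfaces %s disable; commit and-quit" % MEMBER)
    time.sleep(5)
    lacp = run_on(DUT, "show lacp interfaces ae0")
    if "Collecting distributing" not in lacp:
        print("iteration %d: %s never returned to distributing" % (i, MEMBER))
        break

The point isn't this particular script; it's that the failure only shows up when the whole system (bundle, LACP timers, both ends) is exercised together.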

Testing all these cases can be problematic, if not impossible.  What you want is a good baseline that simulates your real network as closely as possible; then you can work with the new code.  Simulate many iterations of what happens in production (e.g., if rancid logs in once an hour, have it log in over and over in a loop).  We found one bug that existed on a single device because rancid logged in each hour from a host with a specific latency and IP stack.  It impacted only *one* device, but it caused a kernel core.
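A loop like the following compresses a month of hourly rancid logins into an afternoon.  This is a sketch: the hostname is made up, jlogin is rancid's Juniper login script, and its exact flags may vary between rancid versions.

# Sketch: turn "rancid logs in once an hour" into a tight loop so a rare
# session-handling bug surfaces in hours instead of months.
import subprocess
import time

HOST = "lab-router-a"   # hypothetical test device

for attempt in range(1000):
    # Run the same sort of command rancid would issue on its hourly pass.
    result = subprocess.run(["jlogin", "-c", "show version", HOST],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("login attempt %d failed:" % attempt)
        print(result.stderr)
        break
    time.sleep(1)   # brief pause; shrink or remove to push the box harder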

This is not something you can easily simulate on top of all the other features you want to test.  Many of the things one can do start to create mutually exclusive test environments.  Imagine being a tester who has to test "BGP".  What does that mean?  All route reflectors, or a full mesh?  What size?  How many clusters?  What about BGP confederations?  2-byte or 4-byte ASNs and as-paths?  IPv4 and IPv6 NLRI over the same transport session, or an IPv4 session for those routes and a native IPv6 session for the IPv6 NLRI?

I'm sure you can start to see how these create a mutually exclusive set of choices, just within the scope of BGP, even before you get to the routing policy you want to test on top of that.
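To put a rough number on it, here is a small Python sketch that enumerates just a handful of those BGP dimensions and counts the distinct environments.  The dimension names and values are illustrative only, not an authoritative test matrix.

from itertools import product

# Illustrative dimensions only -- not an exhaustive or authoritative matrix.
dimensions = {
    "topology":    ["full-mesh", "route-reflectors", "confederations"],
    "asn-width":   ["2-byte", "4-byte"],
    "ipv4-nlri":   ["own-ipv4-session", "shared-session"],
    "ipv6-nlri":   ["own-ipv6-session", "shared-session"],
    "rr-clusters": ["one", "two", "many"],
}

count = 0
for values in product(*dimensions.values()):
    combo = dict(zip(dimensions, values))
    # Cluster count only means anything with route reflectors: some choices
    # are mutually exclusive rather than freely combinable.
    if combo["topology"] != "route-reflectors" and combo["rr-clusters"] != "one":
        continue
    count += 1

print(count, "distinct BGP environments before any routing policy is layered on")

Even this toy version yields dozens of distinct environments, and every real dimension you add multiplies that again.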

The best test I've found is dogfooding: put your office LAN behind the test router, or at least your lab.  That way, if it breaks, you will be compelled to research it.  Some people will argue against this because their "business critical" stuff can't be impacted by the testing, but your customers could hit the same problem.  (Also, try to load up your test router with as many of the real-world variations you run as can co-exist, be it l2vpn, IPv6, MPLS, RSVP, PIM, etc.  Don't deviate because you think something isn't related, unless you're deliberately in "isolation" mode.)

I would always load the code on a test device, then on a device that *my* connection was on.

I certainly don't want an outage any more than any customer does, so be understanding when they do happen.  We can push the vendors for fixes, but sometimes only so hard, and some cases, even ones we hit often, are very difficult to reproduce.  The developers have also become insulated from the "real world" in many cases, hard to reach through a TAC or otherwise.  That insulation is there to protect them, but it also makes things difficult, since we don't open cases just to "cry wolf" either.

Either way, testing needs to be a true partnership between you and your vendor.  Don't take hardware from them if you are not going to participate.  Don't yell and scream when test code is broken; try instead to understand and improve the process.  Sometimes yelling is necessary, but I've found it's rarely productive.  If the vendor refuses to understand the severity of your environment, escalate.  The head of JTAC is a good guy; he's trying to do the right thing.  The same is true for Cisco TAC, and when you talk to the managers involved, they have always tried to help.  Explain your constraints and why their solutions are unacceptable, but be reasonable at the same time.

I do wish Cisco and Juniper had a more open beta/feedback process, like some other vendors (e.g., UBNT).  Registering for UBNT's program is easy, and while they deliver far more "stillborn" code than Cisco or Juniper ever have, they are highly responsive to reports and engage with testers.

Hopefully this makes sense and helps you understand why it's hard on both sides.

- Jared
