[j-nsp] EX4200 VC PFE crashes

Mon Jan 14 14:55:36 EST 2013

Hello list,

We are experiencing PFE crashes (core dumps) on one of our EX4200 VCs.

It appears we've hit an unknown bug and we have been working with JTAC and Juniper engineering to find the root cause of the issue, but so far without any luck. (There is no public PR for this issue yet.)

I was hoping our issue would look familiar to someone on this list. We are getting rather desperate, since we have no workaround or solution other than switching to a different vendor.

So here is our setup:

- 6 VC members
- about 60 virtualization servers connected, each hosting about 60 VMs and each connected with a 2x1GE LAG to the VC
- Each VM has two interfaces in two different VLANs (public and private network)
- These VLANs are big broadcast domains, shared by all virtualization servers and VMs within this VC
- We provide both v4 and v6 connectivity on this VC

So that means thousands of MAC, ARP and v6 neighbour entries in the PFE database (but nowhere near the supported limit of 16k entries).
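As a rough sanity check on that estimate, the figures above can be multiplied out. This is only a sketch: it assumes one MAC address per VM interface, takes "16k" as 16,000, and ignores the servers' own addresses as well as ARP and v6 neighbour entries, whose counts are not given here. The variable and function names are illustrative.

```python
# Back-of-the-envelope count of VM-facing MAC entries on the VC.
# All inputs are the approximate figures from the setup list above.
SERVERS = 60           # "about 60 virtualization servers"
VMS_PER_SERVER = 60    # "about 60 VMs" per server
IFACES_PER_VM = 2      # one interface in the public VLAN, one in the private VLAN
TABLE_LIMIT = 16_000   # "16k entries", taken as 16,000 (assumption)


def estimate_vm_macs(servers: int, vms_per_server: int, ifaces_per_vm: int) -> int:
    """Estimate MAC entries, assuming exactly one MAC per VM interface."""
    return servers * vms_per_server * ifaces_per_vm


vm_macs = estimate_vm_macs(SERVERS, VMS_PER_SERVER, IFACES_PER_VM)
utilisation = vm_macs / TABLE_LIMIT

print(f"~{vm_macs} VM MAC entries, {utilisation:.0%} of {TABLE_LIMIT}")
```

With these inputs the VM interfaces alone come to roughly 7,200 MAC entries, which is consistent with "thousands" and below the quoted limit.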

We use OSPF + OSPFv3 to distribute routes, but we've got fairly small tables (about 30 v4 and 17 v6 routes). Other than that the configuration of this VC is fairly trivial.

So the trouble started about a month ago. We were still running 10.4R9.5 back then. Suddenly the PFE daemons of two separate VCs started core dumping for no apparent reason, about every two hours. No configuration changes had been made whatsoever.

JTAC analysed our core dumps and told us this was a known issue (null pointer exception). It would not be fixed in 10.4 but was fixed in 11.2R5.5, the JTAC-recommended release for EX4200.

So we planned emergency maintenance and upgraded to the recommended release. About a week later the issue returned, this time on just one VC, and it has happened roughly once a week since then. Our other VCs seem to be stable so far.

Long story short: JTAC ran out of ideas, created an internal PR and forwarded our case to their engineering team, which is still looking for a root cause as I write this.

In the meantime we are still experiencing this issue and our customers are becoming a bit impatient (and rightfully so). We need to work out a plan B in case Juniper can't find the root cause and provide a fix.

We could upgrade to an even newer release, but we don't have the impression this would solve our issue at all. It could even make matters worse (no way to tell in advance).

We would appreciate it if anyone could share any information about similar issues and workarounds or solutions. Thanks in advance!

Regards,

--
Dennis Krul 
Tilaa