[j-nsp] strange problem on chassis cluster

Sat Sep 4 06:50:57 EDT 2010

Hi!

We have a very strange problem on two chassis clusters running 10.0R3.10
(we will try updating to R4.7 today).

One chassis cluster (2x J6350) is our main system.
The other (2x J4350) is located at our customer's site.

The two clusters speak BGP with each other. For the customer
system, this is the only BGP session. Our main system has a full BGP
mesh to our other locations and edge systems. To understand the
problem, this can be reduced to three BGP sessions:

A) BGP session to AMS-IX over VLAN 1
B) BGP session to ECIX over VLAN 1
C) BGP session to ECIX over VLAN 2

Two switches are involved. VLAN 1 is configured on both switches to make
it available in Amsterdam and Düsseldorf. VLAN 2 is configured only on
the switch facing Düsseldorf, to provide a backup in case the first
switch fails.

The day before yesterday, I started two pings to the ECIX router: one
from my local workstation, the other from the main cluster.

If I configure something on the redundant interfaces, then as soon as I
commit, the first ping (from my workstation) stays normal, while the
second (from the main cluster) jumps to +30 ms (normally around 6 ms).
2-3 minutes later, both pings stop and the BGP session drops. This is
the only BGP session that drops, and it does so because of hold time
expiration. After a few minutes, the pings and the BGP session come
back. Every other BGP session, even the one to Düsseldorf over VLAN 2,
stays up.
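
To line up the ping behaviour with the commit time, the saved ping
output can be classified per sequence number. This is only a minimal
sketch: it assumes iputils-style reply lines ("icmp_seq=N ... time=X ms"),
and the function names and the 20 ms threshold are hypothetical choices,
picked to sit between the normal ~6 ms and the elevated values.

```python
import re

# Assumed reply format: "... icmp_seq=12 ttl=64 time=6.1 ms"
RTT_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def parse_ping(lines):
    """Return {sequence_number: rtt_ms} for every reply line found."""
    replies = {}
    for line in lines:
        match = RTT_RE.search(line)
        if match:
            replies[int(match.group(1))] = float(match.group(2))
    return replies


def classify(replies, first_seq, last_seq, elevated_ms=20.0):
    """Label each sequence number as 'ok', 'elevated', or 'lost'.

    A sequence number with no reply is counted as lost; the
    elevated_ms threshold is an assumption, not a measured value.
    """
    states = []
    for seq in range(first_seq, last_seq + 1):
        rtt = replies.get(seq)
        if rtt is None:
            states.append((seq, "lost"))
        elif rtt >= elevated_ms:
            states.append((seq, "elevated"))
        else:
            states.append((seq, "ok"))
    return states
```

Running both ping logs through this and comparing where "elevated" and
"lost" begin relative to the commit would show whether the delay between
commit and outage is really constant.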

I then switched the main load to Düsseldorf over to VLAN 2. This time,
that BGP session (over VLAN 2) was dropped, while the other stayed up.
The session to Düsseldorf carries the main load, with around 260,000
prefixes.

Matthias