[j-nsp] SRX650 Failover Test Issue

Sat Mar 26 08:18:52 EDT 2011

Hi All

Kindly find below the testing scenario for control link failover:

1.       Both firewalls were up and reachable through OoB (tested from an IP in the management subnet).

2.       Configured the system backup router command “backup-router 11.11.11.254 destination 11.11.11.0/24;”.

3.       Removed the control link between the two firewalls.

4.       Got the failure message, and the secondary firewall was still reachable through OoB. Checking the ARP entry:

{disabled:node1}

Juniper123 at FW2> show arp expiration-time    

MAC Address       Address         Name                      Interface           Flags    TTE

5c:26:0a:2b:6e:a2 11.11.11.254    11.11.11.254              fxp0.0              none  1062  <-- My IP, in the same LAN as the SRX.

28:c0:da:8f:8e:30 30.17.0.2       30.17.0.2                 fab0.0              permanent

28:c0:da:8f:97:30 30.18.0.1       30.18.0.1                 fab1.0              permanent

00:22:83:14:0b:f0 172.16.0.1      172.16.0.1                reth0.0             none  273

00:22:83:14:0b:f1 192.168.0.1     192.168.0.1               reth1.0             none  613

5.       Got the following message:

FW2 FW2 PFEMAN: Shutting down , Master routing engine did not recover; forwarding stopped  

Message from syslogd at FW2 at Mar 26 23:04:01  ...

FW2 FW2 CMLC: Master RE did not recover, forwarding stopped 

Message from syslogd at FW2 at Mar 26 23:04:01  ...

FW2 FW2 CMLC: committing suicide , Shutting down due to loss of communicationwith master RE

6.       After the ARP timer expired, the entry for the source IP (my IP) disappeared, and I could no longer reach the firewall except through the console, where I had to reboot it.

{disabled:node1}

Juniper123 at FW2> show arp expiration-time 

MAC Address       Address         Name                      Interface           Flags    TTE

28:c0:da:8f:8e:30 30.17.0.2       30.17.0.2                 fab0.0              permanent

28:c0:da:8f:97:30 30.18.0.1       30.18.0.1                 fab1.0              permanent

00:22:83:14:0b:f0 172.16.0.1      172.16.0.1                reth0.0             none    1

00:22:83:14:0b:f1 192.168.0.1     192.168.0.1               reth1.0             none  149

Total entries: 4

Conclusion: After the clustering control link is removed, I can’t reach the secondary firewall even from an IP directly connected to the management subnet, and no ARP entry is created for my IP. I have to console into the box and reboot it. This was tested with the system backup-router statement configured.

Any suggestions?

BR,

From: Pavel Lunin [mailto:plunin at senetsy.ru] 
Sent: Wednesday, March 23, 2011 8:05 PM
To: Chen Jiang
Cc: Walaa Abdel razzak; Michael Lee; juniper-nsp
Subject: Re: [j-nsp] SRX650 Failover Test Issue

2011/3/23 Chen Jiang <ilovebgp4 at gmail.com>

This behavior is by design. When the control link or fabric link is disconnected, the current RG0 master node remains master, but the current RG0 backup node disables itself to avoid a split-brain condition. "Disabled" means the node takes all SPCs, NPCs, and line cards offline. Only a reboot of the whole chassis can recover the node.

Right, but the question is slightly different: whether it's possible to reboot it without access to its console.

I don't have a lab ready for testing right now, but AFAIR fxp0 is still active on a disabled node, even on branch devices (all the more so on high-end, since it sits directly on the RE there). Moreover, "request routing-engine login node X" should also be available.

I don't remember the details by now, but there is absolutely no doubt that it's possible to access a disabled node over the network, not only on the console. About half a year ago we ran into a bug in 10.0R2 that caused lost heartbeats on the control link, with regular failovers and node disabling as a result. No doubt it was possible to reboot the disabled node without access to the console.
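
For reference, the remote recovery described above could look roughly like the sketch below. It assumes node1 (FW2) is the disabled node, that its fxp0 address still answers SSH, and that the primary node's hostname is FW1. The <node1-fxp0-address> placeholder and the FW1 prompt are illustrative only, and none of this was verified in this thread.

Over the management network (fxp0):

    ssh Juniper123@<node1-fxp0-address>

    {disabled:node1}
    Juniper123 at FW2> request system reboot

From the primary node, using the login command mentioned above:

    {primary:node0}
    Juniper123 at FW1> request routing-engine login node 1

    {disabled:node1}
    Juniper123 at FW2> request system reboot

The second variant presumably needs a working path between the nodes, so it may not apply while the control link is physically removed.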