[j-nsp] SRX650 Failover Test Issue
Walaa Abdel razzak
walaaez at bmc.com.sa
Sat Mar 26 08:18:52 EDT 2011
Hi All
Kindly find below the testing scenario for control link failover:
1. Both firewalls were up and reachable through OoB (tested from an IP in the management subnet).
2. Configured the system backup router command “backup-router 11.11.11.254 destination 11.11.11.0/24;”.
3. Removed the control link between the two firewalls.
4. Got the failure message; the secondary firewall was still reachable through OoB. Checking the ARP entries:
{disabled:node1}
Juniper123@FW2> show arp expiration-time
MAC Address       Address         Name            Interface  Flags      TTE
5c:26:0a:2b:6e:a2 11.11.11.254    11.11.11.254    fxp0.0     none       1062  <-- My IP in the same LAN as the SRX.
28:c0:da:8f:8e:30 30.17.0.2       30.17.0.2       fab0.0     permanent
28:c0:da:8f:97:30 30.18.0.1       30.18.0.1       fab1.0     permanent
00:22:83:14:0b:f0 172.16.0.1      172.16.0.1      reth0.0    none       273
00:22:83:14:0b:f1 192.168.0.1     192.168.0.1     reth1.0    none       613
5. Got the following message:
FW2 FW2 PFEMAN: Shutting down , Master routing engine did not recover; forwarding stopped
Message from syslogd@FW2 at Mar 26 23:04:01 ...
FW2 FW2 CMLC: Master RE did not recover, forwarding stopped
Message from syslogd@FW2 at Mar 26 23:04:01 ...
FW2 FW2 CMLC: committing suicide , Shutting down due to loss of communicationwith master RE
6. After the ARP timer expired, the entry for the source IP (my IP) disappeared and I couldn’t reach the firewall except by connecting to the console and rebooting it.
{disabled:node1}
Juniper123@FW2> show arp expiration-time
MAC Address       Address         Name            Interface  Flags      TTE
28:c0:da:8f:8e:30 30.17.0.2       30.17.0.2       fab0.0     permanent
28:c0:da:8f:97:30 30.18.0.1       30.18.0.1       fab1.0     permanent
00:22:83:14:0b:f0 172.16.0.1      172.16.0.1      reth0.0    none       1
00:22:83:14:0b:f1 192.168.0.1     192.168.0.1     reth1.0    none       149
Total entries: 4
Conclusion: After removing the cluster control link, I can’t reach the secondary firewall even from an IP directly connected to the management subnet, and no ARP entry is created for my IP. I have to connect to the console and reboot the box. This was done with the system backup-router command configured.
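For reference, the backup-router statement from step 2 is usually applied per node together with each node's fxp0 address. The following is only a sketch of how that might look: the host name FW1 and the fxp0 addresses 11.11.11.1 and 11.11.11.2 are assumptions for illustration and do not come from the tested configuration.

```
groups {
    node0 {
        system {
            host-name FW1;                      /* assumed name for node0 */
            backup-router 11.11.11.254 destination 11.11.11.0/24;
        }
        interfaces {
            fxp0 {
                unit 0 {
                    family inet {
                        address 11.11.11.1/24;  /* assumed fxp0 address */
                    }
                }
            }
        }
    }
    node1 {
        system {
            host-name FW2;
            backup-router 11.11.11.254 destination 11.11.11.0/24;
        }
        interfaces {
            fxp0 {
                unit 0 {
                    family inet {
                        address 11.11.11.2/24;  /* assumed fxp0 address */
                    }
                }
            }
        }
    }
}
apply-groups "${node}";
```

Note that the destination here is the management subnet itself, as in the test. If the management station sat in a different subnet, the destination prefix would presumably need to cover that subnet instead.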
Any suggestions?
BR,
From: Pavel Lunin [mailto:plunin at senetsy.ru]
Sent: Wednesday, March 23, 2011 8:05 PM
To: Chen Jiang
Cc: Walaa Abdel razzak; Michael Lee; juniper-nsp
Subject: Re: [j-nsp] SRX650 Failover Test Issue
2011/3/23 Chen Jiang <ilovebgp4 at gmail.com>
This behavior is by design. When the control link or fabric link is disconnected, the current RG0 master node remains master, but the current RG0 backup node disables itself to avoid a split-brain condition. "Disabled" means the node takes all SPCs/NPCs and line cards offline, and only a reboot of the whole chassis can recover the node.
Right, but the question is slightly different: whether it's possible to reboot it without having access to its console.
I don't have a lab ready for testing right now but AFAIR fxp0 is still active for a disabled node even on branch (all the more so for high-end, since it's directly on RE there). Moreover "request routing-engine login node X" should also be available.
By now I don't really remember the details, but there is absolutely no doubt that it's possible to access a disabled node over the network, not only via the console. About half a year ago we ran into a bug in 10.0R2 causing loss of heartbeats on the control link and consequent regular failovers and node disabling. No doubt it was possible to reboot the disabled node without having access to the console.
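As a rough sketch of what that recovery could look like, assuming the disabled node still answers on fxp0: the address 11.11.11.2 and the management-host prompt below are hypothetical, and command output is omitted because none of this was captured from a lab.

```
user@management-host$ ssh Juniper123@11.11.11.2

{disabled:node1}
Juniper123@FW2> show chassis cluster status

{disabled:node1}
Juniper123@FW2> request system reboot
```

The other path mentioned above would be to run "request routing-engine login node 1" from the primary node and issue the reboot from there. Whether that works with the control link physically removed is not confirmed here.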