[j-nsp] Cluster issue with SRX550

Sat May 24 12:33:14 EDT 2014

Hi All,

*Scenario*
We have a cluster of two SRX550s and an MX5-T router.
An aggregated link (LACP) connects node0 (the primary node) of the SRX550
cluster to the MX5-T router.
The aggregated link consists of two copper Ethernet ports.

The devices are in the same rack and directly connected (in case anyone
doubts the physical connectivity).

*Junos*
SRX550 Cluster 12.1X44-D30.4
MX5-T 11.4R7.5

*Problem*
Occasionally LACP goes down on the MX5-T router. When that happens, the
following logs are seen on the MX5-T:

 LACPD_TIMEOUT: ge-1/0/4: lacp current while timer expired current Receive State: CURRENT
 /kernel: KERN_LACP_INTF_STATE_CHANGE: lacp_update_state_userspace: new state is 0 cifd ge-1/0/4
 /kernel: KERN_LACP_INTF_STATE_CHANGE: lacp_update_state_userspace: new state is 0 cifd ge-1/0/5
 /kernel: ae_bundlestate_ifd_change: bundle ae1: bundle IFD minimum links not met 0 < 1

At the same time, the SRX cluster tries to fail over and the following logs
appear:

 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'secondary' to 'primary' state due to Remote node is in secondary hold
 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'primary' to 'secondary-hold' state due to Monitor failed: IF
 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'primary' to 'secondary-hold' state due to Monitor failed: IF
 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'secondary-hold' to 'secondary' state due to Back to back failover interval expired
 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'secondary' to 'primary' state due to Remote node is in secondary hold
 jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'secondary-hold' to 'secondary' state due to Back to back failover interval expired
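
To make the sequence easier to follow, here is a rough Python sketch that
groups the jsrpd messages above by redundancy group. It is only an
illustration: the regular expression is fitted to the six lines pasted in
this mail (with the mail client's line wrapping undone), and the function
name trace_transitions is my own, not anything from Junos.

```python
import re
from collections import defaultdict

# The six jsrpd lines from this mail, with the line wrapping undone.
LOGS = """\
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'secondary' to 'primary' state due to Remote node is in secondary hold
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'primary' to 'secondary-hold' state due to Monitor failed: IF
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'primary' to 'secondary-hold' state due to Monitor failed: IF
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'secondary-hold' to 'secondary' state due to Back to back failover interval expired
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 1 transitioned from 'secondary' to 'primary' state due to Remote node is in secondary hold
jsrpd[1393]: JSRPD_RG_STATE_CHANGE: Redundancy-group 2 transitioned from 'secondary-hold' to 'secondary' state due to Back to back failover interval expired
"""

# Pattern fitted to the message format seen above (an assumption, not a spec).
PATTERN = re.compile(
    r"Redundancy-group (\d+) transitioned from '([\w-]+)' to '([\w-]+)' "
    r"state due to (.+)"
)


def trace_transitions(text):
    """Return {rg_number: [(old_state, new_state, reason), ...]} in log order."""
    history = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            rg, old, new, reason = match.groups()
            history[int(rg)].append((old, new, reason.strip()))
    return dict(history)


if __name__ == "__main__":
    for rg, steps in sorted(trace_transitions(LOGS).items()):
        print(f"RG{rg}:")
        for old, new, reason in steps:
            print(f"  {old} -> {new} ({reason})")
```

Read this way, on the node that logged these messages RG1 goes
primary -> secondary-hold -> secondary -> primary, while RG2 goes
secondary -> primary -> secondary-hold -> secondary.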

A few minutes later, node0 comes back as primary and service is restored.

In addition, the following logs keep appearing on the SRX550:

 /kernel: Process with Most Children- 1:init - Children - 211
 /kernel: maxproc limit exceeded by uid 0, please see tuning(7) and login.conf(5).
 /kernel: nearing maxproc limit by uid 0, please see tuning(7) and login.conf(5).
 /kernel: Process with Most Children- 1:init - Children - 211
 /kernel: maxproc limit exceeded by uid 0, please see tuning(7) and login.conf(5).
 /kernel: Process with Most Children- 1:init - Children - 211
 /kernel: maxproc limit exceeded by uid 0, please see tuning(7) and login.conf(5).
 /kernel: Process with Most Children- 1:init - Children - 211
 /kernel: maxproc limit exceeded by uid 0, please see tuning(7) and login.conf(5).
 /kernel: nearing maxproc limit by uid 0, please see tuning(7) and login.conf(5).
 /kernel: Process with Most Children- 1:init - Children - 211

We have been advised to change the following setting from

set interfaces reth1 redundant-ether-options lacp periodic fast

to

set interfaces reth1 redundant-ether-options lacp periodic slow
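
For reference, this is my understanding of what the change means for LACP
timing, as a small Python sketch. The values are the standard timers from
IEEE 802.1AX (formerly 802.3ad), not numbers read from our devices, so it
assumes both boxes follow the standard. The function name lacp_timing is
just for illustration.

```python
# Standard LACP timers from IEEE 802.1AX (assumed, not read from our devices).
FAST_PERIODIC_S = 1     # "periodic fast": one LACPDU per second
SLOW_PERIODIC_S = 30    # "periodic slow": one LACPDU every 30 seconds
TIMEOUT_MULTIPLIER = 3  # receiver times out after 3 missed intervals


def lacp_timing(mode):
    """Return (pdu_interval_s, timeout_s) for 'fast' or 'slow' periodic mode."""
    intervals = {"fast": FAST_PERIODIC_S, "slow": SLOW_PERIODIC_S}
    if mode not in intervals:
        raise ValueError(f"unknown LACP periodic mode: {mode!r}")
    interval = intervals[mode]
    return interval, interval * TIMEOUT_MULTIPLIER


if __name__ == "__main__":
    for mode in ("fast", "slow"):
        interval, timeout = lacp_timing(mode)
        print(f"{mode}: LACPDU every {interval} s, timeout after {timeout} s")
```

If that is right, with "fast" a gap of about 3 seconds without LACPDUs is
enough for the receiving side to expire its timer, whereas with "slow" the
link would tolerate about 90 seconds, at the cost of slower detection of a
real failure.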

If anyone has had a similar experience, we would appreciate your help.

Regards,

*Ali Sumsam, CCIE* - eintellego Networks Pty Ltd
Senior Network Engineer
ali at eintellegonetworks.com ; www.eintellegonetworks.com
