[j-nsp] SSB/RE Problem

Erdem Sener erdems at gmail.com
Thu Jun 15 17:11:39 EDT 2006


Hi,

 Have you checked the logs on both REs? Any clues as to why
communication between the backup and master REs is lost? If both REs
are fine but the keepalives still fail to get through, you might want
to try disabling GRES and opening a case with JTAC.
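
 When comparing the logs from both REs, it can help to reduce
/var/log/mastership to the actual state changes and the keepalive
failures. Below is a rough Python sketch of that idea, not a Juniper
tool. The function name summarize and both regular expressions are
assumptions derived only from the line formats quoted in this thread,
and the reason codes are passed through as-is without interpretation.

```python
import re
from collections import Counter

# Matches state-machine lines such as:
#   "Mar 16 14:30:20 Currentstate backup NextState master reason_code 2"
# The format is assumed from the log excerpt in this thread.
TRANSITION = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"Currentstate\s+(?P<cur>\w+)\s+NextState\s+(?P<nxt>\w+)\s+"
    r"reason_code\s+(?P<reason>\d+)"
)

# Matches "failed to send RE info/keepalive ..." and
# "failed to receive keepalives ..." lines.
KEEPALIVE = re.compile(r"failed to (send|receive) .*keepalive")


def summarize(lines):
    """Return (transitions, keepalive_failures) for mastership log lines.

    transitions: list of (timestamp, old_state, new_state, reason_code),
                 only for lines where the state actually changes.
    keepalive_failures: Counter keyed by "send" or "receive".
    """
    transitions = []
    keepalive_failures = Counter()
    for raw in lines:
        # Drop e-mail quote markers ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        m = TRANSITION.match(line)
        if m:
            if m.group("cur") != m.group("nxt"):
                transitions.append(
                    (m.group("ts"), m.group("cur"), m.group("nxt"),
                     int(m.group("reason")))
                )
            continue
        k = KEEPALIVE.search(line)
        if k:
            keepalive_failures[k.group(1)] += 1
    return transitions, keepalive_failures


if __name__ == "__main__":
    import sys

    # Usage: python mastership_summary.py < mastership.log
    changes, failures = summarize(sys.stdin)
    for ts, old, new, reason in changes:
        print(f"{ts}: {old} -> {new} (reason_code {reason})")
    for direction, count in failures.items():
        print(f"keepalive {direction} failures: {count}")
```

 Running this against the mastership file from each RE and comparing
the timestamps of the state changes should show which side stopped
hearing the other first.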

 HTH,
 Erdem



On 6/15/06, Elian Scrosoppi <escrosoppi at ifxcorp.com> wrote:
> Hi guys,
>
> Yesterday we had some problems with the SSB/RE of our M20 router. I have extracted the following logs and information. Can anyone help me determine the problem?
>
> --
> Model: m20
> 2 RE
> 2 SSB
> JUNOS Base OS boot [7.0R2.7]
> JUNOS Base OS Software Suite [7.0R2.7]
> JUNOS Kernel Software Suite [7.0R2.7]
> JUNOS Packet Forwarding Engine Support (M20/M40) [7.0R2.7]
> JUNOS Routing Software Suite [7.0R2.7]
> JUNOS Online Documentation [7.0R2.7]
> JUNOS Crypto Software Suite [7.0R2.7]
> --
>
> content of /var/log/mastership :
>
> Mar 16 14:30:17 event = E_NO_IPC, state = backup, param = 0x0x0
> Mar 16 14:30:17 No response from the other routing engine for the last 2 seconds.
>
> Mar 16 14:30:17 Currentstate backup NextState backup reason_code 0
> Mar 16 14:30:17 new state = backup
> Mar 16 14:30:17 Keepalive timeout of 2 seconds expired.  Assuming RE mastership.
>
> Mar 16 14:30:17 event = E_CMD_F, state = backup, param = 0x0x0
> Mar 16 14:30:20 The local RE becomes the master, retry = 0.
> Mar 16 14:30:20 Currentstate backup NextState master reason_code 2
> Mar 16 14:30:20 timestamp: Thu Mar 16 14:30:20 2006
> Mar 16 14:30:20 new state = master
>
> (a lot of these)
> Mar 16 14:30:26 failed to send RE info/keepalive: errno=0, total=2 in the last 20 sec
> Mar 16 14:30:26 failed to send RE info/keepalive: errno=65, total=2 in the last 20 sec
> Mar 16 14:30:40 failed to receive keepalives from other RE for the last 20 sec
>
> (then)
>
> Mar 16 14:35:37 received version 1, "claim mastership" request
> Mar 16 14:35:37 event = E_REQ_C, state = master, param = 0x0x0
> Mar 16 14:35:37 send "claim mastership" negative acknowledgement
> Mar 16 14:35:37 Currentstate master NextState master reason_code 1
> Mar 16 14:35:37 new state = master
> Mar 16 14:36:02 event = E_ORE_B, state = master, param = 0x0x835cca8
> Mar 16 14:36:02 Currentstate master NextState master reason_code 1
> Mar 16 14:36:02 new state = master
> Jun 14 15:43:07 event = E_ORE_M, state = master, param = 0x0x835cca8
> Jun 14 15:43:07 Duplicate Master Routing Engine
> Jun 14 15:43:07 mcontrol_disabled_exit
> Jun 14 15:43:07 mcontrol_shutdown
> Jun 14 15:43:07 mcontrol_notmaster
> Jun 14 15:43:10 *** mcontrol init V01 ***
> Jun 14 15:43:10 soft-restart: is not a master
> Jun 14 15:43:10 Socket = 0x00000011
> Jun 14 15:43:10 event = E_CFG_B, state = init, param = 0x0x0
> Jun 14 15:43:10 Currentstate init NextState backup reason_code 0
> Jun 14 15:43:10 new state = backup
> --
>
> content of /var/log/messages :
>
> Jun 14 15:42:15  JM20 init: mib-process (PID 13778) terminated by signal number 15!
> Jun 14 15:42:15  JM20 init: ntp (PID 13776) exited with status=0 Normal Exit
> Jun 14 15:42:15  JM20 init: chassis-control (PID 2578) exited with status=6
> Jun 14 15:42:15  JM20 init: chassis-control (PID 13816) started
> Jun 14 15:42:15  JM20 init: failure target for routing set to target 1
> Jun 14 15:42:15  JM20 init: routing (PID 13779) SIGTERM sent
> Jun 14 15:42:15  JM20 init: failure target for routing set to target 1
> Jun 14 15:42:15  JM20 init: routing (PID 13779) SIGTERM sent
> Jun 14 15:42:15  JM20 chassisd[13816]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(0)
> Jun 14 15:42:15  JM20 chassisd[13816]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(1)
> Jun 14 15:42:15  JM20 rpd[13779]: RPD_SIGNAL_TERMINATE: second termination signal received
> Jun 14 15:42:15  JM20 chassisd[13816]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(2)
> Jun 14 15:42:15  JM20 chassisd[13816]: CHASSISD_IFDEV_DETACH_FPC: ifdev_detach(3)
> Jun 14 15:42:15  JM20 chassisd[13816]: CHASSISD_IFDEV_DETACH_ALL_PSEUDO: ifdev_detach(pseudo devices: all)
> Jun 14 15:42:15  JM20 rpd[13779]: RPD_EXIT: Exit rpd[13779] version 7.0R2.7 built by builder on 2005-01-06 06:58:43 UTC, caller 80b14c3
> Jun 14 15:42:16  JM20 alarmd[2579]: chassisd connection succeeded after 1 retries
> Jun 14 15:42:16  JM20 craftd[2580]: chassisd connection succeeded after 1 retries
> Jun 14 15:42:16  JM20 alarmd[2579]: resending alarm state
> Jun 14 15:42:16  JM20 init: routing (PID 13779) exited with status=0 Normal Exit
> Jun 14 15:42:17  JM20 syslogd: sendto: No route to host
> Jun 14 15:42:17  JM20 craftd[2580]: attempt to delete alarm not in list
> Jun 14 15:42:17  JM20 craftd[2580]: forwarding display request to chassisd: type = 4, subtype = 44
> Jun 14 15:42:23  JM20 /kernel: mastership: routing engine 1 becoming master
> Jun 14 15:42:23  JM20 /kernel: mastership: routing engine 1 becoming master
> Jun 14 15:42:23  JM20 rshd[13894]: root at re0 as root: cmd='rcp -T -f /var/db/dcd.snmp_ix'
> Jun 14 15:42:24  JM20 syslogd: sendto: No route to host
> Jun 14 15:42:24  JM20 chassisd[13816]: CHASSISD_SNMP_TRAP10: SNMP trap generated: redundancy switchover (jnxRedundancyContentsIndex 6, jnxRedundancyL1Index 1, jnxRedundancyL2Index 0, jnxRedundancyL3Index 0, jnxRedundancyDescr SSB 0, jnxRedundancyConfig 2, jnxRedundancyState 2, jnxRedundancySwitchoverCount 1, jnxRedundancySwitchoverTime 777978144, jnxRedundancySwitchoverReason 2)
> Jun 14 15:42:24  JM20 chassisd[13816]: CHASSISD_SNMP_TRAP10: SNMP trap generated: redundancy switchover (jnxRedundancyContentsIndex 6, jnxRedundancyL1Index 2, jnxRedundancyL2Index 0, jnxRedundancyL3Index 0, jnxRedundancyDescr SSB 1, jnxRedundancyConfig 3, jnxRedundancyState 3, jnxRedundancySwitchoverCount 1, jnxRedundancySwitchoverTime 777978144, jnxRedundancySwitchoverReason 2)
> Jun 14 15:42:24  JM20 init: failure target for routing set to target 1
> Jun 14 15:42:24  JM20 init: interface-control (PID 13814) terminate signal sent
> Jun 14 15:42:24  JM20 init: ntp (PID 13897) started
> Jun 14 15:42:24  JM20 init: snmp (PID 13898) started
> Jun 14 15:42:24  JM20 init: mib-process (PID 13899) started
> Jun 14 15:42:24  JM20 init: routing (PID 13900) started
> Jun 14 15:42:24  JM20 init: sonet-aps (PID 13901) started
> Jun 14 15:42:24  JM20 init: vrrp (PID 13902) started
> Jun 14 15:42:24  JM20 init: sntpsync (PID 13810) SIGTERM sent
> Jun 14 15:42:24  JM20 init: pfe (PID 13813) terminate signal sent
> Jun 14 15:42:24  JM20 init: sampling (PID 13903) started
> Jun 14 15:42:24  JM20 init: ilmi (PID 13904) started
> Jun 14 15:42:24  JM20 init: remote-operations (PID 13905) started
> Jun 14 15:42:24  JM20 init: class-of-service (PID 13906) started
> Jun 14 15:42:24  JM20 init: network-access (PID 13907) started
> Jun 14 15:42:24  JM20 init: ipsec-key-management (PID 13908) started
> Jun 14 15:42:24  JM20 init: helper (PID 13909) started
> Jun 14 15:42:24  JM20 init: remote-hello (PID 13910) started
> Jun 14 15:42:24  JM20 init: link-management (PID 13911) started
> Jun 14 15:42:24  JM20 init: kernel-replication (PID 13811) SIGTERM sent
> Jun 14 15:42:24  JM20 init: firewall (PID 13815) terminate signal sent
> Jun 14 15:42:24  JM20 init: internal-routing-service (PID 13912) started
> Jun 14 15:42:24  JM20 init: routing-socket-proxy (PID 13913) started
> Jun 14 15:42:24  JM20 init: pic-services-logging (PID 13914) started
> Jun 14 15:42:24  JM20 init: adaptive-services (PID 13915) started
> Jun 14 15:42:24  JM20 init: pgm (PID 13916) started
> Jun 14 15:42:24  JM20 init: neighbor-liveness (PID 13917) started
> Jun 14 15:42:24  JM20 init: service-deployment (PID 13918) started
> Jun 14 15:42:24  JM20 init: failure target for routing set to target 1
> Jun 14 15:42:24  JM20 init: interface-control (PID 13814) terminate signal sent
> Jun 14 15:42:24  JM20 init: sntpsync (PID 13810) SIGTERM sent
> Jun 14 15:42:24  JM20 init: pfe (PID 13813) terminate signal sent
> Jun 14 15:42:24  JM20 init: kernel-replication (PID 13811) SIGTERM sent
> Jun 14 15:42:24  JM20 init: firewall (PID 13815) terminate signal sent
> Jun 14 15:42:24  JM20 init: firewall (PID 13815) exited with status=0 Normal Exit
> Jun 14 15:42:24  JM20 init: firewall (PID 13920) started
> Jun 14 15:42:24  JM20 init: pfe (PID 13813) terminated by signal number 15!
> Jun 14 15:42:24  JM20 init: pfe (PID 13921) started
> Jun 14 15:42:24  JM20 init: kernel-replication (PID 13811) exited with status=0 Normal Exit
> Jun 14 15:42:24  JM20 init: sntpsync (PID 13810) terminated by signal number 15!
> Jun 14 15:42:24  JM20 chassisd[13816]: snmp_ipc_try_connect: connect to master (unix sock) failed: Connection refused, retry in 1
> --
>
> The uptime of the SSB was 1 minute, and the REs were not affected. Then, one hour later, the problem repeated.
>
> If more information is needed please tell me.
>
> Thanks in advance,
>
> --
> Elian Scrosoppi
> escrosoppi at ifxcorp.com

