[j-nsp] PFE forwarding bug - PR1380183

Aaron Gould aaron1 at gvtc.com
Wed Aug 21 09:22:00 EDT 2019


Sure Adam,

here's logs after interface flap which JTAC and myself determined was the
initial bug trigger... xe-11/0/0:2 is a member of a (2) 10 gig lag ae...
xe-11/0/0:2 and xe-0/0/0:2 make the 20 gig bundle... only xe-11/0/0:2
flapped and I never saw traffic come back to that interface even though it
was up... no packets exited that interface after this bug hit... lacp
errors/timeout seen from that time stamp.  

{master}
agould at my-mx-960> show log messages.0.gz | find "Aug 14 21"
Aug 14 21:07:36  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672,
ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:37  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672,
ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:39  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672,
ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:42  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672,
ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:50  my-mx-960 fpc11 mqss_wo_coreif_conn_cfcnt_wait_for_zero:
Timeout occured while waiting for CFCNT value to become zero - conn_num 2
Aug 14 21:07:50  my-mx-960 fpc11 mqss_stream_out_disable: Waiting for CFCNT
value to become zero for WO connection failed - status 29, conn_num 2
Aug 14 21:07:50  my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling
PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 14 21:07:50  my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN
failed - status 29, ifd xe-11/0/0:2
Aug 14 21:07:50  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672,
ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:50  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672,
ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:53  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:08:23  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:10:22  my-mx-960 last message repeated 3 times
Aug 14 21:14:52  my-mx-960 last message repeated 9 times
Aug 14 21:15:22  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:15:52  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:17:52  my-mx-960 last message repeated 4 times
Aug 14 21:23:52  my-mx-960 last message repeated 12 times
Aug 14 21:24:07  my-mx-960 fpc11
mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while
waiting for available credits value to become initial value - conn_num 2,
credits 0, init_credits 3
Aug 14 21:24:07  my-mx-960 fpc11
mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available
credits value to become initial value for WO connection failed - status 29,
conn_num 2
Aug 14 21:24:07  my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush:
Starting traffic flush using WANIO core flush for PHY stream failed - status
29, stream_num 1094, chmac_speed 0, pr_stream 32
Aug 14 21:24:07  my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting
traffic flush for PHY stream using WANIO core failed - status 29, stream_num
1094
Aug 14 21:24:07  my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing
egress PHY stream disable operations for WANIO failed - status 29,
stream_num 1094
Aug 14 21:24:07  my-mx-960 fpc11 mqss_stream_out_disable: Performing egress
PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 14 21:24:07  my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling
PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 14 21:24:07  my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN
failed - status 29, ifd xe-11/0/0:2
Aug 14 21:24:07  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672,
ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:24:08  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672,
ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:24:23  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:24:53  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:26:53  my-mx-960 last message repeated 4 times
Aug 14 21:29:53  my-mx-960 last message repeated 6 times
Aug 14 21:30:23  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:30:53  my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp
current while timer expired current Receive State: CURRENT
Aug 14 21:32:53  my-mx-960 last message repeated 4 times
Aug 14 21:42:53  my-mx-960 last message repeated 20 times
Aug 14 21:43:53  my-mx-960 last message repeated 2 times

Tshooting....
- I bounced xe-11/0/0:2, nothing
- I bounced neighboring other 10 gig ints within that 4x10 breakout,
xe-11/0/0:0, xe-11/0/0:1, xe-11/0/0:3.... nothing... as a matter of fact, it
went from bad to worse.  By bouncing those neighboring other members of that
4x10 breakout, I broke all other interfaces on that vPIC of the MPC7E-MRATE
card !!!  I mean, all these went down ... (so in bouncing those neighboring
10's within that 4x10 , I broke one of my 100 gig interfaces in that PIC
group (namely, et-11/0/2)

agould at stlr-960> show interfaces terse | grep ^"xe|et" | grep ^..-11/0
xe-11/0/0:0             up    up
xe-11/0/0:0.0           up    up   aenet    --> ae50.0
xe-11/0/0:1             up    up
xe-11/0/0:1.0           up    up   aenet    --> ae50.0
xe-11/0/0:2             up    up
xe-11/0/0:2.0           up    up   aenet    --> ae61.0
xe-11/0/0:3             up    up
xe-11/0/0:3.0           up    up   aenet    --> ae100.0
et-11/0/2               up    up
et-11/0/2.0             up    up   aenet    --> ae1.0
et-11/0/5               up    down

here's what it showed when I tried that...
Aug 15 13:50:54  my-mx-960 rpd[14820]: RPD_LDP_NBRDOWN: LDP neighbor
111.222.129.10 (ae1.0) is down
Aug 15 13:50:54  my-mx-960 rpd[14820]: RPD_LDP_SESSIONDOWN: LDP session
111.222.128.7 is down, reason: all adjacencies down
Aug 15 13:50:54  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 733,
ifAdminStatus up(1), ifOperStatus down(2), ifName ae1
Aug 15 13:50:54  my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 719,
ifAdminStatus up(1), ifOperStatus down(2), ifName et-11/0/2
Aug 15 13:50:54  my-mx-960 rpd[14820]: RPD_OSPF_NBRDOWN: OSPF neighbor
111.222.129.10 (realm ospf-v2 ae1.0 area 0.0.0.1) state changed from Full to
Down due to KillNbr (event reason: interface went down)
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd et-11/0/5 218
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd xe-11/0/0:0 213
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd xe-11/0/0:1 214
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd xe-11/0/0:2 215
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd xe-11/0/0:3 216
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd et-11/0/2
#217 down with ASIC Error
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd et-11/0/5
#218 down with ASIC Error
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:0
#213 down with ASIC Error
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:1
#214 down with ASIC Error
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:2
#215 down with ASIC Error
Aug 15 13:50:54  my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:3
#216 down with ASIC Error
Aug 15 13:50:54  my-mx-960 fpc11 Cmerror Op Set: MQSS(0): MQSS(0): WANIO_CR:
Parity Protect: Parity error detected for Tx SRAM - detected_txlink 0x1
Aug 15 13:50:54  my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed.
Bringing down ifd et-11/0/2 217
Aug 15 13:50:54  my-mx-960 fpc11
mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while
waiting for available credits value to become initial value - conn_num 2,
credits 0, init_credits 3
Aug 15 13:50:54  my-mx-960 fpc11
mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available
credits value to become initial value for WO connection failed - status 29,
conn_num 2
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush:
Starting traffic flush using WANIO core flush for PHY stream failed - status
29, stream_num 1094, chmac_speed 0, pr_stream 32
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting
traffic flush for PHY stream using WANIO core failed - status 29, stream_num
1094
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing
egress PHY stream disable operations for WANIO failed - status 29,
stream_num 1094
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable: Performing egress
PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 15 13:50:54  my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling
PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 15 13:50:54  my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN
failed - status 29, ifd xe-11/0/0:2
Aug 15 13:50:54  my-mx-960 fpc11
mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while
waiting for available credits value to become initial value - conn_num 3,
credits 10, init_credits 3
Aug 15 13:50:54  my-mx-960 fpc11
mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available
credits value to become initial value for WO connection failed - status 29,
conn_num 3
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush:
Starting traffic flush using WANIO core flush for PHY stream failed - status
29, stream_num 1095, chmac_speed 0, pr_stream 33
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting
traffic flush for PHY stream using WANIO core failed - status 29, stream_num
1095
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing
egress PHY stream disable operations for WANIO failed - status 29,
stream_num 1095
Aug 15 13:50:54  my-mx-960 fpc11 mqss_stream_out_disable: Performing egress
PHY stream disable operations for WANIO failed - status 29, stream_num 1095
Aug 15 13:50:54  my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling
PHY stream for egress side failed - status 29, instance 0, phy_stream 1095
Aug 15 13:50:54  my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN
failed - status 29, ifd xe-11/0/0:3


-Aaron




More information about the juniper-nsp mailing list