[j-nsp] MX80-48T problems
Lazy
lazy404 at gmail.com
Mon Aug 11 08:28:18 EDT 2014
Hello,
One of our MX80s dropped its local IBGP session, and up to 40% of
traffic was dropped as well.
Both routers are running Junos 10.4R8.4.
The setup is simple: 2x MX80 (mx80 and mx80b) with multiple BGP
sessions, one IBGP session between mx80 and mx80b, and both of them
do some L2 bridging.
Local routing uses OSPF, and each router is connected to a 2x EX4200
VC over a 10G link.
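For reference, the relevant part of the config looks roughly like this
(a simplified sketch reconstructed from the logs below, not a paste of
the actual config; the peer address 10.0.3.2 and irb.3 come from the
log messages, while the group name, local address and BFD timer are
made up):

protocols {
    bgp {
        group internal {
            type internal;
            /* assumed: our side of the irb.3 subnet */
            local-address 10.0.3.1;
            /* mx80b, taken from the log messages below */
            neighbor 10.0.3.2 {
                bfd-liveness-detection {
                    /* illustrative timer */
                    minimum-interval 300;
                }
            }
        }
    }
    ospf {
        area 0.0.0.0 {
            interface irb.3;
            /* 10G link towards the EX4200 VC */
            interface xe-0/0/3.0;
        }
    }
}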
In the graphs we observed 100% RE CPU usage starting at 11:36:
Aug 11 11:36:21 mx80 pfed: PFED: rtslib sequence gap 925570 - 925569
Aug 11 11:36:23 mx80 rpd[1204]: bgp_recv: read from peer 10.0.3.2
(Internal AS 47544) failed: Connection reset by peer
Aug 11 11:36:23 mx80 rpd[1204]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP
peer 10.0.3.2 (Internal AS 47544) changed state from Established to
Idle (event Restart)
Aug 11 11:36:23 mx80 bfdd[1179]: BFDD_TRAP_SHOP_STATE_DOWN: local
discriminator: 148, new state: down, interface: irb.3, peer addr:
10.0.3.2
Aug 11 11:37:02 mx80 rpd[1204]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP
peer 10.0.3.2 (Internal AS 47544) changed state from OpenConfirm to
Established (event RecvKeepAlive)
After disabling BGP, CPU usage stayed at 100% for 2 hours, and rpd
kept logging scheduler slips (i.e. rpd was too busy to service its
queued jobs for tens of seconds at a time):
Aug 11 13:10:02 mx80 rpd[1204]: RPD_SCHED_SLIP: 23 sec scheduler
slip, user: 22 sec 137985 usec, system: 0 sec, 60852 usec
Aug 11 13:10:25 mx80 rpd[1204]: RPD_SCHED_SLIP: 22 sec scheduler
slip, user: 21 sec 644991 usec, system: 0 sec, 52989 usec
Aug 11 13:10:47 mx80 rpd[1204]: RPD_SCHED_SLIP: 22 sec scheduler
slip, user: 21 sec 143930 usec, system: 0 sec, 67534 usec
Aug 11 13:11:15 mx80 rpd[1204]: RPD_SCHED_SLIP: 28 sec scheduler
slip, user: 27 sec 540402 usec, system: 0 sec, 88358 usec
Aug 11 13:11:44 mx80 rpd[1204]: RPD_SCHED_SLIP: 28 sec scheduler
slip, user: 27 sec 502983 usec, system: 0 sec, 93646 usec
Aug 11 13:12:19 mx80 rpd[1204]: RPD_SCHED_SLIP: 35 sec scheduler
slip, user: 34 sec 332484 usec, system: 0 sec, 138946 usec
Aug 11 13:12:54 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 33 sec 869776 usec, system: 0 sec, 113869 usec
Aug 11 13:13:28 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 33 sec 400973 usec, system: 0 sec, 108975 usec
Aug 11 13:14:03 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 32 sec 966484 usec, system: 0 sec, 82881 usec
Aug 11 13:14:43 mx80 rpd[1204]: RPD_SCHED_SLIP: 40 sec scheduler
slip, user: 39 sec 293668 usec, system: 0 sec, 126469 usec
Aug 11 13:15:31 mx80 rpd[1204]: RPD_SCHED_SLIP: 47 sec scheduler
slip, user: 45 sec 703233 usec, system: 0 sec, 141001 usec
Aug 11 13:24:37 mx80 rpd[1204]: RPD_BFD_WRITE_ERROR: bfd_send: write
error on pipe to bfdd (Broken pipe)
Aug 11 13:24:37 mx80 rpd[1204]: RPD_PPM_WRITE_ERROR: ppm_send: write
error on pipe to ppmd (Broken pipe)
Aug 11 13:24:37 mx80 rpd[1204]: RPD_SCHED_SLIP: 128 sec scheduler
slip, user: 124 sec 230482 usec, system: 0 sec, 386016 usec
Aug 11 13:27:43 mx80 rpd[1204]: RPD_SCHED_SLIP: 185 sec scheduler
slip, user: 179 sec 832503 usec, system: 0 sec, 563086 usec
There were also short CPU usage peaks to 100% on FPC 0 and 1.
mx80> show chassis routing-engine
Routing Engine status:
Temperature 44 degrees C / 111 degrees F
CPU temperature 57 degrees C / 134 degrees F
DRAM 2048 MB
Memory utilization 84 percent
CPU utilization:
User 97 percent
Background 0 percent
Kernel 2 percent
Interrupt 1 percent
Idle 0 percent
Model RE-MX80-48T
Start time 2014-04-19 00:26:21 CEST
Uptime 114 days, 12 hours, 35 minutes, 23 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
1.25 1.13 1.05
I have followed
http://kb.juniper.net/InfoCenter/index?page=content&id=KB26261#rpd
rpd was using all available CPU:
last pid: 9551;  load averages: 1.04, 1.10, 1.05  up 114+12:38:22  13:04:13
54 processes: 2 running, 52 sleeping
CPU states: 98.5% user, 0.0% nice, 0.9% system, 0.6% interrupt, 0.0% idle
Mem: 977M Active, 381M Inact, 316M Wired, 136M Cache, 112M Buf, 178M Free
Swap: 2915M Total, 2915M Free
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
1204 root 1 132 0 1055M 1026M RUN 48.1H 96.92% rpd
1245 root 2 96 0 16992K 6040K select 656:40 0.00% pfed
1246 root 1 96 0 19568K 9636K select 650:55 0.00% mib2d
1247 root 1 96 0 16348K 11612K select 609:46 0.00% snmpd
1175 root 1 96 0 6500K 3160K select 384:57 0.00% ppmd
1165 root 1 96 0 26916K 7928K select 225:31 0.00% chassisd
1232 root 3 20 0 66724K 36360K sigwai 208:02 0.00% jpppd
1235 root 1 4 0 8192K 4028K kqread 172:35 0.00% mcsnoopd
1305 root 1 96 0 18752K 8512K select 136:27 0.00% l2ald
1182 root 1 96 0 3712K 932K select 122:36 0.00% license-check
1179 root 1 96 0 7876K 3424K select 120:45 0.00% bfdd
1162 root 1 96 0 1980K 836K select 61:44 0.00% bslockd
1166 root 1 96 0 5512K 2008K select 34:06 0.00% alarmd
1243 root 1 96 0 4820K 1944K select 16:44 0.00% irsd
1248 root 1 96 0 33272K 9532K select 15:37 0.00% dcd
1226 root 1 96 0 6108K 2836K select 14:00 0.00% cfmd
1012 root 1 96 0 5232K 2208K select 13:34 0.00% eventd
1239 root 1 4 0 42528K 10296K kqread 11:54 0.00% rpd
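The next step for narrowing down which rpd task is eating the CPU
would be rpd task accounting (I have not captured that output here);
for reference, something along these lines from operational mode:

mx80> set task accounting on
mx80> show task accounting detail
mx80> set task accounting off

(leave accounting on for a minute or two before looking at the detail;
it adds some overhead on an already busy RE, so switch it off again
afterwards)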
At that time there were no route changes:
root at mx80% rtsockmon -t
sender flag type op
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.26.b9.54.5.d2 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.1e.4f.2d.b9.cf tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.10.0.0.0.1e.c9.cb.95.d1 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=718 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.26.b9.2a.74.f8 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.c.29.78.74.d2 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.24.e8.4e.e8.7c tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.1e.4f.2a.35.20 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.a4.ba.db.9.98.32 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:49] l2ald P route delete vpls
0.18.0.0.0.1e.f7.f7.66.80 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:49] l2ald P route delete vpls
0.c.0.0.78.2b.cb.1b.1d.80 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:50] l2ald P route add vpls
0.18.0.0.0.1e.f7.f7.66.80 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route delete vpls
0.18.0.0.1c.df.f.f4.86.81 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route add vpls
0.d.0.0.78.2b.cb.30.14.58 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=721 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route add vpls
0.18.0.0.1c.df.f.f4.86.81 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:52] l2ald P route add vpls
0.9.0.0.4.7d.7b.fc.e5.19 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=707 altfwdnhidx=0 filtidx=0
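The l2ald entries above are just MAC route add/delete updates in the
bridge domains, and at this rate they do not look like a MAC learning
storm. A rough sanity check of the MAC table sizes (output not pasted
here) would be something like:

mx80> show bridge mac-table count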
There were no interface flaps at the time.
mx80> show task memory
Memory Size (kB) %Available When
Currently In Use: 698741 39% now
Maximum Ever Used: 1071432 60% 14/04/29 01:30:09
Available: 1769832 100% now
Swap was not used:
root at mx80% vmstat -s
2800873200 cpu context switches
4170573092 device interrupts
524075416 software interrupts
3104628462 traps
2154475645 system calls
67 kernel threads created
341523 fork() calls
167224 vfork() calls
0 rfork() calls
0 swap pager pageins
0 swap pager pages paged in
0 swap pager pageouts
0 swap pager pages paged out
25706 vnode pager pageins
25751 vnode pager pages paged in
12486 vnode pager pageouts
57909 vnode pager pages paged out
0 page daemon wakeups
0 pages examined by the page daemon
4515 pages reactivated
14019354 copy-on-write faults
11 copy-on-write optimized faults
16293592 zero fill pages zeroed
16087177 zero fill pages prezeroed
1013 intransit blocking page faults
950170073 total VM faults taken
0 pages affected by kernel thread creation
19198331 pages affected by fork()
50568387 pages affected by vfork()
0 pages affected by rfork()
73932805 pages freed
0 pages freed by daemon
23464865 pages freed by exiting processes
250096 pages active
97497 pages inactive
34720 pages in VM cache
80975 pages wired down
45450 pages free
4096 bytes per page
0 swap pages used
0 peak swap pages used
436260150 total name lookups
cache hits (94% pos + 5% neg) system 0% per-directory
deletions 0%, falsehits 0%, toolong 0%
root at mx80% vmstat 1
procs  memory       page                    disks    faults     cpu
r b w  avm    fre   flt re pi po fr sr  da0 da1  in  sy  cs  us sy id
2 0 0 2697708 320644 96 0 0 0 7 0 0 0 421 218 283 16 15 69
1 0 0 2697708 320640 9 0 0 0 0 0 0 1 296 496 596 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 286 495 587 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 1 259 438 558 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 248 605 815 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 303 645 971 98 2 0
1 0 0 2697708 320640 2 0 0 0 0 0 0 0 257 489 545 99 1 0
1 0 0 2697708 320640 96 0 0 0 0 0 0 0 306 566 607 99 1 0
There are no coredumps.
In the dmesg:
ifd_pfestat_postprocess: Post process req id=32721 relay id=0 cflags = 0
ifd_pfestat_complete: Completed req id=32721 relay id=0 cflags = 0
if_msg_ifd_bchip_stats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140 0)
if_msg_ge_get_info: xe-0/0/3 #140
if_msg_ifd_stats(xe-0/0/3 #140)
if_msg_ifd_lcc_stats(xe-0/0/3 #140)
if_msg_ifd_bchip_stats(xe-0/0/3 #140) - reply(idx=0)
if_msg_ifd_bchip_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_msg_ge_get_info:(xe-0/0/3 #140) - reply
if_pfe_msg_if_reply(xe-0/0/3 #140) - reply(idx=0) reqidx = 32731 peerid = 0
if_msg_ifd_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0) reqidx =
32731 peerid = 0
if_msg_ifd_stats_ipc_handler: ifd_name xe-0/0/3 residual stats ip 0,
op 0, ib 0ob 0, im 0, om 0
ifd_pfestat_postprocess: Post process req id=32731 relay id=0 cflags = 0
ifd_pfestat_complete: Completed req id=32731 relay id=0 cflags = 0
if_msg_ifd_bchip_stats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140 0)
if_msg_ge_get_info: xe-0/0/3 #140
if_msg_ifd_stats(xe-0/0/3 #140)
if_msg_ifd_lcc_stats(xe-0/0/3 #140)
if_msg_ifd_bchip_stats(xe-0/0/3 #140) - reply(idx=0)
if_msg_ifd_bchip_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_msg_ge_get_info:(xe-0/0/3 #140) - reply
if_pfe_msg_if_reply(xe-0/0/3 #140) - reply(idx=0) reqidx = 32735 peerid = 0
if_msg_ifd_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0) reqidx =
32735 peerid = 0
if_msg_ifd_stats_ipc_handler: ifd_name xe-0/0/3 residual stats ip 0,
op 0, ib 0ob 0, im 0, om 0
Can the "pfed: PFED: rtslib sequence gap 925570 - 925569" message suggest some hardware issue?
Thanks for any answers.
--
Michal Grzedzicki