[j-nsp] MX80-48T problems
Lazy
lazy404 at gmail.com
Mon Aug 11 08:28:18 EDT 2014
Hello,
One of our MX80s dropped its local IBGP session, and up to 40% of
traffic was dropped as well.
Both routers are running Junos 10.4R8.4.
The setup is simple: 2x MX80 (mx80 and mx80b) with multiple BGP
sessions, one IBGP session between mx80 and mx80b, and both of them
do some L2 bridging.
Local routing uses OSPF, and each router is connected to a 2x EX4200
VC over a 10G link.
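For reference, the relevant part of the config looks roughly like this
(a simplified sketch reconstructed from the logs below, not a paste of
the actual config; the peer address 10.0.3.2 and irb.3 come from the
log messages, while the group name, local address and BFD timer are
made up):

protocols {
    bgp {
        group internal {
            type internal;
            /* assumed: our side of the irb.3 subnet */
            local-address 10.0.3.1;
            /* mx80b, taken from the log messages below */
            neighbor 10.0.3.2 {
                bfd-liveness-detection {
                    /* illustrative timer */
                    minimum-interval 300;
                }
            }
        }
    }
    ospf {
        area 0.0.0.0 {
            interface irb.3;
            /* 10G link towards the EX4200 VC */
            interface xe-0/0/3.0;
        }
    }
}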
In the graphs we observed 100% RE CPU usage starting at 11:36:
Aug 11 11:36:21 mx80 pfed: PFED: rtslib sequence gap 925570 - 925569
Aug 11 11:36:23 mx80 rpd[1204]: bgp_recv: read from peer 10.0.3.2
(Internal AS 47544) failed: Connection reset by peer
Aug 11 11:36:23 mx80 rpd[1204]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP
peer 10.0.3.2 (Internal AS 47544) changed state from Established to
Idle (event Restart)
Aug 11 11:36:23 mx80 bfdd[1179]: BFDD_TRAP_SHOP_STATE_DOWN: local
discriminator: 148, new state: down, interface: irb.3, peer addr:
10.0.3.2
Aug 11 11:37:02 mx80 rpd[1204]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP
peer 10.0.3.2 (Internal AS 47544) changed state from OpenConfirm to
Established (event RecvKeepAlive)
After disabling BGP, CPU usage stayed at 100% for 2 hours, and rpd
kept logging scheduler slips (i.e. rpd was too busy to service its
queued jobs for tens of seconds at a time):
Aug 11 13:10:02 mx80 rpd[1204]: RPD_SCHED_SLIP: 23 sec scheduler
slip, user: 22 sec 137985 usec, system: 0 sec, 60852 usec
Aug 11 13:10:25 mx80 rpd[1204]: RPD_SCHED_SLIP: 22 sec scheduler
slip, user: 21 sec 644991 usec, system: 0 sec, 52989 usec
Aug 11 13:10:47 mx80 rpd[1204]: RPD_SCHED_SLIP: 22 sec scheduler
slip, user: 21 sec 143930 usec, system: 0 sec, 67534 usec
Aug 11 13:11:15 mx80 rpd[1204]: RPD_SCHED_SLIP: 28 sec scheduler
slip, user: 27 sec 540402 usec, system: 0 sec, 88358 usec
Aug 11 13:11:44 mx80 rpd[1204]: RPD_SCHED_SLIP: 28 sec scheduler
slip, user: 27 sec 502983 usec, system: 0 sec, 93646 usec
Aug 11 13:12:19 mx80 rpd[1204]: RPD_SCHED_SLIP: 35 sec scheduler
slip, user: 34 sec 332484 usec, system: 0 sec, 138946 usec
Aug 11 13:12:54 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 33 sec 869776 usec, system: 0 sec, 113869 usec
Aug 11 13:13:28 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 33 sec 400973 usec, system: 0 sec, 108975 usec
Aug 11 13:14:03 mx80 rpd[1204]: RPD_SCHED_SLIP: 34 sec scheduler
slip, user: 32 sec 966484 usec, system: 0 sec, 82881 usec
Aug 11 13:14:43 mx80 rpd[1204]: RPD_SCHED_SLIP: 40 sec scheduler
slip, user: 39 sec 293668 usec, system: 0 sec, 126469 usec
Aug 11 13:15:31 mx80 rpd[1204]: RPD_SCHED_SLIP: 47 sec scheduler
slip, user: 45 sec 703233 usec, system: 0 sec, 141001 usec
Aug 11 13:24:37 mx80 rpd[1204]: RPD_BFD_WRITE_ERROR: bfd_send: write
error on pipe to bfdd (Broken pipe)
Aug 11 13:24:37 mx80 rpd[1204]: RPD_PPM_WRITE_ERROR: ppm_send: write
error on pipe to ppmd (Broken pipe)
Aug 11 13:24:37 mx80 rpd[1204]: RPD_SCHED_SLIP: 128 sec scheduler
slip, user: 124 sec 230482 usec, system: 0 sec, 386016 usec
Aug 11 13:27:43 mx80 rpd[1204]: RPD_SCHED_SLIP: 185 sec scheduler
slip, user: 179 sec 832503 usec, system: 0 sec, 563086 usec
There were also short CPU usage peaks to 100% on FPC 0 and 1.
mx80> show chassis routing-engine
Routing Engine status:
Temperature 44 degrees C / 111 degrees F
CPU temperature 57 degrees C / 134 degrees F
DRAM 2048 MB
Memory utilization 84 percent
CPU utilization:
User 97 percent
Background 0 percent
Kernel 2 percent
Interrupt 1 percent
Idle 0 percent
Model RE-MX80-48T
Start time 2014-04-19 00:26:21 CEST
Uptime 114 days, 12 hours, 35 minutes, 23 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
1.25 1.13 1.05
I have followed
http://kb.juniper.net/InfoCenter/index?page=content&id=KB26261#rpd
rpd was using all available CPU:
last pid: 9551;  load averages: 1.04, 1.10, 1.05  up 114+12:38:22  13:04:13
54 processes: 2 running, 52 sleeping
CPU states: 98.5% user, 0.0% nice, 0.9% system, 0.6% interrupt, 0.0% idle
Mem: 977M Active, 381M Inact, 316M Wired, 136M Cache, 112M Buf, 178M Free
Swap: 2915M Total, 2915M Free
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
1204 root 1 132 0 1055M 1026M RUN 48.1H 96.92% rpd
1245 root 2 96 0 16992K 6040K select 656:40 0.00% pfed
1246 root 1 96 0 19568K 9636K select 650:55 0.00% mib2d
1247 root 1 96 0 16348K 11612K select 609:46 0.00% snmpd
1175 root 1 96 0 6500K 3160K select 384:57 0.00% ppmd
1165 root 1 96 0 26916K 7928K select 225:31 0.00% chassisd
1232 root 3 20 0 66724K 36360K sigwai 208:02 0.00% jpppd
1235 root 1 4 0 8192K 4028K kqread 172:35 0.00% mcsnoopd
1305 root 1 96 0 18752K 8512K select 136:27 0.00% l2ald
1182 root 1 96 0 3712K 932K select 122:36 0.00% license-check
1179 root 1 96 0 7876K 3424K select 120:45 0.00% bfdd
1162 root 1 96 0 1980K 836K select 61:44 0.00% bslockd
1166 root 1 96 0 5512K 2008K select 34:06 0.00% alarmd
1243 root 1 96 0 4820K 1944K select 16:44 0.00% irsd
1248 root 1 96 0 33272K 9532K select 15:37 0.00% dcd
1226 root 1 96 0 6108K 2836K select 14:00 0.00% cfmd
1012 root 1 96 0 5232K 2208K select 13:34 0.00% eventd
1239 root 1 4 0 42528K 10296K kqread 11:54 0.00% rpd
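The next step for narrowing down which rpd task is eating the CPU
would be rpd task accounting (I have not captured that output here);
for reference, something along these lines from operational mode:

mx80> set task accounting on
mx80> show task accounting detail
mx80> set task accounting off

(leave accounting on for a minute or two before looking at the detail;
it adds some overhead on an already busy RE, so switch it off again
afterwards)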
At that time there were no route changes:
root at mx80% rtsockmon -t
sender flag type op
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.26.b9.54.5.d2 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.1e.4f.2d.b9.cf tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.10.0.0.0.1e.c9.cb.95.d1 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=718 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.26.b9.2a.74.f8 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.c.29.78.74.d2 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.24.e8.4e.e8.7c tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.0.1e.4f.2a.35.20 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:47] l2ald P route delete vpls
0.c.0.0.a4.ba.db.9.98.32 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:49] l2ald P route delete vpls
0.18.0.0.0.1e.f7.f7.66.80 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:49] l2ald P route delete vpls
0.c.0.0.78.2b.cb.1b.1d.80 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=720 altfwdnhidx=0 filtidx=0
[12:30:50] l2ald P route add vpls
0.18.0.0.0.1e.f7.f7.66.80 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route delete vpls
0.18.0.0.1c.df.f.f4.86.81 tid=5 plen=80 type=user flags=0x180 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route add vpls
0.d.0.0.78.2b.cb.30.14.58 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=721 altfwdnhidx=0 filtidx=0
[12:30:51] l2ald P route add vpls
0.18.0.0.1c.df.f.f4.86.81 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=702 altfwdnhidx=0 filtidx=0
[12:30:52] l2ald P route add vpls
0.9.0.0.4.7d.7b.fc.e5.19 tid=5 plen=80 type=user flags=0x0 nh=ucst
nhflags=0x0 nhidx=707 altfwdnhidx=0 filtidx=0
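The l2ald entries above are just MAC route add/delete updates in the
bridge domains, and at this rate they do not look like a MAC learning
storm. A rough sanity check of the MAC table sizes (output not pasted
here) would be something like:

mx80> show bridge mac-table count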
There were no interface flaps at the time.
mx80> show task memory
Memory Size (kB) %Available When
Currently In Use: 698741 39% now
Maximum Ever Used: 1071432 60% 14/04/29 01:30:09
Available: 1769832 100% now
Swap was not used:
root at mx80% vmstat -s
2800873200 cpu context switches
4170573092 device interrupts
524075416 software interrupts
3104628462 traps
2154475645 system calls
67 kernel threads created
341523 fork() calls
167224 vfork() calls
0 rfork() calls
0 swap pager pageins
0 swap pager pages paged in
0 swap pager pageouts
0 swap pager pages paged out
25706 vnode pager pageins
25751 vnode pager pages paged in
12486 vnode pager pageouts
57909 vnode pager pages paged out
0 page daemon wakeups
0 pages examined by the page daemon
4515 pages reactivated
14019354 copy-on-write faults
11 copy-on-write optimized faults
16293592 zero fill pages zeroed
16087177 zero fill pages prezeroed
1013 intransit blocking page faults
950170073 total VM faults taken
0 pages affected by kernel thread creation
19198331 pages affected by fork()
50568387 pages affected by vfork()
0 pages affected by rfork()
73932805 pages freed
0 pages freed by daemon
23464865 pages freed by exiting processes
250096 pages active
97497 pages inactive
34720 pages in VM cache
80975 pages wired down
45450 pages free
4096 bytes per page
0 swap pages used
0 peak swap pages used
436260150 total name lookups
cache hits (94% pos + 5% neg) system 0% per-directory
deletions 0%, falsehits 0%, toolong 0%
root at mx80% vmstat 1
procs  memory       page                    disks    faults     cpu
r b w  avm    fre   flt re pi po fr sr  da0 da1  in  sy  cs  us sy id
2 0 0 2697708 320644 96 0 0 0 7 0 0 0 421 218 283 16 15 69
1 0 0 2697708 320640 9 0 0 0 0 0 0 1 296 496 596 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 286 495 587 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 1 259 438 558 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 248 605 815 99 1 0
1 0 0 2697708 320640 0 0 0 0 0 0 0 0 303 645 971 98 2 0
1 0 0 2697708 320640 2 0 0 0 0 0 0 0 257 489 545 99 1 0
1 0 0 2697708 320640 96 0 0 0 0 0 0 0 306 566 607 99 1 0
There are no coredumps.
In the dmesg:
ifd_pfestat_postprocess: Post process req id=32721 relay id=0 cflags = 0
ifd_pfestat_complete: Completed req id=32721 relay id=0 cflags = 0
if_msg_ifd_bchip_stats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140 0)
if_msg_ge_get_info: xe-0/0/3 #140
if_msg_ifd_stats(xe-0/0/3 #140)
if_msg_ifd_lcc_stats(xe-0/0/3 #140)
if_msg_ifd_bchip_stats(xe-0/0/3 #140) - reply(idx=0)
if_msg_ifd_bchip_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_msg_ge_get_info:(xe-0/0/3 #140) - reply
if_pfe_msg_if_reply(xe-0/0/3 #140) - reply(idx=0) reqidx = 32731 peerid = 0
if_msg_ifd_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0) reqidx =
32731 peerid = 0
if_msg_ifd_stats_ipc_handler: ifd_name xe-0/0/3 residual stats ip 0,
op 0, ib 0ob 0, im 0, om 0
ifd_pfestat_postprocess: Post process req id=32731 relay id=0 cflags = 0
ifd_pfestat_complete: Completed req id=32731 relay id=0 cflags = 0
if_msg_ifd_bchip_stats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140)
if_msg_ifd_qstats(xe-0/0/3 #140 0)
if_msg_ge_get_info: xe-0/0/3 #140
if_msg_ifd_stats(xe-0/0/3 #140)
if_msg_ifd_lcc_stats(xe-0/0/3 #140)
if_msg_ifd_bchip_stats(xe-0/0/3 #140) - reply(idx=0)
if_msg_ifd_bchip_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_pfe_msg_flip_ifd_q_stats(xe-0/0/3 #140 #65535) - reply(idx=0)
if_msg_ge_get_info:(xe-0/0/3 #140) - reply
if_pfe_msg_if_reply(xe-0/0/3 #140) - reply(idx=0) reqidx = 32735 peerid = 0
if_msg_ifd_stats_ipc_handler(xe-0/0/3 #140) - reply(idx=0) reqidx =
32735 peerid = 0
if_msg_ifd_stats_ipc_handler: ifd_name xe-0/0/3 residual stats ip 0,
op 0, ib 0ob 0, im 0, om 0
Can the "pfed: PFED: rtslib sequence gap 925570 - 925569" message suggest some hardware issue?
Thanks for any answers.
--
Michal Grzedzicki