[j-nsp] MX80 Sampling - High CPU
Ritz Rojas
ritzrojas at gmail.com
Tue Sep 23 12:30:40 EDT 2014
We have a few MX80s (MX80-48T) that we're looking to deploy in certain
applications where they'll be taking full Internet tables (v4 and v6). We
also have a need to gather flow data on our routers, and have noticed an
interesting trend in the lab.
We are not using an MS-MIC currently.
This test box is running 12.3R7.7 at the moment, but we've seen this same
thing in 11.4 too.
With full Internet routes loaded and sampling enabled, every commit, for
any change at all, has RPD and sampled taking turns grinding the CPU up to
100% for 5-10 minutes or more afterwards. Changes to BGP policy sometimes
stall and take several minutes or more to actually take effect.
First RPD climbs to almost 100% CPU utilization and chews on it for a few
minutes, then it drops and sampled climbs to almost 100% for its own
couple-minute turn. Then sampled goes back down and RPD takes over at 100%
for a few more minutes. Eventually it all calms down and returns to
expected levels.
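For anyone wanting to put numbers on this alternation, one rough approach is to poll per-process CPU at a fixed interval (for example by scraping "show system processes extensive") and measure how long each daemon stays pegged. Below is a minimal Python sketch of the measuring step only. The function name, the 90% threshold, and the sample values are all made up for illustration; they are not measurements from this box.

```python
def busy_periods(samples, threshold=90.0):
    """Return (process, first_ts, last_ts) for each run of consecutive
    polls where a process is at or above `threshold` percent CPU.

    samples: iterable of (timestamp_seconds, process_name, cpu_percent),
    in timestamp order.
    """
    by_proc = {}
    for ts, proc, cpu in samples:
        by_proc.setdefault(proc, []).append((ts, cpu))

    periods = []
    for proc, points in by_proc.items():
        start = last = None
        for ts, cpu in points:
            if cpu >= threshold:
                if start is None:
                    start = ts          # a new busy run begins
                last = ts               # extend the current run
            elif start is not None:
                periods.append((proc, start, last))
                start = None
        if start is not None:           # run still open at end of data
            periods.append((proc, start, last))
    return sorted(periods, key=lambda p: p[1])


# Hypothetical polls, 60 seconds apart (illustrative values only).
samples = [
    (0, "rpd", 98.0), (0, "sampled", 2.0),
    (60, "rpd", 97.0), (60, "sampled", 3.0),
    (120, "rpd", 5.0), (120, "sampled", 96.0),
    (180, "rpd", 4.0), (180, "sampled", 95.0),
    (240, "rpd", 99.0), (240, "sampled", 1.0),
    (300, "rpd", 3.0), (300, "sampled", 2.0),
]

for proc, start, end in busy_periods(samples):
    print(f"{proc}: busy from t={start}s to t={end}s")
```

Running the same polling once with sampling enabled and once with it disabled would give a like-for-like comparison of post-commit busy time.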
Turn off sampling, and any CPU spikes post-commit are only on the order of
seconds, not minutes, and any policy changes take effect pretty much
immediately.
We've seen this regardless of how flow is configured; we've tried a
"simple" config as well as inline jflow, with pretty much the same
results. We're curious if anyone's had any of these same problems with
jflow killing the CPU on MX80s (yeah, I know these PPC boxes are pretty
weak sisters), and if there's any fix beyond the usual "Doctor, it hurts
when I do this, what should I do?" "Don't do that!"
It's a nice feature, shame that using it seems to come with this heavy a
price.
As an aside, we also see a bit of a slowdown in the RIB/FIB
learning/purging on BGP session turnup/reset, which we're well aware is a
known issue with sampling enabled, so I won't be shocked if this is just
"how it is". I'd love to be wrong.
Here's our sampling config, quick and dirty, regular and inline jflow, in
case we're missing something.
"Normal" Sampling:
router> show configuration forwarding-options
sampling {
    input {
        rate 8192;
        run-length 0;
        max-packets-per-second 20000;
    }
    family inet {
        output {
            flow-server x.x.x.x {
                port xxxxx;
                version 5;
            }
        }
    }
}
router> show configuration interfaces xe-0/0/0
unit xxx {
    vlan-id xxx;
    family inet {
        sampling {
            input;
            output;
        }
    }
}
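As a sanity check on the "normal" sampling input settings, here is a small Python sketch of the expected sampled packet rate. It assumes the usual Junos relationship where the sampled fraction is (run-length + 1) / rate, capped by max-packets-per-second. The function name and the 1 Mpps traffic figure are hypothetical, picked only to show the arithmetic.

```python
def sampled_pps(traffic_pps, rate, run_length=0, max_pps=None):
    """Estimate packets per second handed to sampling.

    Assumes the sampled fraction is (run_length + 1) / rate and that
    max_pps, when given, is a hard ceiling on the result.
    """
    expected = traffic_pps * (run_length + 1) / rate
    if max_pps is not None:
        expected = min(expected, max_pps)
    return expected


# rate 8192, run-length 0, max-packets-per-second 20000, with a
# hypothetical 1,000,000 pps of traffic on the sampled interface.
print(sampled_pps(1_000_000, rate=8192, run_length=0, max_pps=20000))
```

With those assumed inputs the estimate is roughly 122 sampled packets per second, far below the 20000 cap.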
Inline Jflow Sampling:
router> show configuration forwarding-options
sampling {
    instance {
        BLAH-INSTANCE {
            input {
                rate 5000;
            }
            family inet {
                output {
                    flow-server x.x.x.x {
                        port xxxx;
                        autonomous-system-type origin;
                        no-local-dump;
                        version-ipfix {
                            template {
                                BLAH-TEMPLATE;
                            }
                        }
                    }
                    inline-jflow {
                        source-address x.x.x.x;
                    }
                }
            }
        }
    }
}
router> show configuration chassis
tfeb {
    slot 0 {
        sampling-instance BLAH-INSTANCE;
    }
}
router> show configuration services
flow-monitoring {
    version-ipfix {
        template BLAH-TEMPLATE {
            flow-active-timeout 10;
            flow-inactive-timeout 10;
            template-refresh-rate {
                packets 10000;
                seconds 10;
            }
            option-refresh-rate {
                packets 10000;
                seconds 10;
            }
            ipv4-template;
        }
    }
}
router> show configuration interfaces xe-0/0/0
unit xxx {
    vlan-id xxx;
    family inet {
        sampling {
            input;
            output;
        }
    }
}