[j-nsp] What exactly causes inconsistent RTT seen using ping utility in Junos?

Thu May 2 18:42:19 EDT 2019

As I had access to lab MX960, then I made a similar packet capture as
described in my Apr 25 e-mail. Only difference is that I didn't
capture packets on forwarding-plane Ethernet interface facing the RE,
but on RE Ethernet interface facing the forwarding plane, i.e "start
shell sh" in Junos and then "su" and "tcpdump -ttt -nei em0 'ip proto
84 && src host 128.0.0.1 && ip[86:2] = 0x0800'" where em0 connects to
SCB integrated switch. As with vMX, the delay does not come from the
RE.

I also started following two commands and let them collect ping
results and line-card CPU utilization over night:

$ ssh -qT mx960 'ping 10.66.66.1 source 10.66.66.2' | grep
--line-buffered -oP '(?<=time=)[0-9.]+' | \
> while read -r rtt; do time=$(printf '%(%s)T'); printf '%s %s\n' "$time" "$rtt" >> MX960_ping_results; done

$ while true; do \
>   regex='([0-9]+)% \(1 second\), ([0-9]+)% \(5 second\)'; \
>   while read -r line; do \
>     if [[ $line =~ $regex ]]; then \
>       time=$(printf '%(%s)T'); \
>       printf '%s %s\n' "$time" "${BASH_REMATCH[1]}" >> LC_CPU_util_1s; \
>       printf '%s %s\n' "$time" "${BASH_REMATCH[2]}" >> LC_CPU_util_5s; \
>     fi; \
>   done < <(timeout 1 ssh -qT mx960 'request pfe execute target fpc3 command "show sched"'); \
> done

First command logs ping("ping 10.66.66.1 source 10.66.66.2" command in
Junos) results with Unix timestamps into file. File content will be:

1556775780 0.586
1556775781 0.579
1556775782 0.607
/* output removed for brevity */

Second command logs 1 second and 5 second LC CPU utilization value
from the PFE3(egress interface for 10.66.66.1 is ge-3/3/9) CLI "show
sched" command into two files. This means that both 1 second and 5
second CPU utilization log files have similar content:

1556775780 9
1556775782 9
1556775783 9
/* output removed for brevity */

One 360 seconds long period with three high-RTT cases can be seen
here: https://i.imgur.com/TZeIbDd.png Left-hand y-axis is RTT and
right-hand y-axis is CPU utilization in percentages. Again, there
doesn't seem to be much correlation between line-card CPU utilization
and high RTT..

At this point I'm fairly sure that this occasional RTT is introduced
in line-card and not in RE, but I'm not convinced that it has anything
to do with the CPU utilization on line-card running the microkernel.
However, I'm not sure if it is possible to debug this much further
without access to microkernel source code and knowing the forwarding
plane internals. There is for example possibility to make a primitive
packet capture inside the microkernel("show pfe host-outbound
packets"), but it does not allow filtering and printing of timestamps.

Again, please let me know if my explanations weren't clear or if there
are any additional ideas.

thanks,
Martin