[j-nsp] What exactly causes inconsistent RTT seen using ping utility in Junos?

Martin T m4rtntns at gmail.com
Wed Apr 24 17:17:20 EDT 2019


Hi,

I built a setup where a vMX (local-as 64512) has the following 20 eBGP
neighbors over point-to-point connections:

AS64513: vmx1[ge-0/0/6.13] <-> [ge-0.0.6.13]PC
AS64514: vmx1[ge-0/0/6.14] <-> [ge-0.0.6.14]PC
AS64515: vmx1[ge-0/0/6.15] <-> [ge-0.0.6.15]PC
AS64516: vmx1[ge-0/0/6.16] <-> [ge-0.0.6.16]PC
AS64517: vmx1[ge-0/0/6.17] <-> [ge-0.0.6.17]PC
AS64518: vmx1[ge-0/0/6.18] <-> [ge-0.0.6.18]PC
AS64519: vmx1[ge-0/0/6.19] <-> [ge-0.0.6.19]PC
AS64520: vmx1[ge-0/0/6.20] <-> [ge-0.0.6.20]PC
AS64521: vmx1[ge-0/0/6.21] <-> [ge-0.0.6.21]PC
AS64522: vmx1[ge-0/0/6.22] <-> [ge-0.0.6.22]PC
AS64523: vmx1[ge-0/0/7.23] <-> [ge-0.0.7.23]PC
AS64524: vmx1[ge-0/0/7.24] <-> [ge-0.0.7.24]PC
AS64525: vmx1[ge-0/0/7.25] <-> [ge-0.0.7.25]PC
AS64526: vmx1[ge-0/0/7.26] <-> [ge-0.0.7.26]PC
AS64527: vmx1[ge-0/0/7.27] <-> [ge-0.0.7.27]PC
AS64528: vmx1[ge-0/0/7.28] <-> [ge-0.0.7.28]PC
AS64529: vmx1[ge-0/0/7.29] <-> [ge-0.0.7.29]PC
AS64530: vmx1[ge-0/0/7.30] <-> [ge-0.0.7.30]PC
AS64531: vmx1[ge-0/0/7.31] <-> [ge-0.0.7.31]PC
AS64532: vmx1[ge-0/0/7.32] <-> [ge-0.0.7.32]PC
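
For reference, each of these is a plain external BGP session on a
logical unit of ge-0/0/6 or ge-0/0/7. A minimal sketch of one such
session in Junos set syntax (the VLAN ID and the /31 addressing below
are placeholders for illustration only, not the values I actually use):

  set interfaces ge-0/0/6 vlan-tagging
  set interfaces ge-0/0/6 unit 13 vlan-id 13
  set interfaces ge-0/0/6 unit 13 family inet address 198.51.100.0/31
  set routing-options autonomous-system 64512
  set protocols bgp group pc-as64513 type external
  set protocols bgp group pc-as64513 peer-as 64513
  set protocols bgp group pc-as64513 neighbor 198.51.100.1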

The vMX has an evaluation license installed in order to avoid the low
maximum-prefix limitation. On the PC there is a BGP daemon (BIRD)
running in such a way that a first set of 4000 prefixes oscillates
between the sessions with AS64513 and AS64523, i.e. between the
vmx1[ge-0/0/6.13] <-> [ge-0.0.6.13]PC and vmx1[ge-0/0/7.23] <->
[ge-0.0.7.23]PC connections. A second set of 4000 prefixes oscillates
between the vmx1[ge-0/0/6.14] <-> [ge-0.0.6.14]PC and
vmx1[ge-0/0/7.24] <-> [ge-0.0.7.24]PC connections, a third set of 4000
prefixes between the vmx1[ge-0/0/6.15] <-> [ge-0.0.6.15]PC and
vmx1[ge-0/0/7.25] <-> [ge-0.0.7.25]PC connections, and so on. CPU
usage of rpd fluctuates between 20% and 90% under these conditions.
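
(One way to produce this kind of oscillation on the PC side, just as a
rough sketch: keep two BIRD configuration variants that differ only in
which session of each pair exports a given 4000-prefix set, and flip
between them in a loop. The file names below are made up for
illustration.)

  while true; do
      cp /etc/bird/bird-churn-A.conf /etc/bird/bird.conf
      birdc configure
      sleep 10
      cp /etc/bird/bird-churn-B.conf /etc/bird/bird.conf
      birdc configure
      sleep 10
  done
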
Now, when I execute the "ping 10.55.55.1 source 10.55.55.2 count 1000"
command, the summary of the results is the following:

--- 10.55.55.1 ping statistics ---
1000 packets transmitted, 992 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.189/19.572/315.915/36.297 ms

I also plotted the results: https://i.imgur.com/wWTgLfX.png As
expected, the RTT is very unstable and there is even some packet loss
(992 of the 1000 replies made it back). When I repeat the
"ping 10.55.55.1 source 10.55.55.2 count 1000" command without route
churn, so that the CPU usage of rpd stays below 0.5%, the summary of
the results is much more stable:

--- 10.55.55.1 ping statistics ---
1000 packets transmitted, 1000 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.193/2.259/85.518/9.258 ms

Plotted results are here: https://i.imgur.com/bFM5zHj.png
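
(In case anyone wants to reproduce the plots: the per-packet RTT values
can be extracted from a saved copy of the ping output with something
like the following; the log file name is just an example.)

  awk -F'time=' '/icmp_seq=/ { print $2 + 0 }' ping-with-churn.log > rtt.dat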

This might give the impression that the delay is directly related to
the rpd process or to the CPU load caused by rpd, but this does not
seem to be true. vMX provides an excellent way to verify this, because
the traffic between the vCP and the vFP goes through a Linux bridge on
the host machine and can be captured there. When I execute the
"ping 10.55.55.1 source 10.55.55.2" command and compare the spacing of
the ICMP "echo request" packets on the forwarding-plane Ethernet
interface facing the RE with their spacing on the server interface that
has the 10.55.55.1 address configured, the first packet capture does
not show the occasional delay while the second packet capture, on the
server, does. For example, see the following five ping results from the
Junos ping utility:

64 bytes from 10.55.55.1: icmp_seq=1 ttl=64 time=0.459 ms
64 bytes from 10.55.55.1: icmp_seq=2 ttl=64 time=0.434 ms
64 bytes from 10.55.55.1: icmp_seq=3 ttl=64 time=51.631 ms
64 bytes from 10.55.55.1: icmp_seq=4 ttl=64 time=0.622 ms
64 bytes from 10.55.55.1: icmp_seq=5 ttl=64 time=0.294 ms

The inter-packet delays measured on the forwarding-plane Ethernet port
facing the RE are stable:

 00:00:01.008455 52:54:00:1c:7d:e6 > 52:54:00:17:a4:ae, ethertype IPv4
(0x0800), length 158: 128.0.0.1 > 128.0.0.16:  ip-proto-84 124
 00:00:01.001053 52:54:00:1c:7d:e6 > 52:54:00:17:a4:ae, ethertype IPv4
(0x0800), length 158: 128.0.0.1 > 128.0.0.16:  ip-proto-84 124
 00:00:01.006539 52:54:00:1c:7d:e6 > 52:54:00:17:a4:ae, ethertype IPv4
(0x0800), length 158: 128.0.0.1 > 128.0.0.16:  ip-proto-84 124
 00:00:01.000228 52:54:00:1c:7d:e6 > 52:54:00:17:a4:ae, ethertype IPv4
(0x0800), length 158: 128.0.0.1 > 128.0.0.16:  ip-proto-84 124
 00:00:01.004250 52:54:00:1c:7d:e6 > 52:54:00:17:a4:ae, ethertype IPv4
(0x0800), length 158: 128.0.0.1 > 128.0.0.16:  ip-proto-84 124

I captured those with the "tcpdump -ttt -nei vfp-int-vmx1 'ip proto 84 &&
src host 128.0.0.1 && ip[80:2] = 0x0800'" command. IP protocol 84, used
in the packet filter, is the Juniper proprietary Trivial Tunneling
Protocol (TTP), which encapsulates external control traffic between the
RE and the line-card microkernel. IPv4 address 128.0.0.1 is configured
on em1.0 in the RE. "ip[80:2] = 0x0800" matches the ICMP type and code
bytes of the "echo request" message inside the TTP packet, i.e. the
type needs to be 0x08 (Echo Request) and the code needs to be 0x00.
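
To make the offset 80 in that filter easier to follow, it can be broken
down using the lengths in the captures (the 40 bytes of TTP
encapsulation are inferred from those lengths, since the 124-byte
proto-84 payload carries an 84-byte inner IPv4 datagram):

    20 bytes  outer IPv4 header (128.0.0.1 > 128.0.0.16, proto 84)
  + 40 bytes  TTP encapsulation (TTP header plus any inner L2 header)
  + 20 bytes  inner IPv4 header (10.55.55.2 > 10.55.55.1)
  ----------
    80        so ip[80:2] lands on the ICMP type and code bytes, which
              must be 0x08 0x00 for an echo request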


However, the same five packets captured on the ge-0.0.9-vmx1 interface
(the one that has 10.55.55.1 configured) can be seen below:

 00:00:01.008451 02:06:0a:0e:ff:f9 > fe:06:0a:0e:ff:f9, ethertype IPv4
(0x0800), length 98: 10.55.55.2 > 10.55.55.1: ICMP echo request, id
55340, seq 1, length 64
 00:00:01.001077 02:06:0a:0e:ff:f9 > fe:06:0a:0e:ff:f9, ethertype IPv4
(0x0800), length 98: 10.55.55.2 > 10.55.55.1: ICMP echo request, id
55340, seq 2, length 64
 00:00:01.017782 02:06:0a:0e:ff:f9 > fe:06:0a:0e:ff:f9, ethertype IPv4
(0x0800), length 98: 10.55.55.2 > 10.55.55.1: ICMP echo request, id
55340, seq 3, length 64
 00:00:00.989049 02:06:0a:0e:ff:f9 > fe:06:0a:0e:ff:f9, ethertype IPv4
(0x0800), length 98: 10.55.55.2 > 10.55.55.1: ICMP echo request, id
55340, seq 4, length 64
 00:00:01.004109 02:06:0a:0e:ff:f9 > fe:06:0a:0e:ff:f9, ethertype IPv4
(0x0800), length 98: 10.55.55.2 > 10.55.55.1: ICMP echo request, id
55340, seq 5, length 64

The third packet arrives later than normal and thus the fourth packet
arrives earlier than normal, i.e. exactly the same behavior
demonstrated in my initial e-mail. This packet capture was done with
the "tcpdump -ttt -nei ge-0.0.9-vmx1 'icmp[icmptype] == 8'" command. In
short, it seems certain that the ICMP "echo request" messages arrive at
the line-card microkernel without the delay.

As already said by Saku, one logical explanation for this delay is the
line-card microkernel. As written by David Roy, the microkernel
processes TTP packets in software (the "TTP Transmit" thread seen in
the output of "sh threads" on the PFE CLI), and if the microkernel is
busy with, for example, FIB updates, then decapsulating the TTP traffic
and forwarding the payload to the MQ chip (or a similar ASIC in the
PFE) could indeed be affected. On the other hand, the overall LC CPU
usage according to "sh linux cpu usage" did not exceed 80% even during
the route churn, but I do not actually know what exactly this
utilization measures.
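
(For anyone who wants to poke at the same things: the PFE CLI mentioned
above can be reached on vMX with the usual vty access, e.g.:

  user@vmx1> start shell pfe network fpc0
  (vty)# show threads

The prompt and the exact output vary by release.)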

Please let me know if anything was unclear or if someone has other
ideas or theories.


Martin

