[j-nsp] LAG/ECMP hash performance

James Bensley jwbensley+juniper-nsp at gmail.com
Thu Aug 29 04:51:46 EDT 2019


On Wed, 28 Aug 2019 at 08:21, Saku Ytti <saku at ytti.fi> wrote:
> I've had two issues where I cannot explain why there is imbalance. One
> in MX2020 another in PTX. I can't find any elephant flows in netflow,
> but I can find traffic grouped around with modest amount of IP address
> entropy (like 20-32 SADDR + 20-32 DADDR + 1 SPORT + RND DPORT). My
> understanding is, that just that RND DPORT should guarantee fair
> balancing, in absence of elephant flows and when flow count is
> sufficient.

Hi Saku,

Hmm, interesting, but has anyone confirmed to you that these devices
use a CRC32 for the hashing, or are you trying to reverse engineer
this? Is there any reason why this couldn't just be a dodgy Juniper
proprietary hash algo? I'm just playing devil's advocate here.

> I did trivial lab test on MX2020, which I'll post at the end of the
> email, which appears (not controlled enough to say for sure) to
> support that hashing is less than ideal.

I had the same idea to break out the lab Ixia but I haven't had time yet...

> Do you think that with other parameters it would achieve better
> diffusion quality?

Different parameters may or may not change the diffusion quality
itself, but they can change the range over which results are spread,
i.e. perfect diffusion over 2^2 outcomes vs. perfect diffusion over
2^6 outcomes.
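To put rough numbers on that (purely a sketch with a generic CRC32,
not whatever Trio actually does, and with made-up flow values): fold
the same hash over 2^2 vs 2^6 buckets for a synthetic flow mix and
compare how even the load looks, e.g. in Python:

import struct
import zlib
from collections import Counter

def crc32_bucket(saddr, daddr, sport, dport, buckets):
    # Generic CRC32 over the IPs and ports, folded into N buckets.
    key = struct.pack("!IIHH", saddr, daddr, sport, dport)
    return zlib.crc32(key) % buckets

# 32 SADDRs x 32 DADDRs x 8 DPORTs of synthetic flows (values made up)
flows = [(0x17000014 + s, 0x9D000014 + d, 80, 2074 + p)
         for s in range(32) for d in range(32) for p in range(8)]

for buckets in (4, 64):
    counts = Counter(crc32_bucket(*f, buckets) for f in flows)
    ratio = max(counts.values()) / min(counts.values())
    print(f"{buckets:>2} buckets: max/min load ratio {ratio:.2f}")

Same hash, same flows; the wider output range just spreads the same
amount of entropy more thinly, so the relative imbalance grows.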

Also, ASR9Ks use a CRC32 on Typhoon cards, but not over the whole
frame: "Post IOS-XR 4.2.0, Typhoon NPUs use a CRC based calculation of
the L3/L4 info and compute a 32 bit hash value." So your results below
should, in theory, have good diffusion if this were an ASR9K (although
I'm sure that's not the case in reality). Is the Juniper feeding (1)
the whole frame into the CRC function, (2) all the headers but no
payload, or (3) just specific header fields (S/D MAC/IP/Port/Intf)?

In the worst-case scenario (1), with 4 bytes of CRC output
representing an entire frame, there will be a huge number of hash
collisions: even a minimum-size frame (6-byte SRC MAC + 6-byte DST MAC
+ 2-byte EtherType + 46-byte payload = 60 bytes) means 2^480 possible
Ethernet frames being mapped onto 2^32 CRC values. So to me that says
randomly twiddling the DPORT value wouldn't necessarily get great
diffusion, because this isn't a perfect hashing scenario and there are
many collisions (I also don't know how random that Ixia RNG is).

In the best-case scenario (3), where just the required header fields
feed the CRC, the combined input size still greatly exceeds the 32
bits of output.
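As a baseline for scenario (3), here's a rough offline check in Python
of what a plain CRC32 over just the IPs and ports would do with
roughly your flow mix; this is not claiming Trio uses a plain CRC32,
and the A/B/C/D octets are placeholders:

import random
import struct
import zlib
from collections import Counter

random.seed(1)
MEMBERS = 3  # 3*10GE LAG as in the test below

counts = Counter()
for _ in range(1_000_000):
    saddr = 0x17000014 + random.randrange(27)  # 23.A.B.20 .. 23.A.B.46
    daddr = 0x9D000014 + random.randrange(16)  # 157.C.D.20 .. 157.C.D.35
    sport = 80
    dport = random.randrange(2074, 65471)      # random DPORT, as in the Ixia test
    key = struct.pack("!IIHH", saddr, daddr, sport, dport)
    counts[zlib.crc32(key) % MEMBERS] += 1

for member, n in sorted(counts.items()):
    print(f"member {member}: {n} packets")

A plain CRC32 mod N should come out essentially even with that input,
so if the real box doesn't, the interesting part is whatever happens
around the CRC (key construction, which output bits get used, table
folding) rather than the DPORT entropy itself.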

> SRC: (single 100GE interface, single unit 0)
>   23.A.B.20 .. 23.A.B.46
>   TCP/80
> DST: (N*10GE LACP)
>   157.C.D.20 .. 157.C.D.35
>   TCP 2074..65470 (RANDOM, this alone, everything else static, should
> have guaranteed fair balancing)
>
> I'm running this through IXIA and my results are:
>
> 3*10GE Egress:
>   port1 10766516pps
>   port2 10766543pps
>   port3  7536578pps
> after (set forwarding-options enhanced-hash-key family inet
> incoming-interface-index)
>   port1 9689881pps
>   port2 11791986pps
>   port3 5383270pps
> after removing s-int-index and setting adaptive
>   port1 9689889pps
>   port2 9689892pps
>   port3 9689884pps
>
> I think this supports that the hash function diffuses poorly. It
> should be noted that the 2nd step adds entirely _static_ bits to the input
> of the hash; the source interface does not change, and it's perfectly
> repeatable. This is to be expected: the bits most affected by the
> weakness shift, making the problem either worse or better.
> I.e. flows are 100% perfectly hashable, but not without biasing the
> hash results. There aren't any elephants.
>
>
> 4*10GE Egress:
>   port1 4306757pps
>   port2 8612807pps
>   port3 9689893pps
>   port4 6459931pps
> after adding incoming-interface-index
>   port1 6459922pps
>   port2 8613236pps
>   port3 9691485pps
>   port4 4306620pps
> after removing s-index and adding adaptive:
>   port1 7536562pps
>   port2 7536593pps
>   port3 6459928pps
>   port4 7536566pps
> after removing adaptive and adding no-destination-port + no-source-port
>   port1: 5383279pps
>   port2: 9689886pps
>   port3: 7536588pps
>   port4: 6459922pps
> after removing no-source-port (i.e. destination port is used for hash)
>   port1: 8613235pps
>   port2: 5383272pps
>   port3: 5383274pps
>   port4: 9689884pps
>
> It is curious that it actually balances more fairly, without using TCP
> ports at all! Even though there is _tons_ of entropy there due to
> random DPORT.

This is interesting.

In the past I have simply generated a flow like SRC MAC
11:11:11:11:11:11, DST MAC 11:11:11:11:11:11, SRC IP 1.1.1.1, DST IP
1.1.1.1, SPORT 1, DPORT 1, sent the traffic, noted which egress port
was chosen from the LAG, incremented one value (e.g. DST IP) by 1,
sent the traffic again, noted the egress port, and repeated; you can
sometimes "get a feel" for the hashing being used that way. It's very
laborious but it might give some more insight into what's going on
here.
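If you wanted to automate the bookkeeping for that, something like the
following Python sketch works: record the observed egress member per
probe flow and check it against a candidate hash (a plain CRC32 here,
purely as an example guess, not what Juniper actually does). The
observed values below are placeholders to be filled in from the lab:

import struct
import zlib

MEMBERS = 3  # hypothetical LAG size

def candidate_bucket(saddr, daddr, sport, dport):
    # Candidate hash to test against: plain CRC32 over IPs and ports.
    key = struct.pack("!IIHH", saddr, daddr, sport, dport)
    return zlib.crc32(key) % MEMBERS

# ((SADDR, DADDR, SPORT, DPORT), observed egress member). Each probe
# increments one field; the observed member comes from reading the LAG
# member counters after sending that probe.
probes = [
    ((0x01010101, 0x01010101, 1, 1), 0),
    ((0x01010101, 0x01010102, 1, 1), 2),
    ((0x01010101, 0x01010103, 1, 1), 1),
    ((0x01010101, 0x01010104, 1, 1), 0),
]

matches = sum(candidate_bucket(*flow) == seen for flow, seen in probes)
print(f"candidate hash matched {matches}/{len(probes)} observed placements")

Swap in other candidate hashes (different field sets, XOR folding,
different polynomials) and see which one tracks the observed
placements best.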

Cheers,
James.

