[j-nsp] LAG/ECMP hash performance

Wed Aug 28 02:52:29 EDT 2019

On Sat, 24 Aug 2019 at 10:06, Saku Ytti <saku at ytti.fi> wrote:

Hi Saku,

> Has anyone ran into a set of flows where ostensibly you have enough
> entropy to balance fairly, but you end up seeing significant imbalance
> anyhow? Can you share the story? What platform? How did you
> troubleshoot? How did you fix?

No. Out of curiosity, have you, which is what lead you to post this?
If yes, what platform?

> It looks like many/most vendors are still using CRC for LAG/ECMP,
> which historically makes sense, as you could piggyback on EthernetFCS
> transistors for 0cost implementation. Today likely the transistors are
> different anyhow as PHY and lookup engine are too separate, so CRC may
> not be very good choice for the problem.

Yeah I more or less agree. It's a bit computationally expensive if the
lookup engine is not something "modern" (i.e. a typical modern Intel
x86_64 chip) with a native CRC32 instruction. In this case of say an
Intel chip (or any ASIC with CRC32 built-in) generating a CRC32 sum
for load-balancing wouldn't be much of an overhead. But even with a
native CRC32 instruction it seems like overkill. If "speed is
everything" a CRC32 instruction might not complete in a single CPU
cycle so other methods could be faster, especially given that most
people don't need 32bits of entropy produced by CRC32 (as in they
don't have 2^32 links in a single LAG bundle or that many ECMP
routes).

> If I read this right (thanks David)
> https://github.com/rurban/smhasher/blob/master/doc/crc32 - CRC32
> appears to have less than perfect 'diffusion' quality, which would
> communicate that there are scenarios where poor balancing is by design
> and where another hash implementation with good diffusion quality
> would balance fairly.

That is my understanding of CRC32 also, although I didn't know it was
being widely used for load-balancing so I had never though of it as an
actual piratical issue. One thing to consider is that not all CRC32
sums are the same, what kind of polynomial is used varies and so $box1
doing CRC32 for load-balancing might produce different results to
$box2, if they use different polynomials. I have recorded some common
ones here: https://null.53bits.co.uk/index.php?page=crc-and-checksum-error-detection#polynomial

It looks like the standard IEEE 802.3 value 0x04C11DB7 is being used
for these tests, here
https://github.com/jwbensley/Ethernet-CRC32/blob/master/crc32.c

Other polys are used though, e.g. for larger packets. When using jumbo
frames and stretching the amount of data the CRC has to protect
against with the same sized sum (32 bits) other polynomials can be
more effective. It's probably a safe bet that most implementations
that use CRC32 for hashing use the same standard poly value but I'm
keen to hear more about this.

Cheers,
James.