[j-nsp] Hash algorithms for LAG

Fri Jan 20 11:48:40 EST 2012

> This seems to be a horrible complexe topic, with
> much sensible information behind - the exact algorithm
> seems to be much of a secret.
[…]
> Am I completely wrong and there is much more magic
> behind? Has somebody here an deep insight and might
> share it with us?

It depends on a platform. For M/MX/T the hashing algorithm is considered 
to be a kind of business secret (said to be patented, etc). For EX it's 
more or less widely known. I've been provided with it several times 
including one time by JTAC as a reply to a question, which had nothing 
to do with this topic. Not sure if it's covered with NDA, so I'd rather 
not quote it here in details, though there is nothing very special. 4 
lower bits of src-IP XORed with 4 lower bits of dst-IP XORed with 4 
lower bits of this XORed with 4 lower bits of that (TCP/UDP ports and 
some other fields, which are, AFAIR, listed in the EX docs) and so on 
several times. For IP traffic (doesn't matter routed or bridged) bits 
from MACs are never included, for non-IP (MPLS as well) they play the 
main role (this is documented), which can bring some surprises if you 
don't care (see Mark's report on suffering from this). Resulting hash 
value MOD number of next-hops defines the outgoing one. I think, if you 
ask your SE or JTAC, they will provide you with the exact formula for EX.

What about the M/MX/T algorithm, the main differences are:
— uses more bits from different headers,
— is configurable (AFAIR, either per chassis or per FPC) in contrast to 
hard-coded in EX,
— quite significantly differs for PFE families e. g. I-chip, Trio.
— for Trio it's configured in a different knob, but last time I checked, 
the docs were not aware of this.
— for ECMP it can do asymmetric LB signaled with BW communities. 
Next-hops are given weights (w1, w2, … wN) scaled to [0 ; V_max], where 
V_max is the maximum possible hash value. The resulting hash value V is 
compared to these weights in a cycle from lowest to highest. If V<wX, 
X-th NH is chosen for forwarding. This is comparatively resource 
consuming (several CMPs for each packet instead of a single MOD), so 
requires some magic from the network processor. On EX you can also 
configure the BW community tagging for ECMP paths, but the LB will still 
be equal.

As far as I understand, the main goal of the hashing algorithm (what is 
worth to be patented) is the ability to produce as much close to 
uniformly distributed semi-random hash values for any set of input 
variables in order to make LB smoother. This can be really not that easy 
for flexibly configurable input fields, though I have no idea how much 
magic it requires in practice. Other difficult tasks are (though there 
were some objections in one of the previous threads) the flexibility of 
configurable conditional (e. g. if it's IP use this, if it's MPLS, use 
that) extraction of deep bits from nested headers, the length and number 
of fields needed to be pushed to ALU and processed by hashing algorithm, 
and the consequent performance constrains.

P. S. LB hashing is performed on ingress PFE.

--
Regards,
Pavel