[j-nsp] Hash algorithms for LAG
Pavel Lunin
plunin at senetsy.ru
Fri Jan 20 11:48:40 EST 2012
> This seems to be a horrible complexe topic, with
> much sensible information behind - the exact algorithm
> seems to be much of a secret.
[…]
> Am I completely wrong and there is much more magic
> behind? Has somebody here an deep insight and might
> share it with us?
It depends on a platform. For M/MX/T the hashing algorithm is considered
to be a kind of business secret (said to be patented, etc). For EX it's
more or less widely known. I've been provided with it several times
including one time by JTAC as a reply to a question, which had nothing
to do with this topic. Not sure if it's covered with NDA, so I'd rather
not quote it here in details, though there is nothing very special. 4
lower bits of src-IP XORed with 4 lower bits of dst-IP XORed with 4
lower bits of this XORed with 4 lower bits of that (TCP/UDP ports and
some other fields, which are, AFAIR, listed in the EX docs) and so on
several times. For IP traffic (doesn't matter routed or bridged) bits
from MACs are never included, for non-IP (MPLS as well) they play the
main role (this is documented), which can bring some surprises if you
don't care (see Mark's report on suffering from this). Resulting hash
value MOD number of next-hops defines the outgoing one. I think, if you
ask your SE or JTAC, they will provide you with the exact formula for EX.
What about the M/MX/T algorithm, the main differences are:
— uses more bits from different headers,
— is configurable (AFAIR, either per chassis or per FPC) in contrast to
hard-coded in EX,
— quite significantly differs for PFE families e. g. I-chip, Trio.
— for Trio it's configured in a different knob, but last time I checked,
the docs were not aware of this.
— for ECMP it can do asymmetric LB signaled with BW communities.
Next-hops are given weights (w1, w2, … wN) scaled to [0 ; V_max], where
V_max is the maximum possible hash value. The resulting hash value V is
compared to these weights in a cycle from lowest to highest. If V<wX,
X-th NH is chosen for forwarding. This is comparatively resource
consuming (several CMPs for each packet instead of a single MOD), so
requires some magic from the network processor. On EX you can also
configure the BW community tagging for ECMP paths, but the LB will still
be equal.
As far as I understand, the main goal of the hashing algorithm (what is
worth to be patented) is the ability to produce as much close to
uniformly distributed semi-random hash values for any set of input
variables in order to make LB smoother. This can be really not that easy
for flexibly configurable input fields, though I have no idea how much
magic it requires in practice. Other difficult tasks are (though there
were some objections in one of the previous threads) the flexibility of
configurable conditional (e. g. if it's IP use this, if it's MPLS, use
that) extraction of deep bits from nested headers, the length and number
of fields needed to be pushed to ALU and processed by hashing algorithm,
and the consequent performance constrains.
P. S. LB hashing is performed on ingress PFE.
--
Regards,
Pavel
More information about the juniper-nsp
mailing list