[j-nsp] OSPF reference-bandwidth 1T

Thomas Bellman bellman at nsc.liu.se
Tue Jan 22 15:52:58 EST 2019


On 2019-01-22 12:02 MET, Pavel Lunin wrote:

>> (I am myself running a mostly DC network, with a little bit of campus
>> network on the side, and we use bandwidth-based metrics in our OSPF.
>> But we have standardized on using 3 Tbit/s as our "reference bandwidth",
>> and Junos doesn't allow us to set that, so we set explicit metrics.)

> As Adam has already mentioned, DC networks are becoming more and more
> Clos-based, so you basically don't need OSPF at all for this.
> 
> Fabric uplinks, backbone/DCI and legacy still exist, though.  However,
> in the DC we tend to ECMP it all, so you normally don't want to have
> unequal-bandwidth links in parallel.

Our network is roughly spine-and-leaf.  But we have a fairly small net
(two spines, around twenty leafs, split over two computer rooms a couple
of hundred meters apart the way the fiber goes), and it doesn't make
economic sense to make it a perfectly pure folded Clos network.  So,
there are a couple of leaf switches that are just layer 2 with spanning
tree, and the WAN connections to our partner in the neighbouring city
go directly into our spines instead of into "peering leafs".  (The
border routers for our normal Internet connectivity are connected as
leafs to our spines, but they are really our ISP's CPE routers, not
ours.)

Also, the leafs have wildly different bandwidth needs.  Our DNS, email
and web servers don't need as much bandwidth as a 2000-node HPC cluster,
which in turn needs less bandwidth than the storage cluster for LHC
data.  Most leafs have 10G uplinks (one to each spine), but we also have
leafs with 1G and with 40G uplinks.

I don't want a leaf with 1G uplinks becoming a "transit" node for traffic
between two other leafs in (some) failure cases, because an elephant flow
could easily saturate those 1G links.  Thus, I want higher costs for those
links than for the 10G and 40G links.  Of course, the costs don't have to
be exactly <REFERENCE_BW> / <ACTUAL_BW>, but there needs to be some
relation to the bandwidth.
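
For illustration (the ge-/xe-/et- interface names below are made up,
not our actual config), explicit metrics against our 3 Tbit/s reference
work out to 3T/1G = 3000, 3T/10G = 300 and 3T/40G = 75, i.e. something
like:

    set protocols ospf area 0.0.0.0 interface ge-0/0/0.0 metric 3000
    set protocols ospf area 0.0.0.0 interface xe-0/0/1.0 metric 300
    set protocols ospf area 0.0.0.0 interface et-0/0/2.0 metric 75

All of those fit comfortably under the 65535 per-interface metric
limit, which is why explicit metrics still work with a large reference
bandwidth even though the reference-bandwidth knob itself won't go
that high.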

> Workarounds happen: sometimes you have no more 100G ports available and
> need to plug in, let's say, 4x40G "temporarily" in addition to two
> existing 100G which are starting to be saturated. In such a case you'd
> rather consciously decide whether you want to ECMP these 200 Gigs among
> six links (2x100 + 4x40) or use the 40G links as a backup only (which
> might not be the best idea in this scenario).

Right.  I actually have one leaf switch with unequal-bandwidth uplinks.
On one side it uses a 2×10G link aggregate, but on the other side I
could reuse an old InfiniBand AOC cable, giving us a 40G uplink.  In
that case, I have explicitly set the two uplinks to the same cost.
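
In Junos that is just a matter of pinning both sides to the same metric
(again with made-up interface names; 150 = 3T / 20G, treating both
uplinks as roughly 20G worth of capacity):

    set protocols ospf area 0.0.0.0 interface ae0.0 metric 150
    set protocols ospf area 0.0.0.0 interface et-0/0/48.0 metric 150

where ae0 would be the 2×10G aggregate and et-0/0/48 the 40G link.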


	/Bellman, NSC
