[j-nsp] Interconnecting spines in spine & leaf networks [ was Re: Opinions on fusion provider edge ]
Hugo Slabbert
hugo at slabnet.com
Wed Nov 14 23:51:09 EST 2018
>This was all while talking about a data center redesign that we are
>working on currently. Replacing ToR VC EX4550’s connected LAG to ASR9K
>with new dual QFX5120 leaf to single MX960, dual MPC7E-MRATE
>
>I think we will connect each QFX to each mpc7e card. Is it best practice to not interconnect directly between the two QFX’s ? If so why not.
Glib answer: because then it's not spine & leaf anymore ;)
Less glib answer:
1. it's not needed and is suboptimal
Going with a basic 3-stage (2 layer) spine & leaf, each leaf is connected
to each spine. Connectivity between any two leafs is via any spine to
which they are both connected. Suppose you have 2 spines, spine1 and
spine2, and, say, 10 leaf switches. If a given leaf loses its connection to
spine1, it would then just reach all other leafs via spine2.
If you add a connection between the two spines, you do create an alternate
path, but it's not an equal-cost or optimal one. Going by simple least
hops / shortest path, if leaf1's connection to spine1 is lost, in theory
leaf2 could reach leaf1 via:
leaf2 -> spine1 -> spine2 -> leaf1
...but that would be a longer path than just going via the remaining:
leaf2 -> spine2 -> leaf1
...path. You could force it through the longer path, but why?
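If it helps to see the hop-count argument concretely, here's a throwaway
Python sketch; the topology and names are made up for illustration, not
pulled from any real fabric:

from collections import deque

def hops(adj, src, dst):
    # plain BFS shortest path, counting hops
    seen, q = {src}, deque([(src, 0)])
    while q:
        node, dist = q.popleft()
        if node == dst:
            return dist
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                q.append((nbr, dist + 1))
    return None  # unreachable

spines = ["spine1", "spine2"]
leafs = ["leaf%d" % i for i in range(1, 11)]
adj = {n: set() for n in spines + leafs}
for l in leafs:
    for s in spines:
        adj[l].add(s)
        adj[s].add(l)

# fail the leaf1<->spine1 link
adj["leaf1"].discard("spine1")
adj["spine1"].discard("leaf1")
print(hops(adj, "leaf2", "leaf1"))  # 2: leaf2 -> spine2 -> leaf1

# now add a spine1<->spine2 cross-connect
adj["spine1"].add("spine2")
adj["spine2"].add("spine1")
print(hops(adj, "leaf2", "leaf1"))  # still 2; going via the xconnect would be 3 hops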
2. What's your oversub?
The pitch on spine & leaf networks is generally their high bandwidth, high
availability (lots of links), and low oversubscription ratios. For the
sake of illustration let's go away from chassis gear for spines to a
simpler option like, say, 32x100G Tomahawk spines. The spines there have
capacity to connect 32x leaf switches at line rate. Whatever connections
the leaf switches have to the spines do not have any further oversub
imposed within the spine layer.
Now you interconnect your spines. How many of those 32x 100G ports are you
going to dedicate to spine interconnect? 2 links? If so, you've now
dropped the capacity for 2x more leafs in your fabric (and however many
compute nodes they were going to connect), and you're also only providing
200G interconnect between spines for 3 Tbps of leaf connection capacity.
Even if you ignore the less optimal path thing from above and try to
intentionally force a fallback path on spine:leaf link failure to traverse
your spine xconnect, you can impose up to 15:1 oversub in that scenario.
Or you could kill the oversub and carve out 16x of your 32x spine ports for
spine interconnects. But now you've shrunk your fabric significantly (can
only support 16 leaf switches)...and you've done so unnecessarily because
the redundancy model is for leafs to use their uplinks through spines
directly rather than using inter-spine links.
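For what it's worth, the port math above is easy to sanity-check with a few
lines of Python (this assumes 100G leaf uplinks and that all failover
traffic has to cross the spine xconnect; purely illustrative):

def spine_math(total_ports=32, xconnect_ports=0, port_gbps=100):
    # leaf-facing capacity vs. xconnect capacity on a single spine
    leaf_ports = total_ports - xconnect_ports
    leaf_gbps = leaf_ports * port_gbps
    xconnect_gbps = xconnect_ports * port_gbps
    oversub = leaf_gbps / xconnect_gbps if xconnect_gbps else None
    return leaf_ports, leaf_gbps, xconnect_gbps, oversub

print(spine_math(32, 2))   # (30, 3000, 200, 15.0) -> 30 leafs, 15:1 worst case
print(spine_math(32, 16))  # (16, 1600, 1600, 1.0) -> 1:1, but only 16 leafs left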
3. >2 spines
What if leaf1 loses its connection to spine2 and leafx loses its
connection to spine1? Have we not created a reachability problem?
 spine1         spine2
   /                \
  /                  \
leaf1                 leafx
Why, yes we have. The design solution here is either >1 link between each
leaf & spine (cheating; blergh) or a greater number of spines. What's your
redundancy factor? Augment the above to 4x spines and you've significantly
shrunk your risk of creating connectivity islands.
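Here's a small sketch of that failure case if you want to play with the
spine count; the link tuples are hypothetical, just enough to show the
island effect:

def still_reachable(n_spines, failed_links):
    # leaf1 and leafx can still talk if at least one spine has a live
    # link to both of them
    common = [s for s in range(1, n_spines + 1)
              if ("leaf1", s) not in failed_links
              and ("leafx", s) not in failed_links]
    return len(common) > 0

# leaf1 loses spine2, leafx loses spine1 (the picture above)
failed = {("leaf1", 2), ("leafx", 1)}
print(still_reachable(2, failed))  # False -> connectivity islands
print(still_reachable(4, failed))  # True  -> spine3/spine4 still connect both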
But if you've designed for interconnecting your spines, what do you do for
interconnecting 4x spines? What about if you reach 6x spines? Again: the
model is that resilience is achieved at the leaf:spine interconnectivity
rather than at the "top of the tree" as you would have in a standard
hierarchical, 3-tier-type setup.
--
Hugo Slabbert | email, xmpp/jabber: hugo at slabnet.com
pgp key: B178313E | also on Signal
On Tue 2018-Nov-06 12:38:22 -0600, Aaron1 <aaron1 at gvtc.com> wrote:
>This is a timely topic for me as I just got off a con-call yesterday with my Juniper SE and an SP specialist...
>
>They also recommended EVPN as the way ahead in place of things like fusion. They even somewhat shy away from MC-lag
>
>This was all while talking about a data center redesign that we are working on currently. Replacing ToR VC EX4550’s connected LAG to ASR9K with new dual QFX5120 leaf to single MX960, dual MPC7E-MRATE
>
>I think we will connect each QFX to each mpc7e card. Is it best practice to not interconnect directly between the two QFX’s ? If so why not.
>
>(please forgive, don’t mean to hijack thread, just some good topics going on here)
>
>Aaron