[j-nsp] Interconnecting spines in spine & leaf networks [ was Re: Opinions on fusion provider edge ]
Hugo Slabbert
hugo at slabnet.com
Wed Nov 14 23:51:09 EST 2018
>This was all while talking about a data center redesign that we are
>working on currently. Replacing ToR VC EX4550’s connected LAG to ASR9K
>with new dual QFX5120 leaf to single MX960, dual MPC7E-MRATE
>
>I think we will connect each QFX to each mpc7e card. Is it best practice to not interconnect directly between the two QFX’s ? If so why not.
Glib answer: because then it's not spine & leaf anymore ;)
Less glib answer:
1. it's not needed and is suboptimal
Going with a basic 3-stage (2 layer) spine & leaf, each leaf is connected
to each spine. Connectivity between any two leafs is via any spine to
which they are both connected. Suppose you have 2 spines, spine1 and
spine2, and, say, 10 leaf switches. If a given leaf loses its connection to
spine1, it would then just reach all other leafs via spine2.
If you add a connection between the two spines, you do create an alternate
path, but it's not an equal-cost or optimal one. Going by simple least
hops / shortest path, if leaf1's connection to spine1 is lost, in theory
leaf2 could reach leaf1 via:
leaf2 -> spine1 -> spine2 -> leaf1
...but that would be a longer path than just going via the remaining:
leaf2 -> spine2 -> leaf1
...path. You could force it through the longer path, but why?
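If it helps to see the hop-count argument concretely, here's a throwaway
Python sketch; the topology and names are made up for illustration, not
pulled from any real fabric:

from collections import deque

def hops(adj, src, dst):
    # plain BFS shortest path, counting hops
    seen, q = {src}, deque([(src, 0)])
    while q:
        node, dist = q.popleft()
        if node == dst:
            return dist
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                q.append((nbr, dist + 1))
    return None  # unreachable

spines = ["spine1", "spine2"]
leafs = ["leaf%d" % i for i in range(1, 11)]
adj = {n: set() for n in spines + leafs}
for l in leafs:
    for s in spines:
        adj[l].add(s)
        adj[s].add(l)

# fail the leaf1<->spine1 link
adj["leaf1"].discard("spine1")
adj["spine1"].discard("leaf1")
print(hops(adj, "leaf2", "leaf1"))  # 2: leaf2 -> spine2 -> leaf1

# now add a spine1<->spine2 cross-connect
adj["spine1"].add("spine2")
adj["spine2"].add("spine1")
print(hops(adj, "leaf2", "leaf1"))  # still 2; going via the xconnect would be 3 hops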
2. What's your oversub?
The pitch on spine & leaf networks is generally their high bandwidth, high
availability (lots of links), and low oversubscription ratios. For the
sake of illustration let's go away from chassis gear for spines to a
simpler option like, say, 32x100G Tomahawk spines. The spines there have
capacity to connect 32x leaf switches at line rate. Whatever connections
the leaf switches have to the spines do not have any further oversub
imposed within the spine layer.
Now you interconnect your spines. How many of those 32x 100G ports are you
going to dedicate to spine interconnect? 2 links? If so, you've now
dropped the capacity for 2x more leafs in your fabric (and however many
compute nodes they were going to connect), and you're also only providing
200G interconnect between spines for 3 Tbps of leaf connection capacity.
Even if you ignore the less optimal path thing from above and try to
intentionally force a fallback path on spine:leaf link failure to traverse
your spine xconnect, you can impose up to 15:1 oversub in that scenario.
Or you could kill the oversub and carve out 16x of your 32x spine ports for
spine interconnects. But now you've shrunk your fabric significantly (can
only support 16 leaf switches)...and you've done so unnecessarily because
the redundancy model is for leafs to use their uplinks through spines
directly rather than using inter-spine links.
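For what it's worth, the port math above is easy to sanity-check with a few
lines of Python (this assumes 100G leaf uplinks and that all failover
traffic has to cross the spine xconnect; purely illustrative):

def spine_math(total_ports=32, xconnect_ports=0, port_gbps=100):
    # leaf-facing capacity vs. xconnect capacity on a single spine
    leaf_ports = total_ports - xconnect_ports
    leaf_gbps = leaf_ports * port_gbps
    xconnect_gbps = xconnect_ports * port_gbps
    oversub = leaf_gbps / xconnect_gbps if xconnect_gbps else None
    return leaf_ports, leaf_gbps, xconnect_gbps, oversub

print(spine_math(32, 2))   # (30, 3000, 200, 15.0) -> 30 leafs, 15:1 worst case
print(spine_math(32, 16))  # (16, 1600, 1600, 1.0) -> 1:1, but only 16 leafs left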
3. >2 spines
What if leaf1 loses its connection to spine2 and leafx loses its
connection to spine1? Have we not created a reachability problem?
 spine1         spine2
   /                \
  /                  \
leaf1                 leafx
Why, yes we have. The design solution here is either >1 link between each
leaf & spine (cheating; blergh) or a greater number of spines. What's your
redundancy factor? Augment the above to 4x spines and you've significantly
shrunk your risk of creating connectivity islands.
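Here's a small sketch of that failure case if you want to play with the
spine count; the link tuples are hypothetical, just enough to show the
island effect:

def still_reachable(n_spines, failed_links):
    # leaf1 and leafx can still talk if at least one spine has a live
    # link to both of them
    common = [s for s in range(1, n_spines + 1)
              if ("leaf1", s) not in failed_links
              and ("leafx", s) not in failed_links]
    return len(common) > 0

# leaf1 loses spine2, leafx loses spine1 (the picture above)
failed = {("leaf1", 2), ("leafx", 1)}
print(still_reachable(2, failed))  # False -> connectivity islands
print(still_reachable(4, failed))  # True  -> spine3/spine4 still connect both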
But if you've designed for interconnecting your spines, what do you do for
interconnecting 4x spines? What about if you reach 6x spines? Again: the
model is that resilience is achieved at the leaf:spine interconnectivity
rather than at the "top of the tree" as you would have in a standard
hierarchical, 3-tier-type setup.
--
Hugo Slabbert | email, xmpp/jabber: hugo at slabnet.com
pgp key: B178313E | also on Signal
On Tue 2018-Nov-06 12:38:22 -0600, Aaron1 <aaron1 at gvtc.com> wrote:
>This is a timely topic for me as I just got off a con-call yesterday with my Juniper SE and an SP specialist...
>
>They also recommended EVPN as the way ahead in place of things like fusion. They even somewhat shy away from MC-lag
>
>This was all while talking about a data center redesign that we are working on currently. Replacing ToR VC EX4550’s connected LAG to ASR9K with new dual QFX5120 leaf to single MX960, dual MPC7E-MRATE
>
>I think we will connect each QFX to each mpc7e card. Is it best practice to not interconnect directly between the two QFX’s ? If so why not.
>
>(please forgive, don’t mean to hijack thread, just some good topics going on here)
>
>Aaron