[j-nsp] Interconnecting spines in spine & leaf networks [ was Re: Opinions on fusion provider edge ]

Aaron1 aaron1 at gvtc.com
Thu Nov 15 08:31:30 EST 2018


Thanks Hugo. What about a leaf-to-leaf connection?  Is that a good idea?

What about Layer 2 loop prevention?

Aaron

On Nov 14, 2018, at 10:51 PM, Hugo Slabbert <hugo at slabnet.com> wrote:

>> This was all while talking about a data center redesign that we are currently working on: replacing ToR VC EX4550s LAG-connected to an ASR9K with new dual QFX5120 leafs connected to a single MX960 with dual MPC7E-MRATE cards.
>> 
>> I think we will connect each QFX to each MPC7E card.  Is it best practice not to interconnect the two QFXs directly?  If so, why not?
> 
> Glib answer: because then it's not spine & leaf anymore ;)
> 
> Less glib answer:
> 
> 1. it's not needed and is suboptimal
> 
> Going with a basic 3-stage (2 layer) spine & leaf, each leaf is connected to each spine.  Connectivity between any two leafs is via any spine to which they are both connected.  Suppose you have 2 spines, spine1 and spine2, and, say, 10 leaf switches. If a given leaf loses its connection to spine1, it would then just reach all other leafs via spine2.
> 
> If you add a connection between the two spines, you do create an alternate path, but it's not an equal-cost or optimal path.  If we're going by simple least hops / shortest path, and leaf1's connection to spine1 is lost, in theory leaf2 could reach leaf1 via:
> 
> leaf2 -> spine1 -> spine2 -> leaf1
> 
> ...but that would be a longer path than just going via the remaining:
> 
> leaf2 -> spine2 -> leaf1
> 
> ...path.  You could force it through the longer path, but why?
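> 
> A minimal sketch to make the hop-count argument concrete (a toy 2-spine / 3-leaf fabric and a plain BFS shortest path; all node names here are made up for illustration):
> 
> from collections import deque
> 
> def shortest_path(adj, src, dst):
>     """Return one shortest path from src to dst, or None if unreachable."""
>     seen, queue = {src}, deque([[src]])
>     while queue:
>         path = queue.popleft()
>         if path[-1] == dst:
>             return path
>         for nxt in adj[path[-1]]:
>             if nxt not in seen:
>                 seen.add(nxt)
>                 queue.append(path + [nxt])
>     return None
> 
> # 2 spines, 3 leafs; leaf1's uplink to spine1 has failed.
> adj = {
>     "spine1": ["leaf2", "leaf3"],                # leaf1 uplink lost
>     "spine2": ["leaf1", "leaf2", "leaf3"],
>     "leaf1":  ["spine2"],
>     "leaf2":  ["spine1", "spine2"],
>     "leaf3":  ["spine1", "spine2"],
> }
> print(shortest_path(adj, "leaf2", "leaf1"))      # ['leaf2', 'spine2', 'leaf1']
> 
> # Adding a spine-spine interconnect creates the leaf2 -> spine1 -> spine2 -> leaf1
> # detour, but it never wins on hop count; the 2-hop path via spine2 still does.
> adj["spine1"].append("spine2")
> adj["spine2"].append("spine1")
> print(shortest_path(adj, "leaf2", "leaf1"))      # still ['leaf2', 'spine2', 'leaf1']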
> 
> 2. What's your oversub?
> 
> The pitch on spine & leaf networks is generally their high bandwidth, high availability (lots of links), and low oversubscription ratios.  For the sake of illustration, let's move away from chassis gear for spines to a simpler option like, say, 32x100G Tomahawk spines.  Those spines have the capacity to connect 32 leaf switches at line rate.  Whatever connections the leaf switches have to the spines, no further oversub is imposed within the spine layer.
> 
> Now you interconnect your spines.  How many of those 32x 100G ports are you going to dedicate to spine interconnect?  2 links?  If so, you've now dropped the capacity for 2x more leafs in your fabric (and however many compute nodes they were going to connect), and you're also only providing 200G interconnect between spines for 3 Tbps of leaf connection capacity.  Even if you ignore the less optimal path thing from above and try to intentionally force a fallback path on spine:leaf link failure to traverse your spine xconnect, you can impose up to 15:1 oversub in that scenario.
> 
> Or you could kill the oversub and carve out 16x of your 32x spine ports for spine interconnects.  But now you've shrunk your fabric significantly (can only support 16 leaf switches)...and you've done so unnecessarily because the redundancy model is for leafs to use their uplinks through spines directly rather than using inter-spine links.
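> 
> The oversub figures above as a quick sketch (this is just the port arithmetic for the illustrative 32x100G spine, worst case where all leaf traffic has to detour over the interconnect; nothing here is vendor-specific):
> 
> def spine_oversub(total_ports=32, port_gbps=100, interconnect_ports=2):
>     """Leaf count and worst-case oversub if leaf traffic must cross the interconnect."""
>     leaf_ports = total_ports - interconnect_ports
>     leaf_capacity = leaf_ports * port_gbps
>     interconnect_capacity = interconnect_ports * port_gbps
>     return leaf_ports, leaf_capacity / interconnect_capacity
> 
> print(spine_oversub(interconnect_ports=2))    # (30, 15.0) -> 30 leafs, up to 15:1
> print(spine_oversub(interconnect_ports=16))   # (16, 1.0)  -> 1:1, but only 16 leafs left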
> 
> 3. >2 spines
> 
> What if leaf1 loses its connection to spine2 and leafx loses its connection to spine1?  Have we not created a reachability problem?
> 
>     spine1     spine2
>    /               \
>  /                  \
> leaf1              leafx
> 
> Why, yes we have.  The design solution here is either >1 links between each leaf & spine (cheating; blergh) or a greater number of spines.  What's your redundancy factor?  Augment the above to 4x spines and you've significantly shrunk your risk of creating connectivity islands.
> 
> But if you've designed around interconnecting your spines, what do you do for interconnecting 4x spines?  What about when you reach 6x spines?  Again: the model is that resilience is achieved at the leaf:spine interconnectivity rather than at the "top of the tree" as you would have in a standard hierarchical, 3-tier-type setup.
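> 
> A tiny sketch of the "connectivity island" check (a toy model where leaf uplinks are indexed by spine number; the failure set mirrors the leaf1/leafx example above):
> 
> from itertools import combinations
> 
> def islands(n_spines, n_leafs, failed):
>     """Return leaf pairs that share no remaining spine. failed = {(leaf, spine), ...}."""
>     up = {l: {s for s in range(n_spines) if (l, s) not in failed}
>           for l in range(n_leafs)}
>     return [(a, b) for a, b in combinations(range(n_leafs), 2)
>             if not (up[a] & up[b])]
> 
> # 2 spines: leaf0 loses spine1, leaf1 loses spine0 -> no common spine left.
> print(islands(2, 2, {(0, 1), (1, 0)}))   # [(0, 1)]
> 
> # 4 spines, same two failures: the two leafs still share two spines.
> print(islands(4, 2, {(0, 1), (1, 0)}))   # []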
> 
> -- 
> Hugo Slabbert       | email, xmpp/jabber: hugo at slabnet.com
> pgp key: B178313E   | also on Signal
> 
>> On Tue 2018-Nov-06 12:38:22 -0600, Aaron1 <aaron1 at gvtc.com> wrote:
>> 
>> This is a timely topic for me as I just got off a con-call yesterday with my Juniper SE and an SP specialist...
>> 
>> They also recommended EVPN as the way ahead in place of things like Fusion.  They even somewhat shy away from MC-LAG.
>> 
>> This was all while talking about a data center redesign that we are currently working on: replacing ToR VC EX4550s LAG-connected to an ASR9K with new dual QFX5120 leafs connected to a single MX960 with dual MPC7E-MRATE cards.
>> 
>> I think we will connect each QFX to each MPC7E card.  Is it best practice not to interconnect the two QFXs directly?  If so, why not?
>> 
>> (please forgive, don’t mean to hijack thread, just some good topics going on here)
>> 
>> Aaron
