[c-nsp] Redundant connections best practices

Mon Apr 25 12:35:43 EDT 2005

On Mon, Apr 25, 2005 at 07:53:17AM -0700, Peter Kranz wrote:
> I am interested in the approaches that carriers are taking with customers
> who request dual Ethernet drops for BW. How is redundancy being achieved
> with the dual drops (BGP? STP? lower-priority routes?)

Our preference is to use BGP.  The advantages are that our ops staff
knows it, it's designed to solve precisely this sort of problem,
filtering is fairly easy, troubleshooting is simple in this particular
configuration, and it runs on a fairly wide range of equipment.
Several BGP attributes can be used to make load-sharing work, and
if your customer is multi-homed they'll generally need BGP anyway.
The potential drawbacks to BGP are that some customers don't know
anything about it and some gear doesn't support it.  For example,
there was a firewall vendor that charged an extra $10k (per box)
if you wanted BGP.  I don't remember who it was, but it was an issue
for us because customers were very reluctant to shell out $20k (for
a redundant pair) when there were other ways of solving the problem.
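
To illustrate the "filtering is fairly easy" point: the check a provider
applies to a customer's BGP announcements boils down to "is this prefix
inside what we assigned them, and is it not too specific?"  Below is a
minimal Python sketch of that logic only; it is not router configuration.
The function name, the /24 length limit, and the prefixes (documentation
address space) are assumptions made up for illustration.

```python
import ipaddress


def accept_announcement(prefix, allowed, max_len=24):
    """Return True if prefix sits inside an allowed block and is not
    more specific than max_len (an assumed, IPv4-oriented policy limit)."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_len:
        return False
    for block in allowed:
        block_net = ipaddress.ip_network(block)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so check versions first.
        if net.version == block_net.version and net.subnet_of(block_net):
            return True
    return False


# Hypothetical customer allocation (RFC 5737 documentation space).
CUSTOMER_BLOCKS = ["198.51.100.0/24"]

for candidate in ["198.51.100.0/24", "198.51.100.128/25",
                  "203.0.113.0/24", "0.0.0.0/0"]:
    print(candidate, accept_announcement(candidate, CUSTOMER_BLOCKS))
```

Only the first candidate is accepted: the /25 is too specific, and the
other two are outside the customer's block.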

This led us to toy with the idea of bidirectional HSRP/VRRP.
However, this has a number of unpleasant failure scenarios, including
split-brain.  For example, consider the following topology:

    PE1-----PS1-----CS1-----CE1
                     |
                     |
    PE2-----PS2-----CS2-----CE2

(PS == provider switch, CS == customer switch)

If the trunk between CS1 and CS2 fails, both sides will continue to
operate independently.  How severe a problem this is depends on how
the rest of the customer's network looks.
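
The split-brain case can be sketched in a few lines of Python.  This is a
toy model, not how HSRP or VRRP is implemented: it assumes the hellos
between PE1 and PE2 (and between CE1 and CE2) have to cross the CS1-CS2
trunk, as drawn above, and it reduces the election to "stay standby only
if you can still hear a higher-priority peer".  Timers, preemption, and
tracking are not modeled, and the function names are made up for
illustration.

```python
from collections import deque

# Links taken from the diagram above.
LINKS = {
    ("PE1", "PS1"), ("PS1", "CS1"), ("CS1", "CE1"),
    ("PE2", "PS2"), ("PS2", "CS2"), ("CS2", "CE2"),
    ("CS1", "CS2"),
}


def reachable(start, links):
    """Return the set of nodes reachable from start over the given links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def active_routers(group, links):
    """Simplified election. group is ordered highest priority first;
    a router goes active unless it can hear a higher-priority peer."""
    active = []
    for i, router in enumerate(group):
        heard = reachable(router, links)
        if not any(peer in heard for peer in group[:i]):
            active.append(router)
    return active


trunk_down = LINKS - {("CS1", "CS2")}
print(active_routers(["PE1", "PE2"], LINKS))       # ['PE1']
print(active_routers(["PE1", "PE2"], trunk_down))  # ['PE1', 'PE2']
print(active_routers(["CE1", "CE2"], trunk_down))  # ['CE1', 'CE2']
```

With the trunk down, both routers in each pair consider themselves
active, which is the split-brain condition described above.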

Furthermore, suppose CS2 fails completely.  PE2 has no way to know
that it can't communicate with the customer, so it will continue
to accept packets destined for the customer's network (assuming the
use of static/connected routes), which means you've got a blackhole
in your network caused by a failure in the customer's network.
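
Here is a rough Python sketch of why the static/connected case blackholes
traffic while an active routing protocol does not.  It is a toy model
under stated assumptions: the class and function names are invented, the
30-second hold time is an arbitrary illustrative value, and "dynamic"
stands in for any protocol that drops a neighbor after missed keepalives.

```python
class Session:
    """Tracks when we last heard a keepalive from the customer router."""

    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.last_heard = 0.0

    def keepalive(self, now):
        self.last_heard = now

    def is_up(self, now):
        return (now - self.last_heard) < self.hold_time


def pe_attracts_traffic(mode, link_up, session, now):
    """Does the PE still carry a route for the customer's network?"""
    if mode == "static":
        # A static/connected route only follows the local interface state.
        return link_up
    if mode == "dynamic":
        # A learned route goes away once the neighbor stops responding.
        return link_up and session.is_up(now)
    raise ValueError(f"unknown mode: {mode}")


HOLD_TIME = 30.0  # seconds; illustrative value only
session = Session(HOLD_TIME)
session.keepalive(now=10.0)  # last keepalive seen before CS2 dies

# CS2 fails at t=10. PE2's own link to PS2 stays up, but CE2 is unreachable.
for t in (20.0, 60.0):
    print(t,
          "static:", pe_attracts_traffic("static", True, session, t),
          "dynamic:", pe_attracts_traffic("dynamic", True, session, t))
```

In this model the static case keeps attracting traffic for as long as
PE2's own port is up, while the dynamic case stops once the hold time
expires.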

My personal opinion is that if there's more than a trivial amount of
L2 network between you and your customers, you want an active routing
protocol.  If BGP isn't an option, RIPv2 and OSPF are available, but
you need to be VERY careful not to cross the streams (leak routes
between the customer-facing instance and your own IGP) if you use the
same protocol internally.  It is probably a good idea to use a protocol
you don't run internally to avoid any chance of cross-pollination.

If you really only care about link redundancy and/or bandwidth, you
might consider link aggregation (IEEE 802.3ad, which defines LACP).
Search for 802.3ad for more info.
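
For a feel of how a link bundle shares load and survives a member
failure, here is a small Python sketch.  It assumes a simple hash of the
source/destination pair picks the member link; real switches use
vendor-specific hash inputs (MAC addresses, IP addresses, ports), and
LACP negotiation itself is not modeled.  The function name, link names,
and addresses are made up for illustration.

```python
import zlib


def pick_link(src, dst, active_links):
    """Map a flow to one member link so its packets stay in order."""
    if not active_links:
        raise ValueError("no active links in the bundle")
    key = f"{src}>{dst}".encode()
    return active_links[zlib.crc32(key) % len(active_links)]


flows = [
    ("192.0.2.1", "198.51.100.10"),
    ("192.0.2.2", "198.51.100.10"),
    ("192.0.2.3", "198.51.100.20"),
    ("192.0.2.4", "198.51.100.30"),
]

bundle = ["link0", "link1"]
for src, dst in flows:
    print(src, dst, pick_link(src, dst, bundle))

# link1 fails: every flow rehashes onto the surviving member.
bundle = ["link0"]
for src, dst in flows:
    print(src, dst, pick_link(src, dst, bundle))
```

Note that any single flow is limited to one member link's bandwidth, and
how evenly flows spread depends on the hash and the traffic mix.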

Personally, I would not recommend any configuration that requires
devices under separate administrative domains to run STP.  It's
hard enough to diagnose STP problems when you know how the switches
are configured and have full access to them.  It's nearly impossible
if you lack either one.  Obviously, a spanning-tree
problem isn't the only cause of a bridge loop, but in my experience,
having a shared L2 network increases the likelihood of this sort
of problem.  Perhaps some of the large exchange point operators will
chime in on this subject since they have a lot of experience with
this sort of configuration.

> several ways to skin this cat, and unsure of the most appropriate method..

It depends on a number of factors, and only you can decide which
methods are most appropriate for your network.  What kind of equipment
are you using?  How about your customers?  How many customers are
you supporting?  What is your ops staff most familiar with?  Do you
provide multiple drops for redundancy, for bandwidth, or both?  How
many different configurations are involved?  How much do you care
that every configuration matches some "standard"?  Are you selling IP
transit, or something different such as L2 VPNs?  Do you aggregate
customers behind L2 devices or does each customer get a port on an
L3 device?  Etc, etc, etc.

--Jeff