[c-nsp] BGP Balanced

Rodney Dunn rodunn at cisco.com
Wed Aug 25 09:14:11 EDT 2004


There seems to be enough confusion on this so let
me clarify.

There are really 4 ways to do load balancing at the packet level:

a) per packet
b) per destination/prefix
c) load balancing at L2 (i.e. port channels, MLPPP, etc.)
d) some other combination of parameters to do path selection

a) That one is straightforward.  If you have two paths you round-robin
   every other packet over the links.  Benefit: you get pretty
   equal load on both links.  Drawback: you can and will have
   out of order packets.
   Packets that are process switched by the router are handled
   this way, so it's one reason why you never want the router
   to be process switching data.
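
   (A quick sanity check if you want to see whether packets are
   hitting the process switched path; treat this as a sketch, since
   the exact output varies by platform/release:

     router# show interfaces stats

   The "Processor" row of the switching path table is the process
   switched traffic.)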

b) When I say per destination/prefix I mean that every packet
   that has a destination address that matches a given route
   would be forwarded down the path that is cached for that prefix.

   This is the way it works in the fastswitching vector.  You
   cache a mac header rewrite for some prefix, and every packet
   that matches that prefix in the cache gets forwarded out the
   corresponding link.  There are two bad things with this:
   1) If you have a large flow going to one destination address
      all the traffic ends up on one link while no traffic goes
      over the other.
   2) The continuous maintenance of the cache causes additional
      overhead, and to build the first entry in the cache
      those packets have to be punted to the processor.  Sorta
      like punting the first packet to build an MLS flow on a switch.

   *This is more granular, but to be accurate, the cache entries
    are not a 1:1 map with the routing table.  The caching algorithm
    is like this:

 -- if we are building a fast policy entry, always cache to /32
 -- if we are building an entry against an MPOA VC, always cache
    to /32
 -- if there is no specific route to this destination, use the
    major net mask 
 -- if the net is not subnetted
    -- if not directly connected, use the major net mask
    -- if directly connected, use /32 
 -- if it's a supernet
    -- use the mask in the ndb
 -- if the net is subnetted
    -- if directly connected, use /32
    -- if there are multiple paths to this subnet, use /32
    -- otherwise, use longest prefix length in this major net

    When you compare 'sh ip cache' and 'sh ip route' there will not
    be a 1:1 correlation.
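
    If you want to see that mismatch for yourself, compare the RIB
    and the cache entry for the same destination (a sketch; the
    address is made up):

      router# show ip route 10.1.1.1
      router# show ip cache 10.1.1.1 255.255.255.255

    On a subnetted major net with multiple equal cost paths you'll
    see /32 cache entries even though the route itself is a shorter
    prefix.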

c)  MLPPP makes the member links look to the L3 forwarding engine
    like one L3 interface.  Therefore there really aren't equal cost
    paths at L3.  Once the packet is handed to the MLPPP code for
    forwarding at L2 there is a different decision making algorithm
    as to how the packet is forwarded over the member links.  If the
    packets are fragmented they carry a sequence number, so they are
    never out of order: the downstream side will reorder them.  This
    is typically called "load balancing" because you really are
    balancing the traffic on the links based on how congested each
    member link is at the time of transmit.
    Splitting load like in (a) and (b) is typically called "load sharing"
    because you are just sending traffic over equal cost links with no
    knowledge of which one is more congested than the other.
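
    A minimal MLPPP bundle sketch (interface numbers and addressing
    are made up, and the exact commands vary a bit by platform and
    release):

      interface Multilink1
       ip address 192.0.2.1 255.255.255.252
       ppp multilink
       ppp multilink group 1
      !
      interface Serial0/0
       encapsulation ppp
       no ip address
       ppp multilink
       ppp multilink group 1
      !
      interface Serial0/1
       encapsulation ppp
       no ip address
       ppp multilink
       ppp multilink group 1

    L3 only ever sees Multilink1, so the routing table holds a single
    path no matter how many member links are in the bundle.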

d)  Then along came CEF.  The goal of CEF was to solve the cache maintenance
    issue that existed with fastswitching and in the process give better
    load distribution for routes with equal cost paths. To solve the cache
    maintenance issue CEF simply installs a FIB entry for any given route
    that is in the routing table and that FIB entry points to a corresponding
    adjacency entry that holds the mac header rewrite to forward the packet.
    If there are equal cost paths there will be a loadinfo/pathinfo table that holds
    pointers to the various paths.  The question is in this scenario how
    does CEF decide which path to take?  For *every* packet CEF does a hash
    of the src/dst ip address in the packet and the result is one of the paths.
    Therefore for every single packet between a given set of src/dst ip addresses
    those packets will always take the same path.  To see the result of that
    hash you can do: sh ip cef exact-route <src> <dst>
    Therefore, by default with CEF you never have an out of order packet problem
    because all packets for a given src/dst ip address pair follow the same path.
    You don't have a cache maintenance problem because the FIB is directly programmed
    from the routing table and you don't age out entries.
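
    For example (addresses made up):

      router# show ip cef exact-route 10.1.1.1 172.16.1.1

    Run that for a handful of src/dst pairs and you'll see the pairs
    spread out across the equal cost paths.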

    Now if you turn on CEF per-packet it's the same as process switching in
    the load sharing context, except that you do it under interrupt without
    punting the packet to the processor.  I never recommend doing per-packet
    because it does cause a performance degradation for the applications.  You
    may never hear about it because the hit may be small enough that the users
    don't notice it, but it will be there.
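
    If you decide to do it anyway, it's a per interface knob (sketch;
    the interface name is made up):

      interface Serial0/0
       ip load-sharing per-packet

    and 'ip load-sharing per-destination' puts the interface back to
    the default behavior.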


CEF really isn't truly "per destination/prefix".  And it really isn't per flow
in the usual sense because most people think of per flow as including the
L4 port information.  CEF in IOS today does not do any load sharing based
on anything other than the L3 src/dst addresses.  However, some hardware
implementations do allow load balancing that takes into account the L4 port
information, but IOS based CEF does not.
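
Depending on your release you may also be able to change which hash CEF
uses with 'ip cef load-sharing algorithm' (for example the universal
algorithm mixes in a per-router ID so every router along a path doesn't
make the same hash choice), but even then the inputs are still the L3
addresses.  A sketch, and very much release dependent:

  router(config)# ip cef load-sharing algorithm universal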

Summary:

If you are dealing with internet/large enterprise traffic where you have
a large range of src/dst ip address combinations the default CEF
load sharing should give you pretty close to 50/50 or 60/40 load on
equal cost paths.  Naturally if you have a large backup going
between two servers you will overload one link.  The only way
to fix that is MLPPP or per-packet (which I don't recommend).
There is a small performance penalty with MLPPP, so I suggest
customers try the equal cost routes with CEF default load
sharing first and see what kind of load distribution they get
over the links.  If that isn't close enough to equal then look
at doing MLPPP if possible.  If that's not an option then, as
a last resort, do per-packet.
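
As a trivial example of getting the equal cost paths in the first place
(addresses made up; the same applies to equal cost IGP paths or BGP
multipath):

  ip route 0.0.0.0 0.0.0.0 192.0.2.1
  ip route 0.0.0.0 0.0.0.0 192.0.2.5

With CEF on, the default src/dst hash will spread traffic across both
next hops.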


Hopefully this helps clear up some of the confusion.

Rodney

/*Side note: I kept it simple by saying that all traffic forwarded
  from process level is round robin over equal cost paths.  Well,
  that's not 100% accurate, because if you ping from the router
  on some platforms with CEF enabled, even though the packet is
  coming from process level it can be forwarded based on the FIB
  table.  I've never taken the time to go chase down that exact
  forwarding algorithm.  i.e. what do you use as the src address for
  the hash if the src address is chosen based on the outbound
  interface?  Chicken-and-egg problem.*/


 

On Tue, Aug 24, 2004 at 09:38:37PM -0400, Alex Rubenstein wrote:
> 
> Not being a wise-ass, but what is the difference between per-prefix, and
> per-destination?
> 
> And, I believe what CEF *actually* does is per prefix, meaning that, if
> there is a CEF route for a /17, all traffic for that /17 goes on whatever
> link is cached for that /17 until a route update occurs.
> 
> On Tue, 24 Aug 2004, Gert Doering wrote:
> 
> > Hi,
> >
> > On Tue, Aug 24, 2004 at 09:57:25AM -0700, Manoj koshti wrote:
> > > It will do per prefix load balancing . so traffic will not get equally distributed among the  3 links.
> >
> > CEF *never* does "per prefix".
> >
> > By default it does "per destination" load balancing, but you can change
> > that on interface level to "per packet".
> >
> > gert
> > --
> > USENET is *not* the non-clickable part of WWW!
> >                                                            //www.muc.de/~gert/
> > Gert Doering - Munich, Germany                             gert at greenie.muc.de
> > fax: +49-89-35655025                        gert at net.informatik.tu-muenchen.de
> >
> 
> -- Alex Rubenstein, AR97, K2AHR, alex at nac.net, latency, Al Reuben --
> --    Net Access Corporation, 800-NET-ME-36, http://www.nac.net   --
> 
> 

