[c-nsp] BGP Balanced

Tantsura, Jeff jeff.tantsura at capgemini.com
Thu Aug 26 04:46:09 EDT 2004


 Rodney,

Brilliant answer !!!!!
I hope we'll see more of that kind :)


With kind regards/ met vriendelijke groeten,
--------------------------------------------------------
Jeff Tantsura
CCIE #11416
Senior Consultant
Capgemini Nederland BV
Tel: +31(0)30 689 2866
Mob:+31(0)6 4588 6858
Fax: +31(0)30 689 6565
--------------------------------------------------------


-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Rodney Dunn
Sent: Wednesday, August 25, 2004 3:14 PM
To: Alex Rubenstein
Cc: Gert Doering; cisco-nsp at puck.nether.net
Subject: Re: [c-nsp] BGP Balanced

There seems to be enough confusion on this so let me clarify.

There are really 4 ways to do load balancing at the packet level:

a) per packet
b) per destination/prefix
c) load balancing at L2 (ie: portchannels, MLPPP, etc)
d) some other combination of parameters to do path selection

a) Per packet is exactly what it sounds like.  If you have two paths
   you round robin every other packet over the links.  Benefit: you
   get pretty equal load on both links.  Drawback: you can and will
   have out-of-order packets.
   Packets that are process switched by the router are handled
   this way, so that's one reason why you never want the router
   to be process switching data.

b) When I say per destination/prefix I mean that every packet
   that has a destination address that matches a given route
   would be forwarded down the path that is cached for that prefix.

   This is the way it works in the fastswitching vector.  You
   cache a mac header rewrite for some prefix, and every packet
   that matches that prefix in the cache gets forwarded out the
   corresponding link.  There are two bad things with this:
   1) If you have a large flow going to one destination address
      all the traffic will end up on one link while no traffic
      would be on the other.
   2) The continuous maintenance of the cache causes additional
      overhead, and to build the first entry in the cache those
      packets have to be punted to the processor.  Sorta like
      punting the first packet to build an MLS flow on a switch.

   *This is more granular but just to be accurate the cache entries
    are not a 1:1 map with the routing table.  The caching algorithm
    is like this:

 -- if we are building a fast policy entry, always cache to /32
 -- if we are building an entry against an MPOA VC, always cache
    to /32
 -- if there is no specific route to this destination, use the
    major net mask
 -- if the net is not subnetted
    -- if not directly connected, use the major net mask
    -- if directly connected, use /32
 -- if it's a supernet
    -- use the mask in the ndb
 -- if the net is subnetted
    -- if directly connected, use /32
    -- if there are multiple paths to this subnet, use /32
    -- otherwise, use longest prefix length in this major net

     When you compare 'sh ip cache' and 'sh ip route' there will not
     be a 1:1 correlation.
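
     To make the per destination idea concrete, here is a toy sketch
     (my own illustration in Python, not the actual fastswitching
     code): a cache keyed on prefix, where a miss "punts" to build
     the entry and every later packet to that prefix reuses the same
     cached path:

     # per-destination: one cached path per prefix
     import ipaddress

     class DestinationCache:
         def __init__(self, routes):
             # routes: prefix string -> list of candidate paths
             self.routes = {ipaddress.ip_network(p): v
                            for p, v in routes.items()}
             self.cache = {}   # prefix -> chosen path

         def forward(self, dst):
             dst = ipaddress.ip_address(dst)
             # longest prefix match against the "routing table"
             prefix = max((p for p in self.routes if dst in p),
                          key=lambda p: p.prefixlen)
             if prefix not in self.cache:
                 # "punt": the first packet builds the cache entry
                 self.cache[prefix] = self.routes[prefix][0]
             # every later packet to this prefix rides the same link,
             # which is why one big flow pins itself to a single path
             return self.cache[prefix]

     dc = DestinationCache({"10.0.0.0/8": ["Serial0/0", "Serial0/1"]})
     print(dc.forward("10.1.1.1"), dc.forward("10.2.2.2"))  # same link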

c)  MLPPP makes the member links look to the L3 forwarding engine
    like one L3 interface.  Therefore there really aren't equal cost
    paths at L3.  Once the packet is handed to the MLPPP code for
    forwarding at L2 there is a different decision making algorithm
    as to how the packet is forwarded over the member links.  There
    is a sequence number associated with the packets if they are
    fragmented, such that they are never out of order because the
    downstream side will reorder the packets.  This is typically
    called "load balancing" because you really are balancing the
    traffic on the links based on how congested each member link is
    at the time of transmit.
    Splitting load like in (a) and (b) is typically called "load
    sharing" because you are just sending traffic over equal cost
    links with no knowledge of which one is more congested than the
    other.
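
    A loose sketch of the member link selection (again just an
    illustration of the idea, not the real MLPPP fragmentation and
    interleaving code):

    # MLPPP-style: send each fragment on the least loaded member link,
    # tagged with a bundle-wide sequence number for reordering
    class MlpppBundle:
        def __init__(self, members):
            self.queued = {m: 0 for m in members}  # bytes queued per link
            self.seq = 0

        def transmit(self, fragment_sizes):
            sent = []
            for size in fragment_sizes:
                link = min(self.queued, key=self.queued.get)
                self.queued[link] += size
                sent.append((self.seq, link, size))
                self.seq += 1
            return sent

    bundle = MlpppBundle(["Serial0/0", "Serial0/1"])
    print(bundle.transmit([512, 512, 1500, 64]))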

d)  Then along came CEF.  The goal of CEF was to solve the cache
    maintenance issue that existed with fastswitching and in the
    process give better load distribution for routes with equal cost
    paths.  To solve the cache maintenance issue CEF simply installs
    a FIB entry for any given route that is in the routing table, and
    that FIB entry points to a corresponding adjacency entry that
    holds the mac header rewrite to forward the packet.
    If there are equal cost paths there will be a loadinfo/pathinfo
    table that holds pointers to the various paths.  The question is,
    in this scenario how does CEF decide which path to take?  For
    *every* packet CEF does a hash of the src/dst ip address in the
    packet and the result is one of the paths.
    Therefore every single packet between a given set of src/dst ip
    addresses will always take the same path.  To see the result of
    that hash you can do: sh ip cef exact-route <src> <dst>
    Therefore, by default with CEF you never have an out of order
    packet problem because all packets for a given src/dst ip address
    pair follow the same path.  You don't have a cache maintenance
    problem because the FIB is directly programmed from the routing
    table and you don't age out entries.
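
    The real IOS hash is its own algorithm, but a toy sketch shows the
    property that matters, i.e. the same src/dst pair always maps to
    the same path (illustration only):

    # CEF-style default load sharing: hash on src/dst address only
    import ipaddress, zlib

    def cef_pick_path(src, dst, paths):
        key = (ipaddress.ip_address(src).packed +
               ipaddress.ip_address(dst).packed)
        return paths[zlib.crc32(key) % len(paths)]   # stand-in hash

    paths = ["Serial0/0", "Serial0/1"]
    print(cef_pick_path("10.1.1.1", "172.16.5.9", paths))  # always the same
    print(cef_pick_path("10.1.1.2", "172.16.5.9", paths))  # may land elsewhere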

    Now if you turn on CEF per-packet it's the same as process
    switching in the load sharing context, except that you do it
    under interrupt without punting the packet to the processor.  I
    never recommend doing per packet because it does cause a
    performance degradation in the applications.  You may never hear
    about it because the hit may be small enough that the users don't
    notice it, but it will be there.


CEF really isn't truly "per destination/prefix".  And it really isn't
per flow in the usual sense either, because most people think of per
flow as including the L4 port information.  CEF in IOS today does not
do any load sharing based on anything other than the L3 src/dst
address.  However, some hardware implementations do allow load
balancing that takes into account the L4 port information, but IOS
based CEF does not.
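
A flow aware hash of the kind some hardware platforms offer only
differs in what gets fed into the hash, something like this (again a
toy sketch, not any platform's real algorithm):

    # flow-style: fold the L4 ports into the hash as well
    import ipaddress, zlib

    def flow_pick_path(src, dst, sport, dport, paths):
        key = (ipaddress.ip_address(src).packed +
               ipaddress.ip_address(dst).packed +
               sport.to_bytes(2, "big") + dport.to_bytes(2, "big"))
        return paths[zlib.crc32(key) % len(paths)]

    # two TCP sessions between the same pair of hosts can now land on
    # different links, which plain src/dst hashing will never do
    paths = ["Serial0/0", "Serial0/1"]
    print(flow_pick_path("10.1.1.1", "172.16.5.9", 51000, 80, paths))
    print(flow_pick_path("10.1.1.1", "172.16.5.9", 51001, 80, paths))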

Summary:

If you are dealing with internet/large enterprise traffic where you
have a large range of src/dst ip address combinations, the default CEF
load sharing should give you pretty close to 50/50 or 60/40 load on
equal cost paths.
Naturally if you have a large backup going between two servers you
will overload one link.  The only way to fix that is MLPPP or
per-packet (which I don't recommend).
There is a small performance penalty with MLPPP, so I suggest
customers first try the equal cost routes with CEF default load
sharing and see what kind of load distribution they get over the
links.  If that isn't close enough to equal then look at doing MLPPP
if possible.  If that's not an option then as a last resort do
per-packet.


Hopefully this helps clear up some of the confusion.

Rodney

/*Side note: I kept it simple by saying that all traffic forwarded
  from process level is round robin over equal cost paths.  Well,
  that's not 100% accurate, because if you ping from the router
  on some platforms with CEF enabled, even though the packet is
  coming from process level it can be forwarded based on the FIB
  table.  I've never taken the time to go chase down that exact
  forwarding algorithm, i.e.: what do you use as the src address for
  the hash when the src address itself comes from the outbound
  interface the hash is supposed to pick?  Chicken and egg problem*/




On Tue, Aug 24, 2004 at 09:38:37PM -0400, Alex Rubenstein wrote:
>
> Not being a wise-ass, but what is the difference between per-prefix,
> and per-destination?
>
> And, I believe what CEF *actually* does is per prefix, meaning that,
> if there is a CEF route for a /17, all traffic for that /17 goes on
> whatever link is cached for that /17 until a route update occurs.
>
>
>
>
>
>
>
> On Tue, 24 Aug 2004, Gert Doering wrote:
>
> > Hi,
> >
> > On Tue, Aug 24, 2004 at 09:57:25AM -0700, Manoj koshti wrote:
> > > It will do per-prefix load balancing, so traffic will not get
> > > equally distributed among the 3 links.
> >
> > CEF *never* does "per prefix".
> >
> > By default it does "per destination" load balancing, but you can
> > change that on interface level to "per packet".
> >
> > gert
> > --
> > USENET is *not* the non-clickable part of WWW!   //www.muc.de/~gert/
> > Gert Doering - Munich, Germany   gert at greenie.muc.de
> > fax: +49-89-35655025   gert at net.informatik.tu-muenchen.de
>
> -- Alex Rubenstein, AR97, K2AHR, alex at nac.net, latency, Al Reuben --
> --    Net Access Corporation, 800-NET-ME-36, http://www.nac.net   --
>
>
_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/

