[c-nsp] Improve Convergence Times

Bruce Pinsky bep at whack.org
Thu Jan 11 13:51:27 EST 2007


Greene, Patrick wrote:
> Thanks in advance for your advice.
>  
> I have 2 Cisco 7600s, each with an OC-3 to a carrier's MPLS cloud.  I am
> running BGP as my EGP and OSPF as my IGP.  I redistribute my BGP into
> OSPF.  I only have about 10 routers in my OSPF backbone area.  I have
> set it up to run all traffic through one OC-3 as a primary and the other
> as failover using MED and Local-Pref.  On the carrier side, BGP
> Fast-Failover is configured in the event the carrier has an outage.  My
> problem is, when I fail my primary OC-3 interface, it takes 10-12
> seconds for OSPF to pick up the failure and hence traffic is being
> dropped out to the WAN.  What can I do to improve this convergence?
>  


What manner of failure?  Are you unplugging the optics?  Something else?

What is the underlying transport for the OC-3?  SONET?  DWDM?  Something else?

In general, it is better to get the Layer 1 mechanisms to detect the
failure rather than relying on control plane keepalives to notice.
Depending on what the optical transport is, I can provide you some
recommendations on what you should tune to detect failures faster.
Minimally, you should ensure that carrier-delay is set to zero (default is
2 seconds) and you should have IP Event Dampening enabled to penalize
flapping interfaces and reduce churn.

The real question here is what is contributing to that 10 seconds.  If it
is detecting the failure, some tuning as above will help.  Off the top of my
head, I suspect 5 seconds of it is the BGP Next Hop Tracking trigger delay.
Once the next hop is reported as down, a bestpath run will be triggered and
BGP will select the new bestpath.  You could tune that delay down a bit, but
set it too low and you run the risk of doing the bestpath run before your
IGP converges.

Once the bestpath is calculated, it must be redistributed into OSPF.
Probably not much delay there.  Once in OSPF, LSAs must be generated and
an SPF run will be needed on each of the 10 routers.  This is likely the
second largest contributor to the convergence time, since the initial SPF
delay is 5 seconds.  The minimum LSA interval and LSA generation delay can
also contribute here, depending on what other changes may be going on.
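
As a rough sanity check on those numbers, here is a minimal Python sketch
that adds up the default delays named above.  It assumes the delays simply
stack one after another and ignores everything else (LSA propagation,
redistribution, FIB updates), so treat it as back-of-the-envelope
arithmetic rather than a model of the router.  The dictionary keys and the
function name are made up for illustration.

```python
# Default delays mentioned in this thread, in milliseconds.
# The key names are illustrative labels, not IOS keywords.
DEFAULT_DELAYS_MS = {
    "carrier_delay": 2000,           # interface carrier-delay default
    "bgp_nht_trigger_delay": 5000,   # suspected Next Hop Tracking delay
    "ospf_spf_initial_delay": 5000,  # initial SPF delay
}


def total_delay_ms(delays, **overrides):
    """Sum the delays, assuming they occur strictly one after another."""
    merged = {**delays, **overrides}
    return sum(merged.values())


baseline = total_delay_ms(DEFAULT_DELAYS_MS)
no_carrier_delay = total_delay_ms(DEFAULT_DELAYS_MS, carrier_delay=0)

print(baseline / 1000, "s with defaults")
print(no_carrier_delay / 1000, "s with carrier-delay set to zero")
```

Under that serial assumption the defaults alone land in the same range as
the 10-12 seconds reported, which is why each timer is worth looking at.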

Using the "timers throttle lsa" and "timers throttle spf" commands, you can
decrease the delays introduced by the default values. Tuning these values
is more commonly referred to as doing IGP "fast convergence".  The values
you select are somewhat dependent on your topology.  There is a good
discussion of the convergence process and IGP timer tuning in the
Networkers presentation "Routed Fast Convergence and High Availability"
RST-3363.  Of course, you don't just do the tuning on the CEs facing your
providers, but throughout your OSPF domain.  The general guidelines go
something like this:

SPF
-----
SPF INITIAL:
Set the SPF INITIAL WAIT TIME to 1 ms

SPF INCREMENT:
Build a baseline of the time normally required to run SPF in the network;
this will generally be around 50 ms
Set the increment to this plus some padding, 5 to 10 ms

SPF and PRC MAXIMUM WAIT TIME:
If the normal SPF time is under 100 ms, set the maximum wait to one second
If it's higher than 100 ms, set it to:
  (S x P)/1000
  S = normal SPF time
  P = maximum percentage of processor utilization for SPF

LSAs
------
Set the link state generation initial wait time to 5 ms
This dampens some of the faster link flaps in the network
Use IP event dampening to quell link flaps, as well

Set the increment and the maximum wait times to the same values as you've
set for SPF
No point in generating LSAs faster than the routers will actually process
them!


The specific values must be set according to your topology since LSA
propagation delay from the point of failure to the rerouting node(s) will
need to be taken into account (and will vary from customer to customer
based on redundancy, network diameter, etc.).  As an example, a large SP
that was trying to achieve convergence of 5 seconds or less in its MPLS
network used these values:

SPF IW=50 ms, SPF_Increment=60 ms, SPF_Max=1000 ms
LSA Generation IW=0 ms, Increment=40 ms, Max=1000 ms

This generally gave them subsecond convergence in the case of P router
failure and 2-5 seconds in the case of PE failure (depending on the failure
mode).
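
To see how the three values in each line of that example interact, here is
a small Python sketch of the throttle back-off as I understand it: the
first run waits the initial value, the next waits the increment, and the
wait doubles on each further back-to-back event until it hits the maximum.
This is a simplified model that assumes events keep arriving and the timer
never gets a quiet period to reset; the function name is hypothetical and
this is not IOS code.

```python
def throttle_waits(initial_ms, increment_ms, max_ms, events):
    """Return the wait applied before each of `events` consecutive runs.

    Simplified model: initial wait first, then the increment, doubling
    on every further event and capped at max_ms.  Assumes no quiet
    period occurs that would reset the back-off.
    """
    waits = []
    hold = increment_ms
    for i in range(events):
        if i == 0:
            waits.append(min(initial_ms, max_ms))
        else:
            waits.append(min(hold, max_ms))
            hold = min(hold * 2, max_ms)
    return waits


# Values from the SP example above.
spf_waits = throttle_waits(50, 60, 1000, events=8)
lsa_waits = throttle_waits(0, 40, 1000, events=6)

print("SPF waits (ms):", spf_waits)
print("LSA waits (ms):", lsa_waits)
```

The point of the doubling is that a single failure is handled almost
immediately, while a flapping link quickly backs the routers off to the
maximum wait instead of running SPF continuously.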

There are a few other settings that need to be taken into account (such as
LSA arrival interval) if you are going to tune your IGP timers.  Your best
course of action if you want to go this direction is probably to work with
some Cisco tech folks who have experience doing IGP fast convergence tuning.

--
=========
bep


