Re: Wither irtf-rr

From: Fred Baker (fred@cisco.com)
Date: Fri Dec 15 2000 - 19:42:44 EST


At 04:27 PM 12/14/00 -0500, abha ahuja wrote:
>Great! We'd be happy to work with them... What are their concerns?

varied.

We're looking at the growth in routes, which is now roughly at the same
rate it was in 1993 prior to the deployment of CIDR. We believe that this
is primarily related to multihoming, and may be amenable to adjustment
using policies. Geoff Huston has suggested that we put a TTL on routes
advertised - when transit contracts are in place, maybe I want my immediate
neighbor to get a certain /28 route, but there is no need to send it beyond
him although other routes will go beyond him. Attacking the same problem, I
suggested another approach - if my route to you and my route to your left
nostril have the same next hop, I don't need to store the fact in my FIB,
nor advertise it beyond me. There is probably another way to solve it, and
it may be better.

We are also looking at convergence times. Vendors are in the process of
deploying a different understanding of a certain paragraph in BGP-4, which
would bring global convergence on a withdraw or a new route to roughly 30
seconds. But for many networks, 30 seconds and 30 years are still roughly
comparable; they would prefer convergence on the order of seconds. The
summary offered by Ran Atkinson was that "current convergence times leave a
lot of things in weird states for long periods of time." I have suggested
using some variation on the DUAL algorithm in a next generation BGP in
order to improve convergence; I have suggested a variation on EIGRP as a
starting point for thinking about the algorithms. I'll attach an email sent
earlier this week, following the IAB discussion of same. Again, there may
be a different or better approach.

Once we solve data distribution, I do believe that a better organization of
the information itself, and the policies, would be useful. I characterize
the way we use BGP now as "building skyscrapers out of tinkertoys". I would
be very much in favor of improved approaches. NIMROD, I believe, has a lot
of the right pieces, although I can't describe MPLS as my favorite
implementation of it. That will take more exploration than I personally
have begun.

At 03:55 PM 12/13/00 -0800, I wrote:
>attached is a story, which as of last night I understand needs some
>updating, something about 30 seconds. I'd be curious as to your comments
>on the rest of it. The rest of this is an email I sent internally, and I
>would be interested in your comments.
>
>The reference to EIGRP is to a Cisco proprietary IGP. It is essentially
>distance vector, with two clicks on that. One is that it uses the DUAL
>Algorithm to improve convergence properties; the second is that it (like
>SPF-based protocols) builds a lattice of routers and hangs routes off them.
>
>If you are not familiar with, or don't remember, DUAL, it was proposed by
>JJ Garcia in the late 1980's as a way of improving distance vector
>protocols like RIP etc. In essence, there is a lot that we fix in
>RIP/IGRP/BGP/etc using some combination of heuristics that almost work -
>poison reverse (aka withdraw), split horizon, and hold-downs. hold-downs
>tend to trump anything else; if I can avoid learning anything until after
>the network converges, I only learn good information. But to acheive that,
>they often have to be impossibly long. DUAL installs what amounts to a
>dynamic hold-down which is exactly as long as it needs to be and no longer.
>
>The idea behind DUAL, in simplest form, is that I put a sequence number, a
>timer, a list of unresponsive neighbors, a list of neighbors needing an
>acknowledgement, and a state on each route. If I get new information about
>a route which I need to propagate, and the state of the route is "not
>busy", I make the state "busy", bump the sequence number, advertise the
>sequenced routed to my neighbors, and list them as needing a response. If
>I don't receive a response in mumble time units, I retransmit. to that
>neighbor. I do that until the neighbor is deemed to have failed or until I
>get a response. The neighbor therefore receives the route; either it is
>new information or it is not, and either the router has neighbors besides
>me or not. If it is new information and he has neighbors besides me, he
>repeats the sequence to his neighbors. It it is not new information or he
>has no other neighbors, he replies "got it". The guy who advertised the
>route to him now removes that neighbor from his list of folks he needs
>responses from; when the list goes empty, he makes the route "not busy"
>and responds to the folks who advertised it to him. While the route is
>busy, it is treated as "in hold down"; when it becomes unbusy, it is
>clearly able to change. The nice thing about this is that we can assure
>convergence with respect to any given advertised route within a fairly
>nominal time period.
>
>The bit about the lattice of routers makes for some interesting
>information management. If I consider a router, from a routing
>perspective, to be a network element that advertises a bag of information,
>each route within which is an identified smaller bag of information, then
>knowing that I have lost a router implies that I have lost all the routes
>that it specifically is my connection towards. I therefore only need to
>make sure that the path to the router is stable, not the path to each of
>its routes. Hence I can separate between "hello" exchanges and "I have
>information for you" exchanges.
>
>I'm not sure how much I want to bite off here, or in what sized chunks. I
>believe that our experience with DUAL would help us to implement it in
>some variation of BGP, which would improve convergence. The latter is also
>interesting; from a NIMROD perspective, if I replaced the word "router"
>with "addressable entity, which might be a router or any set of routers,
>including one or more AS's", I might be able to build that towards some of
>the concepts from NIMROD. There might be value in that direction.
>
>Here is the other email I mentioned:
>
>
>At 03:27 PM 12/6/00 -0800, I wrote:
>>finally getting back to you. Sitting on an airplane lets you do one thing
>>for a while...
>>
>>At 10:25 AM 11/22/00 -0800, Pepe (Joseph) Garcia wrote:
>>>I don't know if your ideas include new protocols, or improvements
>>>to the current ones. I'm hoping it's both.
>>>
>>>Currently I'm pushing to focus on big performance improvements for
>>>the current protocols: OSPF, IS-IS, EIGRP, and BGP. Since in the
>>>past there's been a lot of emphasis on adding new features, but at
>>>the expense of performance and quality ;-(.
>>>
>>>I know a lot of people have bought into the MPLS track, but I
>>>wonder why we haven't made improvements in what people now call
>>>"classical IP routing." I'd like to see if we can provide another
>>>choice that's more familiar to people: IP ;-).
>>>
>>>Sorry for the rambling. At any rate, I'm all ears... I definitely
>>>need your ideas/vision in this area. I'm ready and willing to
>>>push for new ideas.
>>
>>Violent agreement on all fronts. Specific to MPLS, I have no problem with
>>using MPLS for what it brings to the party - the ability to enumerate
>>routes, which perhaps then means the ability to traffic engineer them or
>>build VPNs out of them - but I have a real problem with making the
>>Internet depend on them. They might make sense for optical networks, for
>>which a lambda has a lot of the same characteristics as an MPLS path. But
>>I see them as frosting, something added on to the fundamentals, not as
>>themselves fundamental. I looked around for some already-written emails
>>to forward to you - heaven knows I have been going on about this for a
>>while - but I don't see them. So I'll take it from the top.
>>
>>First, let's enumerate what problems we might want to fix. I think they
>>come down to:
>>
>> convergence time
>> complexity
>> number of routes
>> quality of routes
>> traffic engineering
>>
>>by "convergence time", I mean the time it takes for a reroute to
>>converge. If you look at www.irtf.org, click on the routing research
>>group, and then on the chair's introductory slide show, you will find a
>>number of measurements he has taken that suggest that typical BGP
>>convergence is on the order of 90 seconds. That's a long time. He
>>suggests that it is for a variety of reasons, not the least of which is
>>that BGP doesn't particularly try to converge quickly - it is straight
>>distance vector, and makes distance vector's problems worse in a number
>>of ways.
>>
>>by "complexity", I am not referring to the current growth in BGP routes,
>>which is exponential. I am referring to the complexity of installing and
>>configuring a routing protocol. OSPF is a bucket of bolts, and anyone who
>>turns it on winds up configuring a lot of stuff. BGP isn't properly a
>>routing protocol at all - it is an information distribution protocol, and
>>the actual routing is done by the route policy. Route policies are often
>>pretty complex; you can limit yourself to saying "I will accept only
>>these routes from this guy" if you like, but route policy often gets into
>>preferring this to that for arcane reasons, and so one. The complexity is
>>a real issue in hiring - I'm told that almost every BGP Policy contains a
>>serious bug o some kind, and generally our customers don't have the
>>expertise to realize that, let alone debug them. Hence tools we have
>>developed to help them. But what I worry about is that we are now
>>building skyscrapers out of tinker-toys; it would be really nice if we
>>could collect some of the information architecturally and build them out
>>of larger building blocks.
>>
>>That said, the growth in IP routes is pretty bad, and I suspect that
>>there are some pretty simple things we can do to limit that. Many of the
>>routes are multihomed /24 and smaller prefixes being punched through
>>alternate carriers. Many times, I suspect, the result of the route
>>calculation is that a /31 or a /24 has the same next hop as some larger
>>aggregate. Whenever that is true, I will argue that we have no strong
>>reason to set up a separate MPLS tunnel for it, to store the route in our
>>own FIB, or pass the route to anyone else. I'd love to see a simple
>>"simplify routes" mode that tells BGP to look for such cases and not pass
>>them along when they apply. There are some issues there, such as the fact
>>that if we literally did that, the network that the smaller prefix was
>>out of might not advertise the prefix on. We can overcome that, I think.
>>
>>Another is the quality of the routes we calculate. Generally, what we
>>work towards is the one that gives us the least number of hops or some
>>related administrative metric. In BGP, we tend to go for the shortest AS
>>path, to ensure convergence, and when folks want something different,
>>they have to go through conniption fits to defeat routing - ask Digital
>>Island, etc. But it turns out that we actually want something rather
>>different: what we would really like is the route that gives us the most
>>available path or the shortest round trip time. Example: Geoff Huston
>>gave me two traceroutes that go from Australia to Singapore. The shortest
>>in number of AS's goes via the US west coast, meaning it crosses the
>>pacific (130 ms delay one way) twice; the longer number of AS's has an
>>end to end delay on the order of 50 ms or less. It seems to me that
>>creating non-replicating paths is simple if we can by policy (a) not
>>accept routes from a peer that contain our AS number in the path and (b)
>>not offer routes to a peer that contain its AS number, or more generally
>>not accept routes that loop and not offer routes that will cause our
>>neighbor to loop. Identifying the "best" route is therefore not a matter
>>of selecting the path that has the least number of AS's, but selecting
>>the route that is "best" from some other perspective, such as round trip
>>or one-way delay.
>>
>>And then there is traffic engineering. We do what we do with MPLS in
>>order to achieve that. But I do find myself wondering if there is
>>something we could do that would improve matters in IP. For example,
>>while I understand the oscillation that happened in BBN SPF twenty years
>>ago, I'm not sure the experience is as relevant as we think. We have
>>shied away from anything time-based or load-based as oscillatory. I
>>wonder whether the influx of bandwidth affects that at all. Suppose, for
>>example, that route metrics told me the mean drop rate per kilobit
>>transferred, with a minimum value of ping round trip delay of the line?
>>what I wind up with by default is the network path with the shortest RTT,
>>and when the line starts dropping, we start spreading load. Suppose that
>>BGP had a "round trip to target" attribute, and which contained either
>>the mean traceroute delay to the network or the sum of the TCP RTTs of
>>the BGP sessions going to the network, and that we had a verb that said
>>"when presented with two or more potential routing paths, take the path
>>with the shortest cumulative delay"?
>>
>>
>>Several of these can be handled within the existing routing protocols,
>>and would be options that might be worth exploring. Improving BGP
>>convergence, I think, requires an improved protocol. There, I would argue
>>for the algorithms used in EIGRP, which are designed to improve
>>convergence in a distance vector protocol. Suppose we used something very
>>EIGRP-ish - in a standard protocol - as a way to distribute routing
>>information (nodes on the tree being AS's rather than routers, at least
>>in non-IBGP cases), but basically distributing BGP routing information? I
>>think we could pretty readily make the case that convergence improves.
>>
>>Just a start. Maybe you have some notions.



This archive was generated by hypermail 2b29 : Mon Aug 04 2003 - 04:10:04 EDT