[nsp] pinball routing

jlewis@lewis.org jlewis@lewis.org
Thu, 15 Aug 2002 22:32:10 -0400 (EDT)


We had a very strange day today.  A DS3 to a seemingly unimportant (to the 
rest of the network) POP that services a few T1 customers and some 
dial-up gear went down.  

At the same time, we noticed unusually high utilization on some T1's
feeding a different POP several hops away on a different part of our
network.

When we looked into the source of that traffic, it turned out traffic
destined for the internet originating from yet another seemingly unrelated
POP was running through the core of our network, right through the routers
that handle our internet connectivity, 3 hops deep into a chain of POPs
off our core.  At the 3rd hop, it bounced back towards the core and out to
the internet through a router it had already been through.  

Hopefully you have a fixed-width font.  In the diagram below, the
letters represent Cisco routers, all 7200's except I, A, and C, which are
3640's, and D, which is a 7513.  = is fast ethernet; each - or | is one or
more T1 or T3 lines.  O and D handle all of our internet connectivity.  
Much of the network that was unaffected is omitted.

                            W
                            |
               C--A--I--X---P=O---V
                            | |
                            D |
                            | |
                         Internet

The line between P and W went down due to a telco failure.  At the same
time, traffic from V to the internet began to go
V-O=P-X-I-A-I-X-P=O-Internet.  Traffic from the internet to V went the
usual way.  This routing was not what you'd predict by looking at the
routing tables on any of the routers and it didn't make sense that
internet traffic could enter O from V, bounce around our network for a
bit, and then get to the internet via O.  The only explanation I can think
of is oddball CEF bugs.  All the routers run CEF.
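
To make the pinball path concrete, here is a small Python sketch (not
something we ran during the outage) that takes the hop sequence quoted
above and lists the routers the traffic crossed more than once.  The
function name and the list-of-strings representation are made up for
illustration.

```python
from collections import Counter


def repeated_hops(path):
    """Return the routers that appear more than once, in first-seen order."""
    counts = Counter(path)
    seen = []
    for hop in path:
        if counts[hop] > 1 and hop not in seen:
            seen.append(hop)
    return seen


# Hop sequence from the message: V-O=P-X-I-A-I-X-P=O-Internet
observed = ["V", "O", "P", "X", "I", "A", "I", "X", "P", "O", "Internet"]
print(repeated_hops(observed))  # ['O', 'P', 'X', 'I']
```

Every router between V and A shows up twice, with A as the turnaround
point, which is what a forwarding loop that eventually resolves looks
like from the outside.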

We tried shutting down circuits, to see if we could shake the routing 
loop.  First we shut down the links from A to I.  This resulted in a loss 
of internet connectivity for V.  When the telco got the P-W 
circuit back up, our routing problem did not change.  Eventually, when we 
ran out of things to try, we rebooted A, I, X, P, and O.  When everything 
came back up, traffic from V was routed properly, but OSPF between X and I 
was not propagating routes.  We didn't see why, since it had worked 
before.  Removing and reentering the OSPF network statement covering the 
link between X and I caused OSPF to start working again and brought our 
network back to normal.

After dealing with this, we talked about what might have happened and how 
to avoid a replay.  My guess is that some questionable things we've done 
in our OSPF config (mostly a result of lots of network connectivity 
changes) tripped up bugs in CEF resulting in routing that could not be 
explained.  This got us talking about, at the very least, doing a major 
overhaul of our OSPF setup, and wondering whether it's time to look for 
another IGP.  We currently run OSPF with a backbone and several additional 
areas.  If we were to just make the whole network area 0, we'd have a good 
deal more than the recommended max number of routers in an area (according 
to Sam Halabi's OSPF design guide).  One option we're considering is to 
put the wan interfaces of the vast majority of our routers into area 0, 
and define an area at each POP for the local ethernet and perhaps include 
minor (stub) POPs in those areas.  For example, all the routers in the 
diagram above might be in area 0 on their wan interfaces, while C and the 
ethernet on A would be area 10.
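
One consequence of this layout is that any router with interfaces in more
than one area becomes an area border router (ABR).  The Python sketch
below checks a per-interface area plan for that.  The interface labels
are hypothetical, and it assumes C is treated as a minor POP sitting
wholly in area 10 (so the A-C link is in area 10 as well), which is only
one reading of the plan above.

```python
def border_routers(interface_areas):
    """Return routers with interfaces in more than one OSPF area (ABRs)."""
    areas_by_router = {}
    for (router, _iface), area in interface_areas.items():
        areas_by_router.setdefault(router, set()).add(area)
    return sorted(r for r, areas in areas_by_router.items() if len(areas) > 1)


# Hypothetical area assignment for the left end of the diagram (C--A--I).
plan = {
    ("I", "wan-to-A"): 0,
    ("A", "wan-to-I"): 0,
    ("A", "wan-to-C"): 10,   # assumption: C is wholly inside area 10
    ("A", "ethernet"): 10,
    ("C", "wan-to-A"): 10,
    ("C", "ethernet"): 10,
}
print(border_routers(plan))  # ['A']
```

Applied to the whole network, the same check would flag every POP router
that has both a wan interface in area 0 and a local ethernet area.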
 
As our network continues to grow, do we need to consider other IGP's like 
IS-IS or just iBGP, or are there large provider networks running OSPF with 
lots of routers and lots of areas?

----------------------------------------------------------------------
 Jon Lewis *jlewis@lewis.org*|  I route
 System Administrator        |  therefore you are
 Atlantic Net                |  
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________