[c-nsp] OSPF router gets separated from a broadcast domain

Mon Feb 4 15:43:06 EST 2008

tabor Ivan's wrote:
> I thought just the same before we got burnt by this issue. So I am
> afraid it doesn't work like this (but I am far from sure...)
> 
> Router A
> - has "customer" network x.x.x.0 as connected
> - connected to transit network t.t.t.0 with address t.t.t.a
> - loopback: a.a.a.a
> 
> Router B
> - connected to backbone
> - connected to transit network t.t.t.0 with address t.t.t.b
>          assume that we have a connection problem here, so t.t.t.b is
> up, but cannot reach t.t.t.c and t.t.t.a
> - loopback: b.b.b.b
> 
> Router C
> - connected to backbone
> - connected to transit network t.t.t.0 with address t.t.t.c
> - loopback: c.c.c.c
> 
> 
> Now Router B receives a packet with a destination address x.x.x.x. It
> makes the routing decision based on its LSDB, which will be something
> like this:
> 1. x.x.x.0 is connected to router a.a.a.a
> 2. router a.a.a.a has an interface in network t.t.t.0, namely t.t.t.a
> 3. I (Router B) also have an interface in t.t.t.0: "Hurray, we have a
> path!" BTW, I (Router B) know that Router C also has an interface in
> t.t.t.0, so if my t.t.t.b interface were down, I would
> route toward c.c.c.c. But "luckily", this is not the case this time.
> 4. Router B starts to ARP t.t.t.a without any success and drops the packet.
> 
> The routing decision on all routers will be similar in the same OSPF area.
> 
> I don't know whether it happens as I described above, but I am keen to
> find out.
> 
> cheers,
> Gabor
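
Gabor's four steps amount to an SPF run over B's LSDB. The Python below is a minimal sketch of that computation, not IOS internals: router IDs and the transit subnet t.t.t.0 are graph nodes (the subnet standing in for its network LSA), the "backbone" node is a hypothetical simplification of the core, and all interface costs of 1 are assumptions. While B is still listed on t.t.t.0, SPF picks A directly across the subnet (i.e. t.t.t.a) as the first hop, which is exactly the ARP-and-drop case described.

```python
import heapq

# Hypothetical LSDB as seen by B while B is still listed on the transit subnet.
# Routers link to networks at their interface cost (assumed 1); a network node
# links back to each attached router at cost 0, as a network LSA does.
ROUTERS = {"a.a.a.a", "b.b.b.b", "c.c.c.c"}
LSDB = {
    "a.a.a.a": {"t.t.t.0": 1},
    "b.b.b.b": {"t.t.t.0": 1, "backbone": 1},
    "c.c.c.c": {"t.t.t.0": 1, "backbone": 1},
    "t.t.t.0": {"a.a.a.a": 0, "b.b.b.b": 0, "c.c.c.c": 0},
    "backbone": {"b.b.b.b": 0, "c.c.c.c": 0},  # simplified core
}


def spf(lsdb, routers, root):
    """Dijkstra from root; returns {node: (cost, first-hop router)}."""
    dist = {root: 0}
    first_hop = {root: None}
    heap = [(0, root)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if nbr not in dist or new_cost < dist[nbr]:
                dist[nbr] = new_cost
                # Inherit the parent's first hop; the first router reached
                # after the root (network nodes are passed through) becomes it.
                hop = first_hop[node]
                if hop is None and nbr in routers:
                    hop = nbr
                first_hop[nbr] = hop
                heapq.heappush(heap, (new_cost, nbr))
    return {n: (dist[n], first_hop[n]) for n in done}


# x.x.x.0 is a stub attached to A, so B forwards it toward A's first hop.
cost, hop = spf(LSDB, ROUTERS, "b.b.b.b")["a.a.a.a"]
print(f"x.x.x.0 via {hop} at cost {cost}")  # x.x.x.0 via a.a.a.a at cost 1
```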

It does work; I have many local examples where, if it did not work, many
thousands of people would notice.

I will assume that x.x.x.0/24 is announced via "redistribute connected
subnets" on A, that no static routes are involved, and that the loopbacks
are in the area via a network statement and declared as the router-id
(remember, OSPF needs a restart to take a new router ID if it has already
selected one).

We start out with A, B, and C having full reachability, and each router
has 2 OSPF neighbors in the transport subnet.

Everything works as expected.

Now, let's assume router B still has the interface up, but the transport
is actually unreachable.  All three routers will still have an ARP entry
for the other 2, and will continue attempting to forward traffic *until*
the OSPF dead timer kicks in.  At that point, within the transport
subnet, A and C have each other as neighbors, and B has no neighbors in
this transport subnet.
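
The failure sequence is driven by the neighbor dead timer. A minimal Python sketch, assuming IOS defaults on a broadcast segment (10 s hello, 40 s dead) and a made-up timeline of hello arrival times, shows B keeping both neighbors for up to 40 s after the last hello and then losing them, while A and C keep each other. The function name is hypothetical.

```python
# IOS defaults on broadcast networks: hello every 10 s, dead after 4 hellos.
HELLO_INTERVAL = 10
DEAD_INTERVAL = 4 * HELLO_INTERVAL  # 40 s


def live_neighbors(last_hello, now, dead_interval=DEAD_INTERVAL):
    """Neighbors whose most recent hello arrived within the dead interval.

    last_hello maps neighbor router ID -> time (s) its last hello was heard.
    Until a neighbor drops out here, the router keeps forwarding toward it
    using its still-valid ARP entry, so traffic is black-holed meanwhile.
    """
    return sorted(n for n, t in last_hello.items() if now - t < dead_interval)


# Hypothetical timeline: B's transport path fails just after t = 95 s.
b_view = {"a.a.a.a": 95, "c.c.c.c": 95}
a_view = {"b.b.b.b": 95, "c.c.c.c": 135}  # A still hears C every 10 s

print(live_neighbors(b_view, now=130))  # ['a.a.a.a', 'c.c.c.c']
print(live_neighbors(b_view, now=135))  # []
print(live_neighbors(a_view, now=135))  # ['c.c.c.c']
```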

At this point, B is no longer the backbone's path into the customer
network; C is.  A still originates the customer route, and B now reaches
it via the backbone and C.
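
The key change is in the LSDB itself: once the dead timer fires, the DR's network LSA for t.t.t.0 stops listing B (B then describes the subnet as a stub network, which this sketch leaves out). A minimal Python sketch, using the same hypothetical graph model (assumed costs of 1, a simplified "backbone" node, helper names made up for illustration), shows B's route toward A's customer subnet moving from a direct hop across t.t.t.0 to a path through the backbone and C.

```python
import heapq

ROUTERS = {"a.a.a.a", "b.b.b.b", "c.c.c.c"}


def build_lsdb(on_transit):
    """Hypothetical LSDB; on_transit is the set of routers the DR lists in the
    network LSA for t.t.t.0 (i.e. routers fully adjacent on the subnet)."""
    lsdb = {
        "a.a.a.a": {},
        "b.b.b.b": {"backbone": 1},
        "c.c.c.c": {"backbone": 1},
        "backbone": {"b.b.b.b": 0, "c.c.c.c": 0},  # simplified core
        "t.t.t.0": {},
    }
    for r in on_transit:
        lsdb[r]["t.t.t.0"] = 1  # assumed interface cost
        lsdb["t.t.t.0"][r] = 0
    return lsdb


def spf(lsdb, routers, root):
    """Dijkstra from root; returns {node: (cost, first-hop router)}."""
    dist, first_hop, done = {root: 0}, {root: None}, set()
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if nbr not in dist or new_cost < dist[nbr]:
                dist[nbr] = new_cost
                # First router after the root becomes the first hop.
                hop = first_hop[node]
                if hop is None and nbr in routers:
                    hop = nbr
                first_hop[nbr] = hop
                heapq.heappush(heap, (new_cost, nbr))
    return {n: (dist[n], first_hop[n]) for n in done}


before = spf(build_lsdb(ROUTERS), ROUTERS, "b.b.b.b")["a.a.a.a"]
after = spf(build_lsdb({"a.a.a.a", "c.c.c.c"}), ROUTERS, "b.b.b.b")["a.a.a.a"]
print(before)  # (1, 'a.a.a.a'): straight across t.t.t.0
print(after)   # (2, 'c.c.c.c'): via the backbone and C
```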

What cannot be handled is the case where A can talk to B and B to C, but
not A to C.  In the case of this non-uniform or asymmetric split, it
would behave as you suggest.  Because A and B would have an in-subnet
OSPF adjacency, and B and C the same, A and C would accept routes that
point directly at each other across the subnet, even though they cannot
reach each other.
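
Why the asymmetric case fails: on a broadcast segment only the DR originates the network LSA, and it lists every router fully adjacent to the DR, so two routers that cannot hear each other can still both be listed, and SPF then treats them as one hop apart. The Python below is a simplified sketch that assumes B is the DR, ignores the BDR, and uses hypothetical helper names.

```python
from itertools import combinations

ROUTERS = ["a.a.a.a", "b.b.b.b", "c.c.c.c"]


def make_can_talk(working_pairs):
    """Symmetric reachability check built from a list of working router pairs."""
    links = {frozenset(p) for p in working_pairs}
    return lambda x, y: frozenset((x, y)) in links


def network_lsa(dr, routers, can_talk):
    """Routers the DR lists for the subnet: itself plus those fully adjacent to it."""
    return {dr} | {r for r in routers if r != dr and can_talk(dr, r)}


def black_holes(listed, can_talk):
    """Pairs SPF treats as one hop apart that cannot actually exchange packets."""
    return [(x, y) for x, y in combinations(sorted(listed), 2) if not can_talk(x, y)]


# Uniform split: B is cut off; A and C re-elect a DR (assumed A) between them.
uniform = make_can_talk([("a.a.a.a", "c.c.c.c")])
print(black_holes(network_lsa("a.a.a.a", ROUTERS, uniform), uniform))  # []

# Asymmetric split: A<->B and B<->C work, A<->C does not; B stays DR (assumed).
asym = make_can_talk([("a.a.a.a", "b.b.b.b"), ("b.b.b.b", "c.c.c.c")])
print(black_holes(network_lsa("b.b.b.b", ROUTERS, asym), asym))
# [('a.a.a.a', 'c.c.c.c')]
```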

I have only seen this type of split in a few cases: an improper VLAN ACL
at one site eating traffic from only one of the OSPF speakers; two
remote sites connected to the same edge device, such as a DSLAM or an
ME-3400, set to allow traffic to/from the core but not edge port to edge
port; and an emulated multipoint L2 service with traffic loss between
some sites but not others.

Again, the key points:

*ALL* routers should be using a loopback address as the router ID, and
this address should also be covered by an area (network) statement.

OSPF must be restarted before it will take a new router-id.
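
The router-ID rules above can be modeled in a few lines. This Python sketch follows the commonly documented IOS selection order (the `router-id` command, then the highest loopback address, then the highest address on an up interface) and treats the ID as sticky until the process is restarted; the class and its methods are hypothetical, and the addresses are from documentation/private ranges.

```python
from ipaddress import IPv4Address


def select_router_id(configured, loopbacks, interfaces):
    """IOS-style order: explicit router-id, else highest loopback, else highest up interface."""
    if configured:
        return IPv4Address(configured)
    if loopbacks:
        return max(IPv4Address(a) for a in loopbacks)
    return max(IPv4Address(a) for a in interfaces)


class OspfProcess:
    """Toy model: the router ID is chosen at start and is sticky until restart."""

    def __init__(self, configured=None, loopbacks=(), interfaces=()):
        self.configured = configured
        self.loopbacks = list(loopbacks)
        self.interfaces = list(interfaces)
        self.router_id = select_router_id(configured, self.loopbacks, self.interfaces)

    def add_loopback(self, address):
        # Config changes do not touch the running router ID.
        self.loopbacks.append(address)

    def restart(self):
        # Roughly what "clear ip ospf process" or a reload does for the ID.
        self.router_id = select_router_id(self.configured, self.loopbacks, self.interfaces)


p = OspfProcess(interfaces=["192.0.2.1", "198.51.100.1"])
print(p.router_id)  # 198.51.100.1
p.add_loopback("10.255.0.1")
print(p.router_id)  # still 198.51.100.1
p.restart()
print(p.router_id)  # 10.255.0.1: loopback wins even though it is lower
```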

Any static routing across the transport network (especially if you are
also doing redistribute static) will break things.
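
One reason static routes across the transport defeat failover is administrative distance: in IOS a static route (distance 1) beats OSPF (110), and a static route whose next hop sits on an up, connected interface stays installed even when that next hop is dead. Redistributing it would also keep advertising that path into OSPF. A minimal Python sketch of the selection, with hypothetical route records and placeholder next hops:

```python
# Default Cisco IOS administrative distances (lower wins).
ADMIN_DISTANCE = {"connected": 0, "static": 1, "ospf": 110}


def best_route(candidates):
    """Pick the installed route: lowest administrative distance, then lowest metric."""
    usable = [r for r in candidates if r["next_hop_resolves"]]
    return min(usable, key=lambda r: (ADMIN_DISTANCE[r["source"]], r["metric"]))


# B's candidates for x.x.x.0/24 after the transport silently fails. The static
# next hop t.t.t.a still resolves via the up, connected t.t.t.0 interface, so
# the static route stays installed even though t.t.t.a is unreachable.
candidates = [
    {"source": "static", "next_hop": "t.t.t.a", "metric": 0, "next_hop_resolves": True},
    # Placeholder for C's core-facing address.
    {"source": "ospf", "next_hop": "c.c.c.c", "metric": 2, "next_hop_resolves": True},
]
print(best_route(candidates)["next_hop"])  # t.t.t.a: traffic is black-holed
```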

On B and C, bumping up the OSPF cost of the transport-side interface is
a good idea if it is of similar speed to your core links; this keeps the
transport subnet from looking like a good B <--> C path.  (A classic
problem when serving customers/CPE gear off of gig subinterfaces where
your core links are also gig, so both get the same cost.)
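
The cost arithmetic behind that warning: with the IOS default reference bandwidth of 100 Mbps, cost is reference bandwidth divided by interface bandwidth, floored at 1, so gig subinterfaces and gig core links both come out at cost 1 and the transport subnet becomes an equal-cost B <--> C path. A small Python sketch; the helper names, the direct B-C core link, and the raised cost of 100 are assumptions (in IOS the knob is `ip ospf cost` on the transport-side interfaces).

```python
def ospf_cost(bandwidth_kbps, reference_kbps=100_000):
    """IOS-style interface cost: reference bandwidth / interface bandwidth, min 1.

    The default reference is 100 Mbps, so anything 100 Mbps or faster costs 1.
    """
    return max(1, reference_kbps // bandwidth_kbps)


def b_to_c_via_transport(transport_cost, core_path_cost):
    """True if B would use (or, on a tie, share) the transport subnet for B->C.

    Assumes a single direct B-C core link; the network-to-router hop costs 0.
    """
    return transport_cost <= core_path_cost


gig = ospf_cost(1_000_000)
print(gig)                             # 1: gig subint and gig core look identical
print(b_to_c_via_transport(gig, gig))  # True: transport is as good as the core
print(b_to_c_via_transport(100, gig))  # False after raising the transport cost
```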

Remember, with default timers it will take up to 40 seconds for the dead
timer to kick in, plus a few more seconds for routing to update.
Depending on the overall network, it could take anywhere from 45 seconds
to over a minute for routing to be fully re-pathed.  If you are careful,
really know what you are doing, and make sure the configs are applied to
*ALL* routers in the subnet, you can lower the timer values on the
transport network for faster failover.
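
Two things matter when tuning: the dead interval bounds detection time, and OSPF neighbors must agree on hello and dead intervals (in IOS, `ip ospf hello-interval` / `ip ospf dead-interval`) or they will not form adjacencies, which is why the change has to reach every router on the subnet. This Python sketch does the arithmetic and flags a forgotten router; the helper names, the 5 s recompute figure, and the 1 s / 4 s lowered timers are illustrative assumptions, not recommendations.

```python
from collections import Counter


def timer_mismatches(timers):
    """Routers whose (hello, dead) pair differs from the most common one on the subnet.

    OSPF neighbors must agree on both intervals, so any router listed here
    will fail to form adjacencies after a partial timer change.
    """
    expected, _ = Counter(timers.values()).most_common(1)[0]
    return sorted(r for r, t in timers.items() if t != expected)


def worst_case_reroute(dead_interval, recompute_seconds):
    """Upper bound: dead timer expiry plus an assumed SPF/routing-table update time."""
    return dead_interval + recompute_seconds


# Defaults (10 s hello / 40 s dead) vs. lowered transport timers (1 s / 4 s),
# each with a hypothetical 5 s for SPF and table updates.
print(worst_case_reroute(40, 5))  # 45
print(worst_case_reroute(4, 5))   # 9

# One router forgotten when lowering the timers.
subnet = {"a.a.a.a": (1, 4), "b.b.b.b": (1, 4), "c.c.c.c": (10, 40)}
print(timer_mismatches(subnet))  # ['c.c.c.c']
```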

-- 
------------------------------------------------------------------------
Christopher E. Brown   <chris.brown at acsalaska.net>   desk (907) 550-8393
                                                     cell (907) 632-8492
IP Engineer - ACS
------------------------------------------------------------------------