[j-nsp] BGP advertisements not making it
Richard A Steenbergen
ras at e-gerbil.net
Mon Aug 2 01:10:36 EDT 2010
On Mon, Aug 02, 2010 at 10:46:42AM +0800, Mark Tinka wrote:
> On Saturday, July 31, 2010 11:55:44 am Richard A Steenbergen
> wrote:
>
> > All my policy evaluation bugs are REALLY obscure, for
> > example subroutine policies which will randomly not
> > apply any actions in any term that doesn't contain a
> > "then accept". I've never seen a problem with a
> > configuration as simple as yours.
>
> We've had this problem - the documentation would suggest
> that a non-accept/reject action should still execute the
> policy, but then it doesn't.
>
> It has meant we need to have 'then accept' in literally all
> our policies, in addition to the real action we're trying to
> implement. It's only a few policies that work without
> needing the 'then accept' action, but these are not directly
> related to routing protocols, e.g., DCU, load balancing,
> e.t.c.
Nah, that's a bug; it doesn't happen on most routers. I couldn't even
replicate it on another router with the same code and almost identical
config talking to the same neighbor ASN with the same policy structure.
And on the one box where I could replicate it, I had to hard clear the
neighbor to make the change take effect in either direction; a soft
clear wouldn't do it. Wish me luck trying to get JTAC to figure THAT one
out. :)
Speaking of really obscure BGP issues, have you done any profiling on
the effects of out-delay by chance? I was playing a game of "why the
#$%^& does it take my router 7 minutes to even START sending a BGP
table" the other day, and started testing out different configuration
options. I noticed significant improvements with even an out-delay of 1,
though I suspect it's because it's bypassing an rpd scheduling issue and
not because I'm packing the updates that much more efficiently. :)
For example, here are repeated samples of "show bgp summary" from both
sides of two directly connected IBGP neighbors following a clear. Watch
the OutQ and the uptime to see how long it takes to exchange a certain
number of routes. In this case router A is sending 324k routes, and
router B is sending 160k routes.
First up, no out-delay, router A looking at router B. Note the OutQ is
fully populated within a few seconds, then takes 3 minutes to drain:
Peer     AS   InPkt  OutPkt    OutQ  Flaps  Last Up/Dwn  State
b.b.b.b   x   50446  120234  324199      3            7  Establ
b.b.b.b   x   50447  137142  231750      3           52  Establ
b.b.b.b   x   50456  143576  199308      3         1:06  Establ
b.b.b.b   x   50464  152255  154769      3         1:27  Establ
b.b.b.b   x   50469  159169  118402      3         1:44  Establ
b.b.b.b   x   50474  165163   87116      3         1:58  Establ
b.b.b.b   x   50479  170791   57686      3         2:11  Establ
b.b.b.b   x   50484  174573   23343      3         2:25  Establ
b.b.b.b   x   50486  176852   12026      3         2:33  Establ
b.b.b.b   x   50990  178106    6245      3         2:43  Establ
b.b.b.b   x   52036  181469       0      3         3:01  Establ
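To put a number on that drain, here is a rough Python sketch that turns the first and last samples above into an average rate. The helper names are made up for illustration, and the uptime parser only handles the plain "seconds" and "m:ss" style values that appear in this output, not the longer day/week formats Junos also prints.

```python
def uptime_to_seconds(text):
    """Convert a 'Last Up/Dwn' value such as '52' or '1:06' to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def drain_rate(samples):
    """Average routes/sec leaving the OutQ between first and last sample."""
    (t0, q0), (t1, q1) = samples[0], samples[-1]
    elapsed = uptime_to_seconds(t1) - uptime_to_seconds(t0)
    return (q0 - q1) / elapsed


# (Last Up/Dwn, OutQ) pairs copied from router A's view above
a_samples = [("7", 324199), ("52", 231750), ("1:06", 199308), ("3:01", 0)]

rate = drain_rate(a_samples)
print(f"average drain: {rate:.0f} routes/sec")
```

This is only an average over the whole window; the per-sample deltas in the table show the rate is not perfectly steady.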
Here is the other side... You can see that router B doesn't actually
start sending updates until router A is done sending (3 minutes in), for
a total convergence time of 4m30s.
Peer     AS   InPkt  OutPkt    OutQ  Flaps  Last Up/Dwn  State
a.a.a.a   x   10351       3  160971      1           35  Establ
a.a.a.a   x   61487    1558  154590      1         3:02  Establ
a.a.a.a   x   61943    6319  131500      1         3:18  Establ
a.a.a.a   x   63009   18385   72508      1         3:49  Establ
a.a.a.a   x   63296   21510   55490      1         4:10  Establ
a.a.a.a   x   63845   30355       0      1         4:28  Establ
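The stall on router B's side is easier to see as per-interval rates. A small sketch, again in Python, using the samples above with the uptimes already converted to seconds by hand (the function name is just for illustration):

```python
# (seconds since the session came up, OutQ) from router B's view above
b_samples = [(35, 160971), (182, 154590), (198, 131500),
             (229, 72508), (250, 55490), (268, 0)]


def interval_rates(samples):
    """Return (start, end, routes_per_second) for each consecutive pair."""
    rates = []
    for (t0, q0), (t1, q1) in zip(samples, samples[1:]):
        rates.append((t0, t1, (q0 - q1) / (t1 - t0)))
    return rates


for start, end, rate in interval_rates(b_samples):
    print(f"{start:>3}s-{end:>3}s: {rate:8.0f} routes/sec")
```

The first interval (35s to 3:02) comes out far slower than any of the later ones, which matches router B sitting on its queue until router A has finished sending.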
Now, here is the same thing but with out-delay enabled.
Here is side A looking at side B. This time the OutQ takes a lot longer
to populate, but side B no longer waits until side A is done sending;
both sides exchange routes at the same time.
Peer     AS   InPkt  OutPkt    OutQ  Flaps  Last Up/Dwn  State
b.b.b.b   x  120535  180845   66288      3            9  Establ
b.b.b.b   x  121337  183024  126093      3           20  Establ
b.b.b.b   x  121882  184804  226031      3           32  Establ
b.b.b.b   x  122118  188703  190614      3           51  Establ
b.b.b.b   x  130343  188704  190611      3         1:06  Establ
b.b.b.b   x  133371  196315  153121      3         1:35  Establ
  inet.0: 76933/76987/76987/0
b.b.b.b   x  152940  196923  150734      3         2:19  Establ
  inet.0: 159108/160172/160172/0
b.b.b.b   x  152952  204508   90764      3         2:49  Establ
  inet.0: 159108/160172/160172/0
b.b.b.b   x  152956  211999   53871      3         3:18  Establ
  inet.0: 159105/160172/160172/0
b.b.b.b   x  152970  221038   28340      3         3:47  Establ
  inet.0: 159108/160172/160172/0
b.b.b.b   x  152983  232156       0      3         4:30  Establ
  inet.0: 159107/160172/160172/0
Here is side B's view, showing that B->A converges in under 2 minutes
this time:
Peer     AS   InPkt  OutPkt    OutQ  Flaps  Last Up/Dwn  State
a.a.a.a   x    4576    2391   63977      5           34  Establ
a.a.a.a   x    8634   13643   84010      5         1:14  Establ
  inet.0: 12913/79107/79107/0
a.a.a.a   x   16350   33210       0      5         2:12  Establ
  inet.0: 25769/116220/116220/0
a.a.a.a   x   30216   33226       0      5         3:10  Establ
  inet.0: 35295/212110/212110/0
a.a.a.a   x   39214   33235       0      5         3:38  Establ
  inet.0: 39227/235334/235334/0
a.a.a.a   x   44349   33243       0      5         4:00  Establ
  inet.0: 40923/252531/252531/0
a.a.a.a   x   52020   33252       0      5         4:22  Establ
  inet.0: 41782/266164/266164/0
It still takes ~4m30s to converge A->B, but the dynamics completely
change. This behavior is completely reproducible, except if you do
something which causes the session to reset on commit (such as changing
a param), in which case the out-delay doesn't seem to take effect at all
on the next restart. But if you manually clear the session right
afterwards, it does. :)
--
Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)