[j-nsp] Auto-bandwidth Accuracy

Olson, Martin molson at above.net
Tue May 25 16:31:57 EDT 2010


Yeah, I can't view that PR either, and it was our JTAC case that
triggered it to be opened.  I don't remember all the details of the
trigger mechanism.  Can anybody get Juniper to make the PR public?

-MO


-----Original Message-----
From: Phil Bedard [mailto:philxor at gmail.com] 
Sent: Tuesday, May 25, 2010 4:05 PM
To: Olson, Martin
Cc: Danny Vernals; Richard A Steenbergen; juniper-nsp at puck.nether.net
Subject: Re: [j-nsp] Auto-bandwidth Accuracy

Do you have the details on PR 457767? It doesn't seem to show up in the
system. I tried to duplicate 438157 and could never successfully do so.

Thanks, 
Phil  

On May 25, 2010, at 3:23 PM, Olson, Martin wrote:

> Yeah, I found the same behavior.  Sometimes the Max AvgBW would go up
> by 5X-7X the value it should've, which would lead to really high
> reservations after the next adjust-interval.  I opened case
> 2009-0610-0697 about the issue, and after a while they traced the
> problem to PRs 438157 and 457767.  The first code with the fix for both
> PRs is 9.6R2/9.5R3/9.4R4/9.3R5.  They told us that if we disabled the
> adjust-threshold-overflow-limit in the meantime, that would alleviate
> the problem until we upgrade code.
> 
> -MO
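A rough sketch of where that statement sits and what the interim
workaround might look like (the LSP name and values here are
placeholders, not taken from Martin's configuration):

  [edit protocols mpls label-switched-path EXAMPLE-LSP]
  auto-bandwidth {
      adjust-interval 900;
      adjust-threshold 10;
      /* deactivating or deleting this statement is the interim
         workaround described above */
      adjust-threshold-overflow-limit 5;
  }

From configuration mode that would be something along the lines of:

  deactivate protocols mpls label-switched-path EXAMPLE-LSP auto-bandwidth adjust-threshold-overflow-limit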
> 
> 
> -----Original Message-----
> From: Danny Vernals [mailto:danny.vernals at gmail.com] 
> Sent: Tuesday, May 25, 2010 5:18 AM
> To: Richard A Steenbergen
> Cc: juniper-nsp at puck.nether.net
> Subject: Re: [j-nsp] Auto-bandwidth Accuracy
> 
> On Sun, May 23, 2010 at 7:52 AM, Richard A Steenbergen
> <ras at e-gerbil.net> wrote:
>> Recently I've been noticing some really odd auto-bandwidth behavior on
>> several different routers, and I'm wondering if anybody knows if this
>> is a known bug or if I'm doing something really wrong in my autobw
>> config.
>> 
>> Specifically, I'm seeing many cases where the rsvp reservations on an
>> interface are vastly higher than the actual traffic going over it. I
>> started comparing the autobw measured bandwidth value vs rsvp resv
>> bandwidth across my LSPs (with an op script :P), and noticed that a
>> large number of LSPs that were ingress on Juniper routers were
>> consistently reserving more bandwidth than they were actually passing.
>> 
>> To troubleshoot this further, I picked one LSP at random and followed
>> it through the course of an entire adjust-interval. I also watched it
>> in "monitor label-switched-path", and followed the bandwidth recorded
>> for it in the mpls stats file. The mpls stats file pretty consistently
>> recorded a bandwidth of around 900Mbps. Some samples were up to 1G,
>> some were down in the 800Mb's, but nothing was significantly outside
>> this range:
>> 
>> xxx.xxxx-xxx.xxxx-BRONZE-1     20442770 pkt    21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     25748678 pkt    27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     31309754 pkt    33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     36934965 pkt    39389728013 Byte 90729 pps 94736781 Bps Util 42.10%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     41323164 pkt    44001156442 Byte 86043 pps 90420165 Bps Util 40.18%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     46229207 pkt    49166295068 Byte 84586 pps 89054114 Bps Util 39.58%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     51764861 pkt    55023074603 Byte 92260 pps 97612992 Bps Util 43.38%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     57091315 pkt    60691783494 Byte 90278 pps 96079811 Bps Util 42.70%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     62138489 pkt    66009079194 Byte 90128 pps 94951708 Bps Util 42.20%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     67697838 pkt    72030553645 Byte 92655 pps 100357907 Bps Util 44.60%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     73083250 pkt    77870203449 Byte 89756 pps 97327496 Bps Util 43.25%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     78530642 pkt    83799427998 Byte 90789 pps 98820409 Bps Util 43.91%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     84166327 pkt    89767404007 Byte 85389 pps 90423878 Bps Util 40.18%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     89990750 pkt    96052103366 Byte 85653 pps 92422049 Bps Util 41.07%
>> xxx.xxxx-xxx.xxxx-BRONZE-1     94808838 pkt   101299936674 Byte 87601 pps 95415151 Bps Util 42.40%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    100044983 pkt   106918990604 Byte 83113 pps 89191332 Bps Util 39.64%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    104706036 pkt   111928263183 Byte 86315 pps 92764307 Bps Util 41.22%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    109664547 pkt   117256403183 Byte 81287 pps 87346557 Bps Util 38.82%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    115001230 pkt   123065374817 Byte 84709 pps 92205898 Bps Util 40.98%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    120197917 pkt   128761293505 Byte 85191 pps 93375716 Bps Util 41.50%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    124790487 pkt   133783111501 Byte 79182 pps 86583068 Bps Util 38.48%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    129450091 pkt   138908431043 Byte 84720 pps 93187628 Bps Util 41.41%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    134048794 pkt   143940227806 Byte 82119 pps 89853513 Bps Util 39.93%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    138900130 pkt   149257983679 Byte 80855 pps 88629264 Bps Util 39.39%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    143665805 pkt   154447812210 Byte 79427 pps 86497142 Bps Util 38.44%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    148501587 pkt   159667032930 Byte 80596 pps 86987012 Bps Util 38.66%
>> xxx.xxxx-xxx.xxxx-BRONZE-1    153971586 pkt   165650360517 Byte 78142 pps 85476108 Bps Util 37.99%
>> 
>> Next, I watched the output of "show mpls lsp name BLAH detail",
>> looking at the autobw measured amount (Max AvgBW) and the reserved
>> bandwidth. I'm using a stats interval of 60 seconds, an
>> adjust-interval of 900 seconds, and in this instance no overflow
>> samples occurred. After the previous adjust-interval completes the
>> measured bw is reset to 0, and then starts updating again after the
>> first 60 sec stats interval is up. For around the first 700 seconds
>> the Max AvgBW was pretty close to what one would expect (around
>> 900Mbps), then it jumped to ~1.6Gbps for no reason that I can
>> determine. The stats file for this LSP (above) never showed anything
>> above 1.0G, and a monitor of the lsp never showed any sample that ever
>> got anywhere near that high (let alone enough to make an entire 60 sec
>> sample interval report that high). At the end of the 900 seconds, the
>> 1.6G value is what was signaled to RSVP, and the cycle repeated
>> itself. I watched it for several more cycles, and saw the same
>> behavior happening over and over again, with measured values of 1.8G
>> plus, while the stats file continued to show an average of around
>> 800-900Mbps and no sample that ever went above 1G.
>> 
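For reference, the setup Richard describes would look roughly like the
following; the LSP name, destination, and file name are placeholders,
and only the 60-second statistics interval and 900-second
adjust-interval come from his description:

  protocols {
      mpls {
          statistics {
              file auto-bw.stats;
              /* 60-second sampling, i.e. the stats file entries above */
              interval 60;
              auto-bandwidth;
          }
          label-switched-path EXAMPLE-LSP {
              to 192.0.2.1;
              auto-bandwidth {
                  /* reservation recomputed from Max AvgBW every 900s */
                  adjust-interval 900;
              }
          }
      }
  }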
> 
> I've seen something similar on 9.5R2, although I didn't pay it much
> heed at the time as I was investigating other issues.  My guess (and
> it is definitely a guess) is that there is an internal data structure
> which stores the LSP usage, and which is divided by the sampling
> interval and written to the statistics file when each sampling
> interval expires.  If something (rpd scheduling issue, CPU at 100%?)
> prevents this value from being written to the statistics file at the
> end of the sampling interval, that entry gets a default value of 0.
> The data structure keeps the stats from the previous sampling interval
> and continues to accumulate on top of them.  When the next sampling
> interval expires, this accumulated value is divided by only 1 x the
> sampling interval, leading to an average bps value roughly double what
> it should be.
> 
> I'll keep an eye out and report back if I see this behaviour again.
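To put rough numbers on Danny's guess using the stats file above: the
samples there run around 90-100 MBps (bytes/sec), i.e. roughly 720-800
Mbps. If one 60-second write were missed and two intervals' worth of
bytes later got divided by a single 60-second interval, the computed
average would double to roughly 1.4-1.6 Gbps, which is in the same
ballpark as the ~1.6 Gbps Max AvgBW jump Richard describes. That is
only illustrative arithmetic based on the guess above, not a confirmed
mechanism.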
> 
> 
>> This particular router is running 9.4R3, but I've seen similar
>> behavior on some other 9.5R4 routers as well. This really seems like
>> some kind of bug, but honestly I'd sooner slit my wrists with a rusty
>> PIC than try to explain the above to JTAC (besides, they would
>> probably just ask me for 50 irrelevant log files then do nothing for
>> the next 6 months like all of my other cases :P). I'm wondering if
>> this is some kind of known issue, or if there is some reason why this
>> config wouldn't work well.
>> 
>> The stats interval of 60 seconds is because I snmp poll and graph the
>> mplsLspOctets every 60 seconds, and snmp is updated based on the stats
>> interval. Any value other than 60 secs makes the graphs wildly jitter.
>> But in the JUNOS documentation for auto-bandwidth, there is the
>> following warning:
>> 
>> http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html
>> 
>> Note: To prevent unnecessary resignaling of LSPs, it is best to
>> configure an MPLS automatic bandwidth statistics interval of no more
>> than one third the corresponding LSP adjustment interval. For example,
>> if you configure a value of 30 seconds for the interval statement at
>> the [edit protocols mpls statistics] hierarchy level, you should
>> configure a value of no more than 90 seconds for the adjust-interval
>> statement at the [edit protocols mpls label-switched-path
>> label-switched-path-name auto-bandwidth] hierarchy level.
>> 
>> I could never figure this one out, and personally I always thought it
>> was some kind of documentation error. What possible reason could there
>> be for not having an adjust-interval of more than 3x the statistics
>> value? I'm running 900 sec adjust-intervals with 300 sec overflow
>> detection (the lowest value you can configure) to try and reduce RSVP
>> resignaling load on the network. Every time an LSP resignals, it tears
>> down the bypass LSPs as well, and at one point (prior to 9.4 I think)
>> it took over 50 seconds before JUNOS would even try to start
>> resignaling the bypass LSPs. There were some optimizations made to
>> make it kick off the bypass LSP resignal within ~15 secs instead of
>> ~50 secs, but we're still trying to keep it from resignaling
>> excessively.
>> 
> 
> I've never seen this advice before, but I've certainly seen networks
> operate fine with an adjust-interval much greater than 3x the
> statistics interval.
> 
>> I'll gladly accept any clue anyone can offer on this one. :)
>> 
>> --
>> Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
>> GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)