[j-nsp] Auto-bandwidth Accuracy
Olson, Martin
molson at above.net
Tue May 25 15:23:51 EDT 2010
Yeah, I found the same behavior. Sometimes the Max AvgBW would go up to 5X-7X the value it should have been, which would lead to really high reservations after the next adjust-interval. I opened case 2009-0610-0697 about the issue, and after a while they traced the problem to PRs 438157 and 457767. The first code with the fix for both PRs is 9.6R2/9.5R3/9.4R4/9.3R5. They told us that disabling the adjust-threshold-overflow-limit in the meantime would alleviate the problem until we upgrade code.
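For anyone on affected code, that workaround would amount to something roughly like the following (the LSP name here is just a placeholder; the statement would need to be removed from every LSP, or from whatever group carries it, and re-added after upgrading):

delete protocols mpls label-switched-path example-lsp auto-bandwidth adjust-threshold-overflow-limit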
-MO
-----Original Message-----
From: Danny Vernals [mailto:danny.vernals at gmail.com]
Sent: Tuesday, May 25, 2010 5:18 AM
To: Richard A Steenbergen
Cc: juniper-nsp at puck.nether.net
Subject: Re: [j-nsp] Auto-bandwidth Accuracy
On Sun, May 23, 2010 at 7:52 AM, Richard A Steenbergen <ras at e-gerbil.net> wrote:
> Recently I've been noticing some really odd auto-bandwidth behavior on
> several different routers, and I'm wondering if anybody knows if this is
> a known bug or if I'm doing something really wrong in my autobw config.
>
> Specifically, I'm seeing many cases where the rsvp reservations on an
> interface are vastly higher than the actual traffic going over it. I
> started comparing the autobw measured bandwidth value vs the rsvp resv bandwidth
> across my LSPs (with an op script :P), and noticed that a large number
> of LSPs that were ingress on Juniper routers were consistently reserving
> more bandwidth than they were actually passing.
>
> To troubleshoot this further, I picked one LSP at random and followed it
> through the course of an entire adjust-interval. I also watched it in
> "monitor label-switched-path", and followed the bandwidth recorded for
> it in the mpls stats file. The mpls stats file pretty consistently
> recorded a bandwidth of around 900Mbps. Some samples were up to 1G, some
> were down in the 800Mb's, but nothing was significantly outside this
> range:
>
> xxx.xxxx-xxx.xxxx-BRONZE-1 20442770 pkt 21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
> xxx.xxxx-xxx.xxxx-BRONZE-1 25748678 pkt 27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
> xxx.xxxx-xxx.xxxx-BRONZE-1 31309754 pkt 33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
> xxx.xxxx-xxx.xxxx-BRONZE-1 36934965 pkt 39389728013 Byte 90729 pps 94736781 Bps Util 42.10%
> xxx.xxxx-xxx.xxxx-BRONZE-1 41323164 pkt 44001156442 Byte 86043 pps 90420165 Bps Util 40.18%
> xxx.xxxx-xxx.xxxx-BRONZE-1 46229207 pkt 49166295068 Byte 84586 pps 89054114 Bps Util 39.58%
> xxx.xxxx-xxx.xxxx-BRONZE-1 51764861 pkt 55023074603 Byte 92260 pps 97612992 Bps Util 43.38%
> xxx.xxxx-xxx.xxxx-BRONZE-1 57091315 pkt 60691783494 Byte 90278 pps 96079811 Bps Util 42.70%
> xxx.xxxx-xxx.xxxx-BRONZE-1 62138489 pkt 66009079194 Byte 90128 pps 94951708 Bps Util 42.20%
> xxx.xxxx-xxx.xxxx-BRONZE-1 67697838 pkt 72030553645 Byte 92655 pps 100357907 Bps Util 44.60%
> xxx.xxxx-xxx.xxxx-BRONZE-1 73083250 pkt 77870203449 Byte 89756 pps 97327496 Bps Util 43.25%
> xxx.xxxx-xxx.xxxx-BRONZE-1 78530642 pkt 83799427998 Byte 90789 pps 98820409 Bps Util 43.91%
> xxx.xxxx-xxx.xxxx-BRONZE-1 84166327 pkt 89767404007 Byte 85389 pps 90423878 Bps Util 40.18%
> xxx.xxxx-xxx.xxxx-BRONZE-1 89990750 pkt 96052103366 Byte 85653 pps 92422049 Bps Util 41.07%
> xxx.xxxx-xxx.xxxx-BRONZE-1 94808838 pkt 101299936674 Byte 87601 pps 95415151 Bps Util 42.40%
> xxx.xxxx-xxx.xxxx-BRONZE-1 100044983 pkt 106918990604 Byte 83113 pps 89191332 Bps Util 39.64%
> xxx.xxxx-xxx.xxxx-BRONZE-1 104706036 pkt 111928263183 Byte 86315 pps 92764307 Bps Util 41.22%
> xxx.xxxx-xxx.xxxx-BRONZE-1 109664547 pkt 117256403183 Byte 81287 pps 87346557 Bps Util 38.82%
> xxx.xxxx-xxx.xxxx-BRONZE-1 115001230 pkt 123065374817 Byte 84709 pps 92205898 Bps Util 40.98%
> xxx.xxxx-xxx.xxxx-BRONZE-1 120197917 pkt 128761293505 Byte 85191 pps 93375716 Bps Util 41.50%
> xxx.xxxx-xxx.xxxx-BRONZE-1 124790487 pkt 133783111501 Byte 79182 pps 86583068 Bps Util 38.48%
> xxx.xxxx-xxx.xxxx-BRONZE-1 129450091 pkt 138908431043 Byte 84720 pps 93187628 Bps Util 41.41%
> xxx.xxxx-xxx.xxxx-BRONZE-1 134048794 pkt 143940227806 Byte 82119 pps 89853513 Bps Util 39.93%
> xxx.xxxx-xxx.xxxx-BRONZE-1 138900130 pkt 149257983679 Byte 80855 pps 88629264 Bps Util 39.39%
> xxx.xxxx-xxx.xxxx-BRONZE-1 143665805 pkt 154447812210 Byte 79427 pps 86497142 Bps Util 38.44%
> xxx.xxxx-xxx.xxxx-BRONZE-1 148501587 pkt 159667032930 Byte 80596 pps 86987012 Bps Util 38.66%
> xxx.xxxx-xxx.xxxx-BRONZE-1 153971586 pkt 165650360517 Byte 78142 pps 85476108 Bps Util 37.99%
>
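[As a rough sketch of the kind of cross-check described above: something along these lines could summarize the per-LSP rates recorded in the statistics file so they can be eyeballed against the Max AvgBW and reservation that "show mpls lsp detail" reports. The file path is a placeholder, and it assumes the "Bps" column is bytes per second, hence the *8.]

#!/usr/bin/env python
# Rough sketch: summarize per-LSP rates from an mpls statistics file
# whose lines look like the samples quoted above.
import re
from collections import defaultdict

LINE_RE = re.compile(r'^(?P<name>\S+)\s+\d+ pkt\s+\d+ Byte\s+'
                     r'\d+ pps\s+(?P<Bps>\d+) Bps\b')

def summarize(path):
    samples = defaultdict(list)          # LSP name -> list of Bps samples
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m:
                samples[m.group('name')].append(int(m.group('Bps')))
    for name, vals in sorted(samples.items()):
        avg = 8.0 * sum(vals) / len(vals) / 1e6    # bytes/sec -> Mbps
        peak = 8.0 * max(vals) / 1e6
        print('%-40s avg %8.1f Mbps  max %8.1f Mbps  (%d samples)'
              % (name, avg, peak, len(vals)))

if __name__ == '__main__':
    summarize('/var/log/mpls-stats')     # wherever the statistics file lives

[Any LSP whose reservation sits well above the max column here would be a candidate for the behavior described below.]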
> Next, I watched the output of "show mpls lsp name BLAH detail", looking
> at the autobw measured amount (Max AvgBW) and the reserved bandwidth.
> I'm using a stats interval of 60 seconds, an adjust-interval of 900
> seconds, and in this instance no overflow samples occurred. After the
> previous adjust-interval completes the measured bw is reset to 0, and
> then starts updating again after the first 60 sec stats interval is up.
> For around the first 700 seconds the Max AvgBW was pretty close to what
> one would expect (around 900Mbps), then it jumped to ~1.6Gbps for no
> reason that I can determine. The stats file for this LSP (above) never
> showed anything above 1.0G, and a monitor of the lsp never showed any
> sample that ever got anywhere near that high (let alone enough to make an
> entire 60 sec sample interval report that high). At the end of the 900
> seconds, the 1.6G value is what was signaled to RSVP, and the cycle
> repeated itself. I watched it for several more cycles, and saw the same
> behavior happening over and over again, with measured values of 1.8G
> plus, while the stats file continued to show an average of around
> 800-900Mbps and no sample that ever went above 1G.
>
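[For reference, the setup being described corresponds to roughly the following in set-command form; a sketch only, with placeholder file and LSP names:]

set protocols mpls statistics file auto-bw-stats
set protocols mpls statistics interval 60
set protocols mpls statistics auto-bandwidth
set protocols mpls label-switched-path example-lsp auto-bandwidth adjust-interval 900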
I've seen something similar on 9.5R2, although I didn't pay it much
heed at the time as I was investigating other issues. My guess (and
it is definitely a guess) is that there is an internal data structure
which stores the LSP usage; at the end of each sampling interval that
counter is divided by the interval length and the result is written
to the statistics file. If something (an rpd scheduling issue, CPU at
100%?) prevents the value from being written at the end of an
interval, that sample effectively defaults to 0, but the counter is
not reset and keeps accumulating. When the next sampling interval
expires, two intervals' worth of usage gets divided by 1 x the
sampling interval, leading to an average bps value roughly double
what it should be.
I'll keep an eye out and report back if I see this behaviour again.
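[To put rough numbers on that guess, purely as an illustration, taking ~900 Mbps as the steady rate from the stats samples above:]

# Toy illustration of the guess above: one missed write means the next
# sample covers two intervals' worth of bytes but is still divided by
# a single 60 second statistics interval.
INTERVAL = 60                        # statistics interval, seconds
true_rate_bps = 900e6                # steady ~900 Mbps of real traffic

bytes_per_interval = true_rate_bps / 8 * INTERVAL

normal = bytes_per_interval * 8 / INTERVAL                    # ~900 Mbps
after_missed_write = 2 * bytes_per_interval * 8 / INTERVAL    # ~1800 Mbps

print('normal sample:      %.0f Mbps' % (normal / 1e6))
print('after missed write: %.0f Mbps' % (after_missed_write / 1e6))

[A single sample inflated like that would be enough to drag the Max AvgBW for the whole adjust-interval up to ~1.8G, which is roughly what the reservations described above were jumping to.]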
> This particular router is running 9.4R3, but I've seen similar behavior
> on some other 9.5R4 routers as well. This really seems like some kind of
> bug, but honestly I'd sooner slit my wrists with a rusty PIC than try to
> explain the above to JTAC (besides, they would probably just ask me for
> 50 irrelevant log files and then do nothing for the next 6 months like all
> of my other cases :P). I'm wondering if this is some kind of known
> issue, or if there is some reason why this config wouldn't work well.
>
> The stats interval of 60 seconds is because I snmp poll and graph the
> mplsLspOctets every 60 seconds, and snmp is updated based on the stats
> interval. Any value other than 60 secs makes the graphs jitter wildly.
> But in the JUNOS documentation for auto-bandwidth, there is the
> following warning:
>
> http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html
>
> Note: To prevent unnecessary resignaling of LSPs, it is best to
> configure an MPLS automatic bandwidth statistics interval of no more
> than one third the corresponding LSP adjustment interval. For example,
> if you configure a value of 30 seconds for the interval statement at the
> [edit protocols mpls statistics] hierarchy level, you should configure a
> value of no more than 90 seconds for the adjust-interval statement at
> the [edit protocols mpls label-switched-path label-switched-path-name
> auto-bandwidth] hierarchy level.
>
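[Rendered as set commands, the documented 1:3 example amounts to roughly this, with a placeholder LSP name:]

set protocols mpls statistics interval 30
set protocols mpls label-switched-path example-lsp auto-bandwidth adjust-interval 90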
> I could never figure this one out, and personally I always thought it
> was some kind of documentation error. What possible reason could there
> be for not having an adjust-interval of more than 3x the statistics
> value? I'm running 900 sec adjust-intervals with 300 sec overflow
> detection (the lowest value you can configure) to try and reduce RSVP
> resignaling load on the network. Every time an LSP resignals, it tears
> down the bypass LSPs as well, and at one point (prior to 9.4 I think) it
> took over 50 seconds before JUNOS would even try to start resignaling
> the bypass LSPs. There were some optimizations made to make it kick off
> the bypass LSP resignal within ~15 secs instead of ~50 secs, but we're
> still trying to keep it from resignaling excessively.
>
I've never seen this advice before, but I've certainly seen networks
operate fine with an adjust-interval much greater than 3x the
statistics interval.
> I'll gladly accept any clue anyone can offer on this one. :)
>
> --
> Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
> GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
>