[j-nsp] Auto-bandwidth Accuracy
Olson, Martin
molson at above.net
Tue May 25 15:23:51 EDT 2010
Yeah, I found the same behavior. Sometimes the Max AvgBW would go up to 5X-7X the value it should have been, which would lead to really high reservations after the next adjust-interval. I opened case 2009-0610-0697 about the issue, and after a while they traced the problem to PRs 438157 and 457767. The first code with the fix for both PRs is 9.6R2/9.5R3/9.4R4/9.3R5. They told us that disabling the adjust-threshold-overflow-limit in the meantime would alleviate the problem until we upgrade code.
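For anyone on affected code, that workaround would amount to something roughly like the following (the LSP name here is just a placeholder; the statement would need to be removed from every LSP, or from whatever group carries it, and re-added after upgrading):

delete protocols mpls label-switched-path example-lsp auto-bandwidth adjust-threshold-overflow-limit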
-MO
-----Original Message-----
From: Danny Vernals [mailto:danny.vernals at gmail.com]
Sent: Tuesday, May 25, 2010 5:18 AM
To: Richard A Steenbergen
Cc: juniper-nsp at puck.nether.net
Subject: Re: [j-nsp] Auto-bandwidth Accuracy
On Sun, May 23, 2010 at 7:52 AM, Richard A Steenbergen <ras at e-gerbil.net> wrote:
> Recently I've been noticing some really odd auto-bandwidth behavior on
> several different routers, and I'm wondering if anybody knows if this is
> a known bug or if I'm doing something really wrong in my autobw config.
>
> Specifically, I'm seeing many cases where the rsvp reservations on an
> interface are vastly higher than the actual traffic going over it. I
> started comparing the autobw measured bandwidth value vs the rsvp resv bandwidth
> across my LSPs (with an op script :P), and noticed that a large number
> of LSPs that were ingress on Juniper routers were consistently reserving
> more bandwidth than they were actually passing.
>
> To troubleshoot this further, I picked one LSP at random and followed it
> through the course of an entire adjust-interval. I also watched it in
> "monitor label-switched-path", and followed the bandwidth recorded for
> it in the mpls stats file. The mpls stats file pretty consistently
> recorded a bandwidth of around 900Mbps. Some samples were up to 1G, some
> were down in the 800Mb's, but nothing was significantly outside this
> range:
>
> xxx.xxxx-xxx.xxxx-BRONZE-1 20442770 pkt 21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
> xxx.xxxx-xxx.xxxx-BRONZE-1 25748678 pkt 27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
> xxx.xxxx-xxx.xxxx-BRONZE-1 31309754 pkt 33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
> xxx.xxxx-xxx.xxxx-BRONZE-1 36934965 pkt 39389728013 Byte 90729 pps 94736781 Bps Util 42.10%
> xxx.xxxx-xxx.xxxx-BRONZE-1 41323164 pkt 44001156442 Byte 86043 pps 90420165 Bps Util 40.18%
> xxx.xxxx-xxx.xxxx-BRONZE-1 46229207 pkt 49166295068 Byte 84586 pps 89054114 Bps Util 39.58%
> xxx.xxxx-xxx.xxxx-BRONZE-1 51764861 pkt 55023074603 Byte 92260 pps 97612992 Bps Util 43.38%
> xxx.xxxx-xxx.xxxx-BRONZE-1 57091315 pkt 60691783494 Byte 90278 pps 96079811 Bps Util 42.70%
> xxx.xxxx-xxx.xxxx-BRONZE-1 62138489 pkt 66009079194 Byte 90128 pps 94951708 Bps Util 42.20%
> xxx.xxxx-xxx.xxxx-BRONZE-1 67697838 pkt 72030553645 Byte 92655 pps 100357907 Bps Util 44.60%
> xxx.xxxx-xxx.xxxx-BRONZE-1 73083250 pkt 77870203449 Byte 89756 pps 97327496 Bps Util 43.25%
> xxx.xxxx-xxx.xxxx-BRONZE-1 78530642 pkt 83799427998 Byte 90789 pps 98820409 Bps Util 43.91%
> xxx.xxxx-xxx.xxxx-BRONZE-1 84166327 pkt 89767404007 Byte 85389 pps 90423878 Bps Util 40.18%
> xxx.xxxx-xxx.xxxx-BRONZE-1 89990750 pkt 96052103366 Byte 85653 pps 92422049 Bps Util 41.07%
> xxx.xxxx-xxx.xxxx-BRONZE-1 94808838 pkt 101299936674 Byte 87601 pps 95415151 Bps Util 42.40%
> xxx.xxxx-xxx.xxxx-BRONZE-1 100044983 pkt 106918990604 Byte 83113 pps 89191332 Bps Util 39.64%
> xxx.xxxx-xxx.xxxx-BRONZE-1 104706036 pkt 111928263183 Byte 86315 pps 92764307 Bps Util 41.22%
> xxx.xxxx-xxx.xxxx-BRONZE-1 109664547 pkt 117256403183 Byte 81287 pps 87346557 Bps Util 38.82%
> xxx.xxxx-xxx.xxxx-BRONZE-1 115001230 pkt 123065374817 Byte 84709 pps 92205898 Bps Util 40.98%
> xxx.xxxx-xxx.xxxx-BRONZE-1 120197917 pkt 128761293505 Byte 85191 pps 93375716 Bps Util 41.50%
> xxx.xxxx-xxx.xxxx-BRONZE-1 124790487 pkt 133783111501 Byte 79182 pps 86583068 Bps Util 38.48%
> xxx.xxxx-xxx.xxxx-BRONZE-1 129450091 pkt 138908431043 Byte 84720 pps 93187628 Bps Util 41.41%
> xxx.xxxx-xxx.xxxx-BRONZE-1 134048794 pkt 143940227806 Byte 82119 pps 89853513 Bps Util 39.93%
> xxx.xxxx-xxx.xxxx-BRONZE-1 138900130 pkt 149257983679 Byte 80855 pps 88629264 Bps Util 39.39%
> xxx.xxxx-xxx.xxxx-BRONZE-1 143665805 pkt 154447812210 Byte 79427 pps 86497142 Bps Util 38.44%
> xxx.xxxx-xxx.xxxx-BRONZE-1 148501587 pkt 159667032930 Byte 80596 pps 86987012 Bps Util 38.66%
> xxx.xxxx-xxx.xxxx-BRONZE-1 153971586 pkt 165650360517 Byte 78142 pps 85476108 Bps Util 37.99%
>
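[As a rough sketch of the kind of cross-check described above: something along these lines could summarize the per-LSP rates recorded in the statistics file so they can be eyeballed against the Max AvgBW and reservation that "show mpls lsp detail" reports. The file path is a placeholder, and it assumes the "Bps" column is bytes per second, hence the *8.]

#!/usr/bin/env python
# Rough sketch: summarize per-LSP rates from an mpls statistics file
# whose lines look like the samples quoted above.
import re
from collections import defaultdict

LINE_RE = re.compile(r'^(?P<name>\S+)\s+\d+ pkt\s+\d+ Byte\s+'
                     r'\d+ pps\s+(?P<Bps>\d+) Bps\b')

def summarize(path):
    samples = defaultdict(list)          # LSP name -> list of Bps samples
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m:
                samples[m.group('name')].append(int(m.group('Bps')))
    for name, vals in sorted(samples.items()):
        avg = 8.0 * sum(vals) / len(vals) / 1e6    # bytes/sec -> Mbps
        peak = 8.0 * max(vals) / 1e6
        print('%-40s avg %8.1f Mbps  max %8.1f Mbps  (%d samples)'
              % (name, avg, peak, len(vals)))

if __name__ == '__main__':
    summarize('/var/log/mpls-stats')     # wherever the statistics file lives

[Any LSP whose reservation sits well above the max column here would be a candidate for the behavior described below.]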
> Next, I watched the output of "show mpls lsp name BLAH detail", looking
> at the autobw measured amount (Max AvgBW) and the reserved bandwidth.
> I'm using a stats interval of 60 seconds, an adjust-interval of 900
> seconds, and in this instance no overflow samples occurred. After the
> previous adjust-interval completes the measured bw is reset to 0, and
> then starts updating again after the first 60 sec stats interval is up.
> For around the first 700 seconds the Max AvgBW was pretty close to what
> one would expect (around 900Mbps), then it jumped to ~1.6Gbps for no
> reason that I can determine. The stats file for this LSP (above) never
> showed anything above 1.0G, and a monitor of the lsp never showed any
> sample that ever got anywhere near that high (let alone enough to make an
> entire 60 sec sample interval report that high). At the end of the 900
> seconds, the 1.6G value is what was signaled to RSVP, and the cycle
> repeated itself. I watched it for several more cycles, and saw the same
> behavior happening over and over again, with measured values of 1.8G
> plus, while the stats file continued to show an average of around
> 800-900Mbps and no sample that ever went above 1G.
>
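[For reference, the setup being described corresponds to roughly the following in set-command form; a sketch only, with placeholder file and LSP names:]

set protocols mpls statistics file auto-bw-stats
set protocols mpls statistics interval 60
set protocols mpls statistics auto-bandwidth
set protocols mpls label-switched-path example-lsp auto-bandwidth adjust-interval 900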
I've seen something similar on 9.5R2, although I didn't pay it much
heed at the time as I was investigating other issues. My guess (and
it is definitely a guess) is that there is an internal data structure
which stores the LSP usage; at the end of each sampling interval that
counter is divided by the interval length and the result is written
to the statistics file. If something (an rpd scheduling issue, CPU at
100%?) prevents the value from being written at the end of an
interval, that sample effectively defaults to 0, but the counter is
not reset and keeps accumulating. When the next sampling interval
expires, two intervals' worth of usage gets divided by 1 x the
sampling interval, leading to an average bps value roughly double
what it should be.
I'll keep an eye out and report back if I see this behaviour again.
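[To put rough numbers on that guess, purely as an illustration, taking ~900 Mbps as the steady rate from the stats samples above:]

# Toy illustration of the guess above: one missed write means the next
# sample covers two intervals' worth of bytes but is still divided by
# a single 60 second statistics interval.
INTERVAL = 60                        # statistics interval, seconds
true_rate_bps = 900e6                # steady ~900 Mbps of real traffic

bytes_per_interval = true_rate_bps / 8 * INTERVAL

normal = bytes_per_interval * 8 / INTERVAL                    # ~900 Mbps
after_missed_write = 2 * bytes_per_interval * 8 / INTERVAL    # ~1800 Mbps

print('normal sample:      %.0f Mbps' % (normal / 1e6))
print('after missed write: %.0f Mbps' % (after_missed_write / 1e6))

[A single sample inflated like that would be enough to drag the Max AvgBW for the whole adjust-interval up to ~1.8G, which is roughly what the reservations described above were jumping to.]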
> This particular router is running 9.4R3, but I've seen similar behavior
> on some other 9.5R4 routers as well. This really seems like some kind of
> bug, but honestly I'd sooner slit my wrists with a rusty PIC than try to
> explain the above to JTAC (besides, they would probably just ask me for
> 50 irrelevant log files and then do nothing for the next 6 months like all
> of my other cases :P). I'm wondering if this is some kind of known
> issue, or if there is some reason why this config wouldn't work well.
>
> The stats interval of 60 seconds is because I snmp poll and graph the
> mplsLspOctets every 60 seconds, and snmp is updated based on the stats
> interval. Any value other than 60 secs makes the graphs jitter wildly.
> But in the JUNOS documentation for auto-bandwidth, there is the
> following warning:
>
> http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html
>
> Note: To prevent unnecessary resignaling of LSPs, it is best to
> configure an MPLS automatic bandwidth statistics interval of no more
> than one third the corresponding LSP adjustment interval. For example,
> if you configure a value of 30 seconds for the interval statement at the
> [edit protocols mpls statistics] hierarchy level, you should configure a
> value of no more than 90 seconds for the adjust-interval statement at
> the [edit protocols mpls label-switched-path label-switched-path-name
> auto-bandwidth] hierarchy level.
>
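[Rendered as set commands, the documented 1:3 example amounts to roughly this, with a placeholder LSP name:]

set protocols mpls statistics interval 30
set protocols mpls label-switched-path example-lsp auto-bandwidth adjust-interval 90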
> I could never figure this one out, and personally I always thought it
> was some kind of documentation error. What possible reason could there
> be for not having an adjust-interval of more than 3x the statistics
> value? I'm running 900 sec adjust-intervals with 300 sec overflow
> detection (the lowest value you can configure) to try and reduce RSVP
> resignaling load on the network. Every time an LSP resignals, it tears
> down the bypass LSPs as well, and at one point (prior to 9.4 I think) it
> took over 50 seconds before JUNOS would even try to start resignaling
> the bypass LSPs. There were some optimizations made to make it kick off
> the bypass LSP resignal within ~15 secs instead of ~50 secs, but we're
> still trying to keep it from resignaling excessively.
>
I've never seen this advice before, but I've certainly seen networks
operate fine with an adjust-interval much greater than 3x the
statistics interval.
> I'll gladly accept any clue anyone can offer on this one. :)
>
> --
> Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
> GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
>