[j-nsp] Auto-bandwidth Accuracy
Phil Bedard
philxor at gmail.com
Tue May 25 16:04:50 EDT 2010
Do you have the details on PR 457767? It doesn't seem to show up in the system. I tried to duplicate 438157 and could never successfully do so.
Thanks,
Phil
On May 25, 2010, at 3:23 PM, Olson, Martin wrote:
> Yeah, I found the same behavior. Sometimes the Max AvgBW would go up to 5x-7x the value it should have been, which would lead to really high reservations after the next adjust-interval. I opened case 2009-0610-0697 about the issue, and after a while they traced the problem to PRs 438157 and 457767. The first code with the fix for both PRs is 9.6R2/9.5R3/9.4R4/9.3R5. They told us that disabling the adjust-threshold-overflow-limit in the meantime would alleviate the problem until we upgrade code.
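> For anyone else hitting this before they can upgrade: the knob lives under the
> LSP's auto-bandwidth stanza, so the interim workaround is just to delete (or
> deactivate) it, roughly along these lines -- the LSP name is a placeholder:
>
>   [edit protocols mpls label-switched-path <lsp-name> auto-bandwidth]
>   user@router# delete adjust-threshold-overflow-limit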
>
> -MO
>
>
> -----Original Message-----
> From: Danny Vernals [mailto:danny.vernals at gmail.com]
> Sent: Tuesday, May 25, 2010 5:18 AM
> To: Richard A Steenbergen
> Cc: juniper-nsp at puck.nether.net
> Subject: Re: [j-nsp] Auto-bandwidth Accuracy
>
> On Sun, May 23, 2010 at 7:52 AM, Richard A Steenbergen <ras at e-gerbil.net> wrote:
>> Recently I've been noticing some really odd auto-bandwidth behavior on
>> several different routers, and I'm wondering if anybody knows if this is
>> a known bug or if I'm doing something really wrong in my autobw config.
>>
>> Specifically, I'm seeing many cases where the rsvp reservations on an
>> interface are vastly higher than the actual traffic going over it. I
>> started comparing the autobw measured bandwidth value vs the rsvp resv bandwidth
>> across my LSPs (with an op script :P), and noticed that a large number
>> of LSPs that were ingress on Juniper routers were consistently reserving
>> more bandwidth than they were actually passing.
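>> (If you just want to spot-check a single LSP by hand rather than script it, the
>> same two numbers can be pulled straight from the CLI -- <lsp-name> being a
>> placeholder:
>>
>>   show mpls lsp name <lsp-name> detail        <- measured Max AvgBW
>>   show rsvp session name <lsp-name> detail    <- bandwidth actually signaled
>>
>> though obviously that doesn't scale across a few hundred LSPs.)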
>>
>> To troubleshoot this further, I picked one LSP at random and followed it
>> through the course of an entire adjust-interval. I also watched it in
>> "monitor label-switched-path", and followed the bandwidth recorded for
>> it in the mpls stats file. The mpls stats file pretty consistently
>> recorded a bandwidth of around 900Mbps. Some samples were up to 1G, some
>> were down in the 800Mb's, but nothing was significantly outside this
>> range:
>>
>> xxx.xxxx-xxx.xxxx-BRONZE-1 20442770 pkt 21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 25748678 pkt 27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 31309754 pkt 33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 36934965 pkt 39389728013 Byte 90729 pps 94736781 Bps Util 42.10%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 41323164 pkt 44001156442 Byte 86043 pps 90420165 Bps Util 40.18%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 46229207 pkt 49166295068 Byte 84586 pps 89054114 Bps Util 39.58%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 51764861 pkt 55023074603 Byte 92260 pps 97612992 Bps Util 43.38%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 57091315 pkt 60691783494 Byte 90278 pps 96079811 Bps Util 42.70%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 62138489 pkt 66009079194 Byte 90128 pps 94951708 Bps Util 42.20%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 67697838 pkt 72030553645 Byte 92655 pps 100357907 Bps Util 44.60%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 73083250 pkt 77870203449 Byte 89756 pps 97327496 Bps Util 43.25%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 78530642 pkt 83799427998 Byte 90789 pps 98820409 Bps Util 43.91%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 84166327 pkt 89767404007 Byte 85389 pps 90423878 Bps Util 40.18%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 89990750 pkt 96052103366 Byte 85653 pps 92422049 Bps Util 41.07%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 94808838 pkt 101299936674 Byte 87601 pps 95415151 Bps Util 42.40%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 100044983 pkt 106918990604 Byte 83113 pps 89191332 Bps Util 39.64%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 104706036 pkt 111928263183 Byte 86315 pps 92764307 Bps Util 41.22%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 109664547 pkt 117256403183 Byte 81287 pps 87346557 Bps Util 38.82%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 115001230 pkt 123065374817 Byte 84709 pps 92205898 Bps Util 40.98%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 120197917 pkt 128761293505 Byte 85191 pps 93375716 Bps Util 41.50%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 124790487 pkt 133783111501 Byte 79182 pps 86583068 Bps Util 38.48%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 129450091 pkt 138908431043 Byte 84720 pps 93187628 Bps Util 41.41%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 134048794 pkt 143940227806 Byte 82119 pps 89853513 Bps Util 39.93%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 138900130 pkt 149257983679 Byte 80855 pps 88629264 Bps Util 39.39%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 143665805 pkt 154447812210 Byte 79427 pps 86497142 Bps Util 38.44%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 148501587 pkt 159667032930 Byte 80596 pps 86987012 Bps Util 38.66%
>> xxx.xxxx-xxx.xxxx-BRONZE-1 153971586 pkt 165650360517 Byte 78142 pps 85476108 Bps Util 37.99%
>>
>> Next, I watched the output of "show mpls lsp name BLAH detail", looking
>> at the autobw measured amount (Max AvgBW) and the reserved bandwidth.
>> I'm using a stats interval of 60 seconds, an adjust-interval of 900
>> seconds, and in this instance no overflow samples occurred. After the
>> previous adjust-interval completes the measured bw is reset to 0, and
>> then starts updating again after the first 60 sec stats interval is up.
>> For around the first 700 seconds the Max AvgBW was pretty close to what
>> one would expect (around 900Mbps), then it jumped to ~1.6Gbps for no
>> reason that I can determine. The stats file for this LSP (above) never
>> showed anything above 1.0G, and a monitor of the lsp never showed any
>> sample that ever got anywhere near that high (let alone enough to make an
>> entire 60 sec sample interval report that high). At the end of the 900
>> seconds, the 1.6G value is what was signaled to RSVP, and the cycle
>> repeated itself. I watched it for several more cycles, and saw the same
>> behavior happening over and over again, with measured values of 1.8G
>> plus, while the stats file continued to show an average of around
>> 800-900Mbps and no sample that ever went above 1G.
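>> For reference, the relevant config here boils down to something like the
>> following (LSP and file names are placeholders, the intervals are the ones
>> described above):
>>
>> protocols {
>>     mpls {
>>         statistics {
>>             file mpls-stats;
>>             interval 60;
>>             auto-bandwidth;
>>         }
>>         label-switched-path <lsp-name> {
>>             auto-bandwidth {
>>                 adjust-interval 900;
>>             }
>>         }
>>     }
>> }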
>>
>
> I've seen something similar on 9.5R2, although I didn't pay it much
> heed at the time as I was investigating other issues. My guess (and
> it is definitely a guess) is that there is an internal data structure
> which stores the LSP usage and is divided by the sampling interval
> and written to the statistics file at the end of each sampling
> interval. If something (an rpd scheduling issue, CPU at 100%?) prevents
> that value from being written to the statistics file after the
> sampling interval, it gets a default value of 0. The data structure
> keeps the counters from the previous sampling interval and continues
> to accumulate on top of them. When the next sampling interval expires,
> the combined value is divided by a single sampling interval, leading to
> an average bps value roughly double what it should be.
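> Rough numbers to illustrate: at ~900 Mbps the LSP moves roughly 112 MB/s, so
> two back-to-back 60-second windows accumulate about 13.5 GB of counter.
> Divide that by a single 60-second interval and you get ~225 MBps, i.e.
> ~1.8 Gbps -- which is about the scale of the jump described above.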
>
> I'll keep an eye out and report back if I see this behaviour again.
>
>
>> This particular router is running 9.4R3, but I've seen similar behavior
>> on some other 9.5R4 routers as well. This really seems like some kind of
>> bug, but honestly I'd sooner slit my wrists with a rusty PIC than try to
>> explain the above to JTAC (besides, they would probably just ask me for
>> 50 irrelevant log files and then do nothing for the next 6 months like all
>> of my other cases :P). I'm wondering if this is some kind of known
>> issue, or if there is some reason why this config wouldn't work well.
>>
>> The stats interval of 60 seconds is because I snmp poll and graph the
>> mplsLspOctets every 60 seconds, and snmp is updated based on the stats
>> interval. Any value other than 60 secs makes the graphs jitter wildly.
>> But in the JUNOS documentation for auto-bandwidth, there is the
>> following warning:
>>
>> http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html
>>
>> Note: To prevent unnecessary resignaling of LSPs, it is best to
>> configure an MPLS automatic bandwidth statistics interval of no more
>> than one third the corresponding LSP adjustment interval. For example,
>> if you configure a value of 30 seconds for the interval statement at the
>> [edit protocols mpls statistics] hierarchy level, you should configure a
>> value of no more than 90 seconds for the adjust-interval statement at
>> the [edit protocols mpls label-switched-path label-switched-path-name
>> auto-bandwidth] hierarchy level.
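>> (Taken literally, that note would cap the adjust-interval at 3 x 60 = 180
>> seconds for my 60-second statistics interval -- the 900 seconds I actually
>> run is 15x the statistics interval, not 3x.)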
>>
>> I could never figure this one out, and personally I always thought it
>> was some kind of documentation error. What possible reason could there
>> be for not having an adjust-interval of more than 3x the statistics
>> value? I'm running 900 sec adjust-intervals with 300 sec overflow
>> detection (the lowest value you can configure) to try and reduce RSVP
>> resignaling load on the network. Every time an LSP resignals, it tears
>> down the bypass LSPs as well, and at one point (prior to 9.4 I think) it
>> took over 50 seconds before JUNOS would even try to start resignaling
>> the bypass LSPs. There were some optimizations made to make it kick off
>> the bypass LSP resignal within ~15 secs instead of ~50 secs, but we're
>> still trying to keep it from resignaling excessively.
>>
>
> I've never seen this advice before, but I've certainly seen networks
> operate fine with an adjust-interval much greater than 3x the statistics
> interval.
>
>> I'll gladly accept any clue anyone can offer on this one. :)
>>
>> --
>> Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
>> GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
>> _______________________________________________
>> juniper-nsp mailing list juniper-nsp at puck.nether.net
>> https://puck.nether.net/mailman/listinfo/juniper-nsp
>>
>