[j-nsp] Auto-bandwidth Accuracy

Sun May 23 02:52:19 EDT 2010

Recently I've been noticing some really odd auto-bandwidth behavior on
several different routers, and I'm wondering if anybody knows if this is
a known bug or if I'm doing something really wrong in my autobw config.

Specifically, I'm seeing many cases where the rsvp reservations on an 
interface are vastly higher than the actual traffic going over it. I 
started comparing the autobw measured bandwidth value vs rsvp resv bandwidth 
across my LSPs (with an op script :P), and noticed that a large number 
of LSPs that were ingress on Juniper routers were consistently reserving 
more bandwidth than they were actually passing.

To troubleshoot this further, I picked one LSP at random and followed it 
through the course of an entire adjust-interval. I also watched it in 
"monitor label-switched-path", and followed the bandwidth recorded for 
it in the mpls stats file. The mpls stats file pretty consistently 
recorded a bandwidth of around 900Mbps. Some samples were up to 1G, some 
were down in the 800Mb's, but nothing was significantly outside this 
range:

xxx.xxxx-xxx.xxxx-BRONZE-1     20442770 pkt    21800398308 Byte  91864 pps 97826023 Bps Util 43.47%
xxx.xxxx-xxx.xxxx-BRONZE-1     25748678 pkt    27500224526 Byte  89930 pps 96607224 Bps Util 42.93%
xxx.xxxx-xxx.xxxx-BRONZE-1     31309754 pkt    33516047564 Byte  95880 pps 103721086 Bps Util 46.09%
xxx.xxxx-xxx.xxxx-BRONZE-1     36934965 pkt    39389728013 Byte  90729 pps 94736781 Bps Util 42.10%
xxx.xxxx-xxx.xxxx-BRONZE-1     41323164 pkt    44001156442 Byte  86043 pps 90420165 Bps Util 40.18%
xxx.xxxx-xxx.xxxx-BRONZE-1     46229207 pkt    49166295068 Byte  84586 pps 89054114 Bps Util 39.58%
xxx.xxxx-xxx.xxxx-BRONZE-1     51764861 pkt    55023074603 Byte  92260 pps 97612992 Bps Util 43.38%
xxx.xxxx-xxx.xxxx-BRONZE-1     57091315 pkt    60691783494 Byte  90278 pps 96079811 Bps Util 42.70%
xxx.xxxx-xxx.xxxx-BRONZE-1     62138489 pkt    66009079194 Byte  90128 pps 94951708 Bps Util 42.20%
xxx.xxxx-xxx.xxxx-BRONZE-1     67697838 pkt    72030553645 Byte  92655 pps 100357907 Bps Util 44.60%
xxx.xxxx-xxx.xxxx-BRONZE-1     73083250 pkt    77870203449 Byte  89756 pps 97327496 Bps Util 43.25%
xxx.xxxx-xxx.xxxx-BRONZE-1     78530642 pkt    83799427998 Byte  90789 pps 98820409 Bps Util 43.91%
xxx.xxxx-xxx.xxxx-BRONZE-1     84166327 pkt    89767404007 Byte  85389 pps 90423878 Bps Util 40.18%
xxx.xxxx-xxx.xxxx-BRONZE-1     89990750 pkt    96052103366 Byte  85653 pps 92422049 Bps Util 41.07%
xxx.xxxx-xxx.xxxx-BRONZE-1     94808838 pkt   101299936674 Byte  87601 pps 95415151 Bps Util 42.40%
xxx.xxxx-xxx.xxxx-BRONZE-1    100044983 pkt   106918990604 Byte  83113 pps 89191332 Bps Util 39.64%
xxx.xxxx-xxx.xxxx-BRONZE-1    104706036 pkt   111928263183 Byte  86315 pps 92764307 Bps Util 41.22%
xxx.xxxx-xxx.xxxx-BRONZE-1    109664547 pkt   117256403183 Byte  81287 pps 87346557 Bps Util 38.82%
xxx.xxxx-xxx.xxxx-BRONZE-1    115001230 pkt   123065374817 Byte  84709 pps 92205898 Bps Util 40.98%
xxx.xxxx-xxx.xxxx-BRONZE-1    120197917 pkt   128761293505 Byte  85191 pps 93375716 Bps Util 41.50%
xxx.xxxx-xxx.xxxx-BRONZE-1    124790487 pkt   133783111501 Byte  79182 pps 86583068 Bps Util 38.48%
xxx.xxxx-xxx.xxxx-BRONZE-1    129450091 pkt   138908431043 Byte  84720 pps 93187628 Bps Util 41.41%
xxx.xxxx-xxx.xxxx-BRONZE-1    134048794 pkt   143940227806 Byte  82119 pps 89853513 Bps Util 39.93%
xxx.xxxx-xxx.xxxx-BRONZE-1    138900130 pkt   149257983679 Byte  80855 pps 88629264 Bps Util 39.39%
xxx.xxxx-xxx.xxxx-BRONZE-1    143665805 pkt   154447812210 Byte  79427 pps 86497142 Bps Util 38.44%
xxx.xxxx-xxx.xxxx-BRONZE-1    148501587 pkt   159667032930 Byte  80596 pps 86987012 Bps Util 38.66%
xxx.xxxx-xxx.xxxx-BRONZE-1    153971586 pkt   165650360517 Byte  78142 pps 85476108 Bps Util 37.99%
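
For anyone who wants to repeat the comparison, here is a minimal sketch of
parsing lines like the ones above and converting the rate column to Mbps.
It assumes the "Bps" column is bytes per second (the Byte counter deltas
between consecutive lines are consistent with that at a 60 sec interval).
The regex and the function name parse_stats_line are illustrative, not
anything shipped with JUNOS.

```python
import re

# Pattern for one line of the stats excerpt above (layout assumed from the sample).
LINE_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<pkts>\d+) pkt\s+(?P<bytes>\d+) Byte\s+"
    r"(?P<pps>\d+) pps\s+(?P<rate>\d+) Bps\s+Util\s+(?P<util>[\d.]+)%$"
)

def parse_stats_line(line):
    """Parse one stats line; return None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rate_bytes_per_sec = int(m.group("rate"))  # assumption: Bps = bytes/sec
    return {
        "name": m.group("name"),
        "total_bytes": int(m.group("bytes")),
        "mbps": rate_bytes_per_sec * 8 / 1e6,   # bytes/sec -> megabits/sec
        "util_pct": float(m.group("util")),
    }

sample = ("xxx.xxxx-xxx.xxxx-BRONZE-1     20442770 pkt    "
          "21800398308 Byte  91864 pps 97826023 Bps Util 43.47%")
record = parse_stats_line(sample)
print(record["name"], round(record["mbps"], 1), "Mbps")
```

Running every line of the stats file through this and taking the maximum
gives a ceiling to compare against whatever autobw ends up signaling.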

Next, I watched the output of "show mpls lsp name BLAH detail", looking
at the autobw measured amount (Max AvgBW) and the reserved bandwidth. 
I'm using a stats interval of 60 seconds, an adjust-interval of 900
seconds, and in this instance no overflow samples occurred. After the
previous adjust-interval completes the measured bw is reset to 0, and
then starts updating again after the first 60 sec stats interval is up. 
For around the first 700 seconds the Max AvgBW was pretty close to what
one would expect (around 900Mbps), then it jumped to ~1.6Gbps for no
reason that I can determine. The stats file for this LSP (above) never
showed anything above 1.0G, and a monitor of the lsp never showed any
sample that ever got anywhere near that high (let alone enough to make an
entire 60 sec sample interval report that high). At the end of the 900
seconds, the 1.6G value is what was signaled to RSVP, and the cycle
repeated itself. I watched it for several more cycles, and saw the same
behavior happening over and over again, with measured values of 1.8G
plus, while the stats file continued to show an average of around 
800-900Mbps and no sample that ever went above 1G.
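
The expectation described above can be written down as a tiny model: within
one adjust-interval, Max AvgBW should be the largest of the per-stats-interval
averages. This is an assumed simplification for illustration, not the actual
JUNOS algorithm. The rates are the first 15 "Bps" values from the excerpt
above (15 = 900 / 60), again assuming Bps means bytes per second; the name
expected_max_avg_bw is made up for this sketch.

```python
STATS_INTERVAL_S = 60
ADJUST_INTERVAL_S = 900

# First 15 per-interval rates from the stats excerpt (assumed bytes/sec).
rates_bytes_per_sec = [
    97826023, 96607224, 103721086, 94736781, 90420165,
    89054114, 97612992, 96079811, 94951708, 100357907,
    97327496, 98820409, 90423878, 92422049, 95415151,
]

def expected_max_avg_bw(samples_bytes_per_sec):
    """Largest per-interval average in bits/sec (simplified model)."""
    return max((s * 8 for s in samples_bytes_per_sec), default=0)

samples_per_adjust = ADJUST_INTERVAL_S // STATS_INTERVAL_S
window = rates_bytes_per_sec[:samples_per_adjust]
max_bps = expected_max_avg_bw(window)
print(f"{samples_per_adjust} samples, expected Max AvgBW ~{max_bps / 1e6:.0f} Mbps")
```

Under this model no window built from the recorded samples can produce a
value above the largest single sample, which is why a reported 1.6G looks
wrong.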

This particular router is running 9.4R3, but I've seen similar behavior
on some other 9.5R4 routers as well. This really seems like some kind of
bug, but honestly I'd sooner slit my wrists with a rusty PIC than try to
explain the above to JTAC (besides, they would probably just ask me for
50 irrelevant log files then do nothing for the next 6 months like all
of my other cases :P). I'm wondering if this is some kind of known
issue, or if there is some reason why this config wouldn't work well.

The stats interval of 60 seconds is because I snmp poll and graph the
mplsLspOctets every 60 seconds, and snmp is updated based on the stats
interval. Any value other than 60 secs makes the graphs wildly jitter. 
But in the JUNOS documentation for auto-bandwidth, there is the 
following warning:

http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html

Note: To prevent unnecessary resignaling of LSPs, it is best to 
configure an MPLS automatic bandwidth statistics interval of no more 
than one third the corresponding LSP adjustment interval. For example, 
if you configure a value of 30 seconds for the interval statement at the 
[edit protocols mpls statistics] hierarchy level, you should configure a 
value of no more than 90 seconds for the adjust-interval statement at 
the [edit protocols mpls label-switched-path label-switched-path-name 
auto-bandwidth] hierarchy level.

I could never figure this one out, and personally I always thought it
was some kind of documentation error. What possible reason could there
be for not having an adjust-interval of more than 3x the statistics
value? I'm running 900 sec adjust-intervals with 300 sec overflow
detection (the lowest value you can configure) to try and reduce RSVP
resignaling load on the network. Every time an LSP resignals, it tears
down the bypass LSPs as well, and at one point (prior to 9.4 I think) it
took over 50 seconds before JUNOS would even try to start resignaling
the bypass LSPs. There were some optimizations made to make it kick off 
the bypass LSP resignal within ~15 secs instead of ~50 secs, but we're 
still trying to keep it from resignaling excessively.
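
As a sanity check on the quoted note, here is a minimal sketch of its first
sentence read literally: the statistics interval should be no more than one
third of the adjust-interval. The function name stats_interval_ok is
hypothetical, and this only encodes the documented guideline, not any check
JUNOS itself performs.

```python
def stats_interval_ok(stats_interval_s, adjust_interval_s):
    """True if the stats interval is at most one third of the adjust-interval."""
    return stats_interval_s * 3 <= adjust_interval_s

# The config described in this post: 60 sec stats, 900 sec adjust-interval.
print(stats_interval_ok(60, 900))

# The note's own example values: 30 sec stats with a 90 sec adjust-interval
# sits exactly on the boundary of the first sentence.
print(stats_interval_ok(30, 90))
```

Read this way, a longer adjust-interval only makes the ratio more
comfortable, so the "no more than 90 seconds" wording in the note's example
appears to contradict its first sentence.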

I'll gladly accept any clue anyone can offer on this one. :)

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)