[j-nsp] Auto-bandwidth Accuracy
Richard A Steenbergen
ras at e-gerbil.net
Sun May 23 02:52:19 EDT 2010
Recently I've been noticing some really odd auto-bandwidth behavior on
several different routers, and I'm wondering if anybody knows if this is
a known bug or if I'm doing something really wrong in my autobw config.
Specifically, I'm seeing many cases where the rsvp reservations on an
interface are vastly higher than the actual traffic going over it. I
started comparing the autobw measured bandwidth value vs the rsvp resv bandwidth
across my LSPs (with an op script :P), and noticed that a large number
of LSPs that were ingress on Juniper routers were consistently reserving
more bandwidth than they were actually passing.
To troubleshoot this further, I picked one LSP at random and followed it
through the course of an entire adjust-interval. I also watched it in
"monitor label-switched-path", and followed the bandwidth recorded for
it in the mpls stats file. The mpls stats file pretty consistently
recorded a bandwidth of around 900Mbps. Some samples were up to 1G, some
were down in the 800Mb's, but nothing was significantly outside this
range:
xxx.xxxx-xxx.xxxx-BRONZE-1 20442770 pkt 21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
xxx.xxxx-xxx.xxxx-BRONZE-1 25748678 pkt 27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
xxx.xxxx-xxx.xxxx-BRONZE-1 31309754 pkt 33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
xxx.xxxx-xxx.xxxx-BRONZE-1 36934965 pkt 39389728013 Byte 90729 pps 94736781 Bps Util 42.10%
xxx.xxxx-xxx.xxxx-BRONZE-1 41323164 pkt 44001156442 Byte 86043 pps 90420165 Bps Util 40.18%
xxx.xxxx-xxx.xxxx-BRONZE-1 46229207 pkt 49166295068 Byte 84586 pps 89054114 Bps Util 39.58%
xxx.xxxx-xxx.xxxx-BRONZE-1 51764861 pkt 55023074603 Byte 92260 pps 97612992 Bps Util 43.38%
xxx.xxxx-xxx.xxxx-BRONZE-1 57091315 pkt 60691783494 Byte 90278 pps 96079811 Bps Util 42.70%
xxx.xxxx-xxx.xxxx-BRONZE-1 62138489 pkt 66009079194 Byte 90128 pps 94951708 Bps Util 42.20%
xxx.xxxx-xxx.xxxx-BRONZE-1 67697838 pkt 72030553645 Byte 92655 pps 100357907 Bps Util 44.60%
xxx.xxxx-xxx.xxxx-BRONZE-1 73083250 pkt 77870203449 Byte 89756 pps 97327496 Bps Util 43.25%
xxx.xxxx-xxx.xxxx-BRONZE-1 78530642 pkt 83799427998 Byte 90789 pps 98820409 Bps Util 43.91%
xxx.xxxx-xxx.xxxx-BRONZE-1 84166327 pkt 89767404007 Byte 85389 pps 90423878 Bps Util 40.18%
xxx.xxxx-xxx.xxxx-BRONZE-1 89990750 pkt 96052103366 Byte 85653 pps 92422049 Bps Util 41.07%
xxx.xxxx-xxx.xxxx-BRONZE-1 94808838 pkt 101299936674 Byte 87601 pps 95415151 Bps Util 42.40%
xxx.xxxx-xxx.xxxx-BRONZE-1 100044983 pkt 106918990604 Byte 83113 pps 89191332 Bps Util 39.64%
xxx.xxxx-xxx.xxxx-BRONZE-1 104706036 pkt 111928263183 Byte 86315 pps 92764307 Bps Util 41.22%
xxx.xxxx-xxx.xxxx-BRONZE-1 109664547 pkt 117256403183 Byte 81287 pps 87346557 Bps Util 38.82%
xxx.xxxx-xxx.xxxx-BRONZE-1 115001230 pkt 123065374817 Byte 84709 pps 92205898 Bps Util 40.98%
xxx.xxxx-xxx.xxxx-BRONZE-1 120197917 pkt 128761293505 Byte 85191 pps 93375716 Bps Util 41.50%
xxx.xxxx-xxx.xxxx-BRONZE-1 124790487 pkt 133783111501 Byte 79182 pps 86583068 Bps Util 38.48%
xxx.xxxx-xxx.xxxx-BRONZE-1 129450091 pkt 138908431043 Byte 84720 pps 93187628 Bps Util 41.41%
xxx.xxxx-xxx.xxxx-BRONZE-1 134048794 pkt 143940227806 Byte 82119 pps 89853513 Bps Util 39.93%
xxx.xxxx-xxx.xxxx-BRONZE-1 138900130 pkt 149257983679 Byte 80855 pps 88629264 Bps Util 39.39%
xxx.xxxx-xxx.xxxx-BRONZE-1 143665805 pkt 154447812210 Byte 79427 pps 86497142 Bps Util 38.44%
xxx.xxxx-xxx.xxxx-BRONZE-1 148501587 pkt 159667032930 Byte 80596 pps 86987012 Bps Util 38.66%
xxx.xxxx-xxx.xxxx-BRONZE-1 153971586 pkt 165650360517 Byte 78142 pps 85476108 Bps Util 37.99%
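For anyone wanting to sanity-check numbers like these, here is a minimal Python sketch that parses stats lines in the layout shown above and converts the rate column to Mbps. It assumes the "Bps" column is bytes per second (the byte counter deltas between consecutive lines look consistent with that at a ~60 sec interval). The regex, function names, and the three-line SAMPLE are mine for illustration, not anything JUNOS provides.

```python
import re

# Three lines copied from the stats output above, used as sample input.
SAMPLE = """\
xxx.xxxx-xxx.xxxx-BRONZE-1 20442770 pkt 21800398308 Byte 91864 pps 97826023 Bps Util 43.47%
xxx.xxxx-xxx.xxxx-BRONZE-1 25748678 pkt 27500224526 Byte 89930 pps 96607224 Bps Util 42.93%
xxx.xxxx-xxx.xxxx-BRONZE-1 31309754 pkt 33516047564 Byte 95880 pps 103721086 Bps Util 46.09%
"""

# Assumed line layout: name, packet counter, byte counter, pps, Bps, Util%.
LINE_RE = re.compile(
    r"^(?P<lsp>\S+)\s+(?P<pkts>\d+) pkt\s+(?P<bytes>\d+) Byte\s+"
    r"(?P<pps>\d+) pps\s+(?P<Bps>\d+) Bps\s+Util\s+(?P<util>[\d.]+)%"
)


def parse_stats(text):
    """Return one dict per matching line, with the rate converted to Mbps."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            samples.append({
                "lsp": m["lsp"],
                "bytes": int(m["bytes"]),
                # Assumption: Bps is bytes/sec, so multiply by 8 for bits.
                "mbps": int(m["Bps"]) * 8 / 1e6,
            })
    return samples


def summarize(samples):
    """Return (min, mean, max) of the per-sample rates in Mbps."""
    rates = [s["mbps"] for s in samples]
    return min(rates), sum(rates) / len(rates), max(rates)


if __name__ == "__main__":
    lo, avg, hi = summarize(parse_stats(SAMPLE))
    print(f"min {lo:.1f} Mbps, mean {avg:.1f} Mbps, max {hi:.1f} Mbps")
```

Under that bytes-per-second reading, the first sample works out to roughly 783 Mbps, and the largest of the three sample lines to roughly 830 Mbps.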
Next, I watched the output of "show mpls lsp name BLAH detail", looking
at the autobw measured amount (Max AvgBW) and the reserved bandwidth.
I'm using a stats interval of 60 seconds, an adjust-interval of 900
seconds, and in this instance no overflow samples occurred. After the
previous adjust-interval completes the measured bw is reset to 0, and
then starts updating again after the first 60 sec stats interval is up.
For around the first 700 seconds the Max AvgBW was pretty close to what
one would expect (around 900Mbps), then it jumped to ~1.6Gbps for no
reason that I can determine. The stats file for this LSP (above) never
showed anything above 1.0G, and a monitor of the lsp never showed any
sample that ever got anywhere near that high (let alone enough to make an
entire 60 sec sample interval report that high). At the end of the 900
seconds, the 1.6G value is what was signaled to RSVP, and the cycle
repeated itself. I watched it for several more cycles, and saw the same
behavior happening over and over again, with measured values of 1.8G
plus, while the stats file continued to show an average of around
800-900Mbps and no sample that ever went above 1G.
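To make the mismatch concrete, here is a rough Python sketch of what I would expect the measured value to be. It assumes Max AvgBW is simply the largest per-stats-interval average seen during one adjust-interval (my reading, not a confirmed description of the JUNOS implementation), and that "Bps" in the stats file is bytes per second. The function names and the 10% tolerance are hypothetical, and the stats file samples may not line up exactly with the adjust-interval boundaries.

```python
# First 15 "Bps" values from the stats file above: one 900 sec
# adjust-interval's worth of 60 sec samples, if they happen to align.
BPS_SAMPLES = [
    97826023, 96607224, 103721086, 94736781, 90420165,
    89054114, 97612992, 96079811, 94951708, 100357907,
    97327496, 98820409, 90423878, 92422049, 95415151,
]


def expected_max_avg_bw(rates_mbps, stats_interval=60, adjust_interval=900):
    """Largest sample in each adjust-interval window, in Mbps.

    Assumption: the measured value is the max of the per-interval averages.
    """
    per_window = adjust_interval // stats_interval
    return [
        max(rates_mbps[i:i + per_window])
        for i in range(0, len(rates_mbps), per_window)
    ]


def looks_inflated(signaled_mbps, expected_mbps, tolerance=1.10):
    """True if the signaled value exceeds the expectation by more than 10%."""
    return signaled_mbps > expected_mbps * tolerance


if __name__ == "__main__":
    rates = [bps * 8 / 1e6 for bps in BPS_SAMPLES]  # bytes/sec -> Mbps
    for expected in expected_max_avg_bw(rates):
        print(f"expected ~{expected:.0f} Mbps, "
              f"inflated vs 1600 Mbps signaled: {looks_inflated(1600, expected)}")
```

With these samples the expected value for the window comes out well under 1G, so a signaled 1.6G is flagged as inflated under any reasonable tolerance.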
This particular router is running 9.4R3, but I've seen similar behavior
on some other 9.5R4 routers as well. This really seems like some kind of
bug, but honestly I'd sooner slit my wrists with a rusty PIC than try to
explain the above to JTAC (besides, they would probably just ask me for
50 irrelevant log files then do nothing for the next 6 months like all
of my other cases :P). I'm wondering if this is some kind of known
issue, or if there is some reason why this config wouldn't work well.
The stats interval of 60 seconds is because I snmp poll and graph the
mplsLspOctets every 60 seconds, and snmp is updated based on the stats
interval. Any value other than 60 secs makes the graphs wildly jitter.
But in the JUNOS documentation for auto-bandwidth, there is the
following warning:
http://www.juniper.net/techpubs/en_US/junos9.5/information-products/topic-collections/config-guide-mpls-applications/mpls-configuring-automatic-bandwidth-allocation-for-lsps.html
Note: To prevent unnecessary resignaling of LSPs, it is best to
configure an MPLS automatic bandwidth statistics interval of no more
than one third the corresponding LSP adjustment interval. For example,
if you configure a value of 30 seconds for the interval statement at the
[edit protocols mpls statistics] hierarchy level, you should configure a
value of no more than 90 seconds for the adjust-interval statement at
the [edit protocols mpls label-switched-path label-switched-path-name
auto-bandwidth] hierarchy level.
I could never figure this one out, and personally I always thought it
was some kind of documentation error. What possible reason could there
be for not having an adjust-interval of more than 3x the statistics
value? I'm running 900 sec adjust-intervals with 300 sec overflow
detection (the lowest value you can configure) to try and reduce RSVP
resignaling load on the network. Every time an LSP resignals, it tears
down the bypass LSPs as well, and at one point (prior to 9.4 I think) it
took over 50 seconds before JUNOS would even try to start resignaling
the bypass LSPs. There were some optimizations made to make it kick off
the bypass LSP resignal within ~15 secs instead of ~50 secs, but we're
still trying to keep it from resignaling excessively.
I'll gladly accept any clue anyone can offer on this one. :)
--
Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)