[j-nsp] QoS when there is no congestion

Ross Halliday ross.halliday at wtccommunications.ca
Mon Nov 14 18:19:53 EST 2016


> My opinion on QoS for networks with low bandwidth is to always implement
> it. It's really not that difficult and you never know when microbursts
> could be affecting things. Believe me, even if your upstream link is a
> 1Gb/s circuit, and your outbound traffic is less than 10Mb/s, you can
> still have drops due to microbursts.
> 
> Voice and video, depending on your use of them, are normally very
> important and very sensitive to drops/delays. It will never cost you
> anything (besides time) to learn to implement some simple QoS policies,
> however, if you have customers who complain about bad voice/video quality,
> it will cost you your reputation and possibly revenue when said customers
> cancel their contracts because of a service you cannot reliably provide.
> 
> -evt

As a convert from the "bandwidth is always the answer" camp, I'd like to echo these sentiments. And apologize for the incoming wall of text.

Our network is not 'congested' - at least not in the sense that you'd pick up with a Cacti bandwidth graph. A better way to think of things is that we have many points of contention.

For us, it comes down to a matter of allocating buffer space and what you could call the "funnelling" of packets. If you have a device with only two interfaces at the same speed, there's nothing to think about. The challenge is when there are more interfaces. Consider a simple 3-interface situation, where traffic from two interfaces (A and B) is destined out a third (Z), all of them 1 Gbps. Say A and B each hit the router at 100 Mbps. It's easy to think that 200 Mbps should fit into 1 Gbps, but that isn't completely accurate. The packets are not neatly interleaved so that they "fit together" - in reality, some of them arrive at the same instant as each other. The overall bitrate measured over an entire second is comparatively low, but the packets themselves are arriving at 1 Gbps speeds.

If you want a car analogy, think of a multi-lane freeway: two cars in different lanes, heading the same direction at the same speed, the only two cars for a quarter mile. Just because the highway is rated for 80 cars in that space at that speed doesn't mean there won't be a crash if one suddenly changes lanes.
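To put rough numbers on it (a back-of-envelope illustration, not measurements from our gear): a 1500-byte frame takes about 12 microseconds to serialize at 1 Gbps, so any frames from A and B that land inside the same 12 us window are contending for Z.

    1500 bytes * 8 = 12,000 bits; 12,000 bits / 1 Gbps = 12 us per frame
    Suppose A and B each send a back-to-back burst of 100 frames (1.2 ms):
    Z receives at 2 Gbps but drains at 1 Gbps, so its queue grows at ~1 Gbps
    for 1.2 ms - about 100 frames, or ~150 KB, that must sit in the buffer.

Averaged over a full second, that burst barely registers on a graph, yet interface Z needs roughly 150 KB of queue for those 1.2 ms or it starts dropping.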

Through buffering, the router is able to take this data and cram it into interface Z, but it needs a packet queue large enough to absorb the packets that arrive at the same time.
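On Junos, that queue space is carved up per forwarding class with schedulers. A minimal sketch (the names and percentages here are invented for illustration, not our production config):

    class-of-service {
        schedulers {
            VOICE-SCHED {
                buffer-size percent 15;
                priority strict-high;
            }
            BE-SCHED {
                transmit-rate remainder;
                buffer-size remainder;
                priority low;
            }
        }
    }

The buffer-size statement is the knob that matters for microbursts: it determines how much of the interface's queue each class gets for absorbing simultaneous arrivals.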

This is called a "microburst", and WILL cause packet delay and reordering if the buffer isn't large enough. Anyone operating an IP SAN should be familiar with this concept. This is a big issue with switches used for iSCSI, such as the Cisco 3750s we started out with (despite common notions, QoS actually has to be enabled on those, as the default QoS-disabled buffer allocation is insufficient to deal with microbursts).

If you want a really blatant example of this in action, you need look no further than Cisco's 2960, whose default buffer allocations are so small that it has problems sending data out a 100 Mbps port when that data arrives on a 1 Gbps port (CSCte64832, http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst2960/software/release/12-2_53_se/release/notes/OL22233.html)


Back in our early days of MPLS, our network was almost entirely Cisco 6500s. We had a number of these boxes as P/PE routers that just used the SFP ports built into the supervisors for optics. Most of our small sites were fine, but as Internet traffic levels grew, we noticed problems with our TV service (roughly 400 Mbps of traffic at the time). The issues looked like packet loss and were loosely correlated with peak Internet usage - it was very easy to see on the graphs in the evening, but issues were observed during the day as well. We did not understand why a gigabit link carrying only 600-800 Mbps would behave this way. We eventually traced it to the shallow buffers on those supervisor ports; once we moved everything onto interfaces with real buffers, the problems disappeared. (We had not yet figured out DSCP to MPLS EXP mapping.)
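For what it's worth, the piece we were missing turns out to be simple on Junos: an EXP rewrite rule that carries your classification into the MPLS header. A rough sketch (forwarding-class names and code points are illustrative only):

    class-of-service {
        rewrite-rules {
            exp DSCP-TO-EXP {
                forwarding-class voice {
                    loss-priority low code-point 101;
                }
                forwarding-class best-effort {
                    loss-priority low code-point 000;
                }
            }
        }
        interfaces {
            ge-0/0/0 {
                unit 0 {
                    rewrite-rules {
                        exp DSCP-TO-EXP;
                    }
                }
            }
        }
    }

This way the P routers, which only see the MPLS header, can still schedule labeled traffic into the right queues.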

This highlighted for us that we cannot know or control what will happen with traffic to or from our customers on the Internet side. Using the SUP720s' onboard ports was a bad decision for a variety of reasons, but what would happen if a real DoS hit our equipment? Since our network handles voice and video as well, it is extremely important for us to protect our sensitive services, and the only way you can guarantee that is by allocating sufficient buffer space to the things you deem important. I think a couple of our MXes are still running default queues, but all marking is enforced at ingress.
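Enforcing marking at ingress is likewise just a classifier bound to the customer-facing units. Another bare-bones sketch with invented names:

    class-of-service {
        forwarding-classes {
            queue 0 best-effort;
            queue 5 voice;
        }
        classifiers {
            dscp EDGE-IN {
                forwarding-class voice {
                    loss-priority low code-points ef;
                }
                forwarding-class best-effort {
                    loss-priority low code-points [ be cs1 ];
                }
            }
        }
        interfaces {
            ge-0/0/1 {
                unit 0 {
                    classifiers {
                        dscp EDGE-IN;
                    }
                }
            }
        }
    }

If you don't trust what customers send, a firewall filter with "then forwarding-class" and "then dscp" actions at ingress can overwrite their markings instead.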

Of course, since we implement QoS throughout the network, everything scales out very well, down to the access equipment. It is a lot of work, but it would be nearly impossible to reliably operate a converged network without the ability to tell traffic apart and prioritize the important stuff.
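Concretely, the scaling-out is mostly a matter of applying a scheduler-map on every hop so the same forwarding classes get serviced consistently. Sketch, with the same caveats as above:

    class-of-service {
        scheduler-maps {
            CORE-MAP {
                forwarding-class voice scheduler VOICE-SCHED;
                forwarding-class best-effort scheduler BE-SCHED;
            }
        }
        interfaces {
            xe-0/1/0 {
                scheduler-map CORE-MAP;
            }
        }
    }

Once classification, rewrite, and scheduling agree end to end, every device can tell the traffic apart, and the buffer allocation at each hop actually means something.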

My two cents.

Cheers
Ross
