[c-nsp] Dynamic output buffer allocation on Cisco 4948
John Neiberger
jneiberger at gmail.com
Fri Sep 27 00:06:03 EDT 2013
We played around with HSRP to end up with a couple of different topologies
to help eliminate potential issues. We were still seeing this issue when it
was as simple as this:
[host A] ------ [switch 1] ----- [7600] ---- [switch 2] ---- [host B]
There is another 7600 that both switches are connected to, as well, and we
toyed with redundancy to shift traffic around to different links and such,
but nothing made the slightest bit of difference. We ultimately added
another 1-gig link to the switch uplinks and made them a port channel,
which seems to have gotten their latency down to something manageable. The
next step would be to tweak the QoS to treat this as EF, as you suggested.
I guess we'll see if things stay good or if they run into problems as they
add more load in the future.
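If we do end up marking this app EF, I'd expect the uplink config to look roughly like this. This is from memory of the classic 4500-style QoS, so the interface number and the DSCP-to-queue mapping are placeholders and the syntax should be double-checked against the IOS release on the 4948:

```
qos
qos map dscp 46 to tx-queue 3
!
interface GigabitEthernet1/49
 tx-queue 3
  priority high
```

If I remember right, only transmit queue 3 can be made strict-priority on these boxes, so the idea is just to steer EF-marked traffic into that queue so it doesn't sit behind the bulk traffic in the shared buffer.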
This app isn't the only thing on these switches, but it accounts for the
bulk of the load and it's the only thing we know of that was having
problems. It's a pretty odd situation, and I haven't done a good job of
explaining how everything is connected. I suck at text diagrams. :) But
even Cisco HTTS was pretty stumped. They looked at it for quite a while and
weren't able to nail down the cause. I'm sure it was the buffer issue,
though.
Thanks,
John
On Thu, Sep 26, 2013 at 9:55 PM, Blake Dunlap <ikiris at gmail.com> wrote:
> It's hard to make any inferences about your voodoo one-way round-trip
> latency without more detail like diagrams, so I'll take a step back and
> ask: is this overly delay-sensitive app the main load on the switch, or
> just a rounding error as far as total traffic?
>
> If it is the first option, honestly I don't really know what you can do
> besides upgrading your uplinks with the next step in speed, using more
> active channels/paths, lowering your oversubscription ratio with more
> hardware, or just giving up and choosing between delaying the microbursts
> or dropping them. If it is the second, then have you tried setting up LLQ
> and treating your app as EF?
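The "delaying the microbursts or dropping them" tradeoff above can be sketched with a toy FIFO model. The packet counts here are purely illustrative, not measured values from this network:

```python
def absorb_burst(burst_pkts: int, buffer_pkts: int) -> tuple[int, int]:
    """When a microburst hits a FIFO output buffer, whatever fits is
    queued (adding latency) and the rest is tail-dropped.
    Returns (queued, dropped)."""
    queued = min(burst_pkts, buffer_pkts)
    return queued, burst_pkts - queued

# Deep buffer: the whole burst survives, at the cost of queueing delay.
print(absorb_burst(100, 512))  # (100, 0)
# Shallow buffer: little added delay, but a third of the burst is dropped.
print(absorb_burst(100, 64))   # (64, 36)
```

Either way the congestion has to go somewhere; the buffer depth only chooses whether it shows up as latency or as loss.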
>
> -Blake
>
>
> On Thu, Sep 26, 2013 at 10:34 PM, John Neiberger <jneiberger at gmail.com> wrote:
>
>> Host to host on the same VLAN was always far less than 1ms RTT. I never
>> once saw it go over that. It was usually far less. We only saw the problem
>> when going from a host in VLAN A to a host in VLAN B, never the other way
>> around. I thought this was a problem on the host in VLAN B, but any other
>> server in the same VLAN could ping it with no latency problems at all.
>>
>>
>> On Thu, Sep 26, 2013 at 9:12 PM, Fwissue <fwissue at gmail.com> wrote:
>>
>>> I would try host to host on the same vlan, then consider flow-control
>>> impact
>>>
>>> Thanks
>>>
>>> ~mike
>>>
>>> On Sep 26, 2013, at 8:18 AM, John Neiberger <jneiberger at gmail.com>
>>> wrote:
>>>
>>> > It was host to host, so it was really Host A to Host B and vice
>>> > versa. The expected RTT was less than a millisecond, which is what
>>> > they often got, but the latency would spike regularly up to as high
>>> > as 24 ms. I initially thought it was a problem on one of the hosts,
>>> > but they can ping to and from devices on the same VLAN with no
>>> > variable latency. The latency only occurs in one direction when going
>>> > from one VLAN to the other. We manipulated the HSRP configs to shift
>>> > traffic to different routers and switches, but the behavior didn't
>>> > change. From Host A to Host B we saw variable latency, but never ever
>>> > did we see it if the ping originated from Host B, even though,
>>> > depending on the HSRP configuration, the packets were traversing
>>> > exactly the same path. It has me completely stumped.
>>> >
>>> >
>>> > On Thu, Sep 26, 2013 at 9:04 AM, Blake Dunlap <ikiris at gmail.com>
>>> wrote:
>>> >
>>> >> This may seem like a stupid question, but when you were pinging,
>>> >> were you pinging from hosts or from the routers?
>>> >>
>>> >> -Blake
>>> >>
>>> >>
>>> >> On Thu, Sep 26, 2013 at 9:38 AM, John Neiberger <jneiberger at gmail.com> wrote:
>>> >>
>>> >>> Thanks! I talked to our Cisco NCE about this and he gave me these
>>> >>> commands:
>>> >>>
>>> >>> show qos interface gigabitEthernet x/y - will show you 4 queues
>>> >>> and also whether QoS is disabled or not
>>> >>>
>>> >>> show int gi x/y counters detail - you will see packet counters in
>>> >>> queues 1-4 incrementing
>>> >>>
>>> >>> show platform hardware interface g x/y stat | in TxB
>>> >>>
>>> >>>
>>> >>> I'm nearly certain that this big buffer issue is the answer to my
>>> >>> high variable latency problem, but there is still one mystery about
>>> >>> this that has me completely perplexed. The high variable latency
>>> >>> was only occurring in one direction (from VLAN A to VLAN B) but not
>>> >>> in the other (from VLAN B to VLAN A). What really makes that weird
>>> >>> is that because of some HSRP differences, we really had a circular
>>> >>> topology for a bit. The path was *exactly* the same no matter which
>>> >>> direction you were pinging. The ICMP packets had to travel around
>>> >>> the same circle through the same devices and interfaces. So if we
>>> >>> have big buffers on congested interfaces that are introducing
>>> >>> variable latency, why would we only see it in one direction?
>>> >>>
>>> >>>
>>> >>> When VLAN A pings VLAN B, it is the initial ICMP packet that would
>>> >>> have been delayed, while the response would come in on a different
>>> >>> interface that wasn't congested. When VLAN B pings VLAN A, the
>>> >>> initial ping would not hit congested interfaces, but the ping reply
>>> >>> would. The total round-trip time should have been similar, but it
>>> >>> never was. I'm completely stumped by that. I even had Cisco HTTS on
>>> >>> this for a couple of days and they couldn't figure it out.
>>> >>>
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> John
>>> >>>
>>> >>>
>>> >>> On Thu, Sep 26, 2013 at 1:50 AM, Terebizh, Evgeny <eterebizh at amt.ru>
>>> >>> wrote:
>>> >>>
>>> >>>> Try also
>>> >>>> "show platform hardware interface gigabitEthernet 1/1 tx-queue".
>>> >>>> I guess it's gonna show the actual values for queue utilisation.
>>> >>>> Please let me know if this helps.
>>> >>>>
>>> >>>> /ET
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 9/24/13 11:17 PM, "John Neiberger" <jneiberger at gmail.com> wrote:
>>> >>>>
>>> >>>>> I've been helping to troubleshoot an interesting problem with
>>> >>>>> variable latency through a 4948. I haven't run into this before.
>>> >>>>> I usually have seen really low latency through 4948s, but this
>>> >>>>> particular application requires consistent low latency, and
>>> >>>>> they've been noticing that latency goes up on average as load
>>> >>>>> goes up. It didn't seem to be a problem on their servers, but
>>> >>>>> communication through busy interfaces seemed to dramatically
>>> >>>>> increase the latency. They were used to <1 ms latency and it was
>>> >>>>> bouncing up to 20+ ms at times. I'm starting to think this is due
>>> >>>>> to the shared output buffer in the 4948 causing the output buffer
>>> >>>>> on the uplink to dynamically get bigger.
>>> >>>>>
>>> >>>>> I've been trying to find more details on how the 4948 handles its
>>> >>>>> shared output queue space, but I haven't been able to find
>>> >>>>> anything. Do any of you know more about how this works and what
>>> >>>>> commands I could use to troubleshoot? I can't find anything that
>>> >>>>> might show how much buffer space a particular interface is using
>>> >>>>> at any given time, or if it even makes sense to think of it that
>>> >>>>> way. If I knew the size of the buffer at any given moment, I
>>> >>>>> could calculate the expected latency and prove whether or not
>>> >>>>> that was the problem.
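That expected-latency calculation is simple arithmetic: a backlog of B bytes queued ahead of a packet on a link running at R bit/s adds B * 8 / R seconds of delay. Working backwards, a ~24 ms spike on a 1 Gb/s uplink implies roughly 3 MB queued; the 3 MB figure below is just that back-calculation, not a measured value:

```python
def queueing_delay_ms(backlog_bytes: float, link_bps: float) -> float:
    """Added latency from draining a FIFO backlog: bits queued / line rate."""
    return backlog_bytes * 8 / link_bps * 1_000

# ~3 MB of shared buffer charged against a 1 Gb/s uplink
print(queueing_delay_ms(3_000_000, 1e9))  # 24.0
```

So if a show command ever reveals the per-port buffer occupancy, plugging it into this formula would confirm or rule out the buffer theory.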
>>> >>>>>
>>> >>>>> Thanks!
>>> >>>>> John
>>> >>>>> _______________________________________________
>>> >>>>> cisco-nsp mailing list cisco-nsp at puck.nether.net
>>> >>>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>> >>>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>>
>>
>>
>