[c-nsp] Dynamic output buffer allocation on Cisco 4948

Fwissue fwissue at gmail.com
Fri Sep 27 00:44:28 EDT 2013


Would it be possible to create another VLAN on switch 1 and move a host to the new VLAN? That might help you isolate where the latency is.

Thanks

~mike

On Sep 26, 2013, at 9:06 PM, John Neiberger <jneiberger at gmail.com> wrote:

> We played around with HSRP to end up with a couple of different topologies to help eliminate potential issues. We were still seeing this issue when it was as simple as this:
> 
> [host A] ------ [switch 1] ----- [7600] ---- [switch 2] ---- [host B]
> 
> There is another 7600 that both switches are connected to, as well, and we toyed with redundancy to shift traffic around to different links and such, but nothing made the slightest bit of difference. We ultimately added another 1-gig link to the switch uplinks and made them a port channel, which seems to have gotten their latency down to something manageable. The next step would be to tweak the QoS to treat this as EF, as you suggested. I guess we'll see if things stay good or if they run into problems as they add more load in the future. 
> 
> This app isn't the only thing on these switches, but it accounts for the bulk of the load and it's the only thing we know of that was having problems. It's a pretty odd situation, and I haven't done a good job of explaining how everything is connected. I suck at text diagrams.  :)  But even Cisco HTTS was pretty stumped. They looked at it for quite a while and weren't able to nail down the cause. I'm sure it was the buffer issue, though. 
> 
> Thanks,
> John
> 
> 
> On Thu, Sep 26, 2013 at 9:55 PM, Blake Dunlap <ikiris at gmail.com> wrote:
>> It's hard to make any inferences about your voodoo one-way round-trip latency without more detail like diagrams, so I'll take a step back and ask: is this overly delay-sensitive app the main load on the switch, or just a rounding error as far as total traffic goes?
>> 
>> If it's the first, honestly I don't really know what you can do besides upgrading your uplinks to the next step up in speed, using more active channels/paths, lowering your oversubscription ratio with more hardware, or just giving up and choosing between delaying the microbursts or dropping them. If it's the second, have you tried setting up LLQ and treating your app as EF?
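For the LLQ/EF route, here's a rough sketch of what that could look like with Catalyst 4500/4948-style classic QoS. This is an assumption-laden illustration, not tested config: the uplink interface number is hypothetical, and you'd want to verify on your IOS release that EF-marked traffic actually maps to tx-queue 3 (the only queue that can be made strict-priority on this platform, as far as I recall) before relying on it.

```
! Sketch only -- verify syntax and the default DSCP-to-tx-queue
! mapping on your IOS release. Gi1/49 is a hypothetical uplink.
qos
!
interface GigabitEthernet1/49
 ! trust the EF marking coming from the app servers
 qos trust dscp
 ! make tx-queue 3 the strict-priority queue so EF traffic
 ! bypasses the deep shared output buffer
 tx-queue 3
   priority high
```

The trade-off is the usual LLQ one: EF traffic stops queueing behind bursts, but anything unmarked still eats the full shared-buffer delay.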
>> 
>> -Blake
>> 
>> 
>> On Thu, Sep 26, 2013 at 10:34 PM, John Neiberger <jneiberger at gmail.com> wrote:
>>> Host to host on the same VLAN was always far less than 1ms RTT. I never once saw it go over that. It was usually far less. We only saw the problem when going from a host in VLAN A to a host in VLAN B, never the other way around. I thought this was a problem on the host in VLAN B, but any other server in the same VLAN could ping it with no latency problems at all. 
>>> 
>>> 
>>> On Thu, Sep 26, 2013 at 9:12 PM, Fwissue <fwissue at gmail.com> wrote:
>>>> I would try host-to-host on the same VLAN first, then consider the impact of flow control.
>>>> 
>>>> Thanks
>>>> 
>>>> ~mike
>>>> 
>>>> On Sep 26, 2013, at 8:18 AM, John Neiberger <jneiberger at gmail.com> wrote:
>>>> 
>>>> > It was host to host, so it was really Host A to Host B and vice versa. The
>>>> > expected RTT was less than a millisecond, which is what they often got, but
>>>> > the latency would spike regularly up to as high as 24 ms. I initially
>>>> > thought it was a problem on one of the hosts but they can ping to and from
>>>> > devices on the same vlan with no variable latency. The latency only occurs
>>>> > in one direction when going from one vlan to the other. We manipulated the
>>>> > HSRP configs to shift traffic to different routers and switches but the
>>>> > behavior didn't change. From Host A to Host B we saw variable latency, but
>>>> > never ever did we see it if the ping originated from Host B even though,
>>>> > depending on the HSRP configuration, the packets were traversing exactly
>>>> > the same path. It has me completely stumped.
>>>> >
>>>> >
>>>> > On Thu, Sep 26, 2013 at 9:04 AM, Blake Dunlap <ikiris at gmail.com> wrote:
>>>> >
>>>> >> This may seem like a stupid question, but when you were pinging, were you
>>>> >> pinging from hosts, or from the routers?
>>>> >>
>>>> >> -Blake
>>>> >>
>>>> >>
>>>> >> On Thu, Sep 26, 2013 at 9:38 AM, John Neiberger <jneiberger at gmail.com>wrote:
>>>> >>
>>>> >>> Thanks! I talked to our Cisco NCE about this and he gave me these
>>>> >>> commands:
>>>> >>>
>>> show qos interface gigabitEthernet x/y - shows the four queues and
>>> whether QoS is enabled or disabled
>>>
>>> sh int gi x/y counters detail - shows packet counters incrementing in
>>> queues 1-4
>>>
>>> sh platform hardware interface gi x/y stat | in TxB
>>>> >>>
>>>> >>>
>>>> >>> I'm nearly certain that this big buffer issue is the answer to my high
>>>> >>> variable latency problem, but there is still one mystery about this that
>>>> >>> has me completely perplexed. The high variable latency was only occurring
>>>> >>> in one direction (from VLAN A to VLAN B) but not in the other (from VLAN B
>>> to VLAN A). What really makes that weird is that, because of some HSRP
>>> differences, we actually had a circular topology for a bit. The path was
>>>> >>> *exactly* the same no matter which direction you were pinging. The ICMP
>>>> >>> packets had to travel around the same circle through the same devices and
>>>> >>> interfaces. So if we have big buffers on congested interfaces that are
>>>> >>> introducing variable latency, why would we only see it in one direction?
>>>> >>>
>>>> >>>
>>>> >>> When VLAN A pings VLAN B, it is the initial ICMP packet that would have
>>>> >>> been delayed, while the response would come in on a different interface
>>>> >>> that wasn't congested. When VLAN B pings VLAN A, the initial ping would
>>>> >>> not
>>>> >>> hit congested interfaces but the ping reply would. The total round trip
>>>> >>> time should have been similar, but it never was. I'm completely stumped by
>>>> >>> that. I even had Cisco HTTS on this for a couple of days and they couldn't
>>>> >>> figure it out.
>>>> >>>
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> John
>>>> >>>
>>>> >>>
>>>> >>> On Thu, Sep 26, 2013 at 1:50 AM, Terebizh, Evgeny <eterebizh at amt.ru>
>>>> >>> wrote:
>>>> >>>
>>>> >>>> Try also
>>>> >>>> "show platform hardware interface gigabitEthernet 1/1 tx-queue".
>>>> >>>> I guess it's gonna show the actual values for queue utilisation.
>>>> >>>> Please let me know if this helps.
>>>> >>>>
>>>> >>>> /ET
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 9/24/13 11:17 PM, "John Neiberger" <jneiberger at gmail.com> wrote:
>>>> >>>>
>>>> >>>>> I've been helping to troubleshoot an interesting problem with variable
>>>> >>>>> latency through a 4948. I haven't run into this before. I have usually
>>>> >>>>> seen really low latency through 4948s, but this particular application
>>>> >>>>> requires consistent low latency, and they've been noticing that latency
>>>> >>>>> goes up on average as load goes up. It didn't seem to be a problem on
>>>> >>>>> their servers, but communication through busy interfaces seemed to
>>>> >>>>> dramatically increase the latency. They were used to <1 ms latency and
>>>> >>>>> it was bouncing up to 20+ ms at times. I'm starting to think this is
>>>> >>>>> due to the shared output buffer in the 4948 causing the output buffer
>>>> >>>>> on the uplink to dynamically get bigger.
>>>> >>>>>
>>>> >>>>> I've been trying to find more details on how the 4948 handles its
>>>> >>>>> shared output queue space, but I haven't been able to find anything.
>>>> >>>>> Do any of you know more about how this works and what commands I could
>>>> >>>>> use to troubleshoot? I can't find anything that might show how much
>>>> >>>>> buffer space a particular interface is using at any given time, or if
>>>> >>>>> it even makes sense to think of it that way. If I knew the size of the
>>>> >>>>> buffer at any given moment, I could calculate the expected latency and
>>>> >>>>> prove whether or not that was the problem.
>>>> >>>>>
>>>> >>>>> Thanks!
>>>> >>>>> John
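The buffer hypothesis in the original post is easy to sanity-check with arithmetic: a shared-pool output buffer that grows to a few megabytes on a 1 Gb/s uplink is enough to explain 20+ ms of queueing delay. A quick sketch (the backlog numbers are illustrative, not measured):

```python
def queueing_delay_ms(buffered_bytes: int, link_bps: int) -> float:
    """Worst-case drain time of an output queue: bytes * 8 / rate, in ms."""
    return buffered_bytes * 8 / link_bps * 1000

# A 3 MB backlog on a 1 Gb/s uplink drains in ~24 ms -- the same order
# of magnitude as the observed latency spikes.
print(queueing_delay_ms(3_000_000, 1_000_000_000))  # -> 24.0

# Doubling uplink capacity (the 2 x 1G port channel) halves the drain
# time for the same backlog.
print(queueing_delay_ms(3_000_000, 2_000_000_000))  # -> 12.0
```

One caveat on the port-channel fix: load balancing hashes each flow onto a single member link, so a given flow still sees one member's queue; the channel helps by spreading competing flows across members, not by making any one queue drain faster.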
>>>> >>>>> _______________________________________________
>>>> >>>>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>>>> >>>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>>> >>>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
> 


More information about the cisco-nsp mailing list