[cisco-voip] WAN Delays > 80ms for CUCM cluster?

Tue Nov 6 16:57:08 EST 2018

Ryan,
1) Yes. I see "more" latent connectivity to 1 of the remote (to the
publisher) subscribers, but I can't tell if that is just when I'm looking
at it, or if it is a problem child. It's odd because this one node only
runs TFTP and doesn't even have ccm running on it... but I guess that might
not matter (for some reason I don't understand)?
2) This I have not thought of but will raise it in my next conversation
with them. They run the NTP nodes that the whole company uses, so this is a
good idea.

Wes,
I believe you are correct with the SDL not being prioritized correctly.
I've sent the relevant section of the SRND to the network team on how to
properly implement the QoS, but "it's a long doc, prove there is an issue
first." Grrrr!  I can run an xping from my CUBE on the same VLAN and it
gives me nearly identical latency results as the dbreplication command
gives me. I know that ICMP is not being QoS'd the same way that our SDL
traffic SHOULD BE (but likely isn't), so it doesn't surprise me that SDL
and ICMP latency look nearly identical.
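
One way to test the "ICMP is not queued like SDL" theory is to time the
same probe with and without a DSCP marking. The sketch below times a TCP
handshake with the socket's TOS byte set. Everything specific in it is an
assumption: CS3 (DSCP 24) is only a guess at the marking to test (check
the SRND section for the real value), the host and port are placeholders
for any TCP port the remote node answers on, and it has to run from an
ordinary Linux host on the same VLAN, not from the CUCM CLI. Only the
outbound SYN carries the marking; the reply is marked by the far end.

```python
import socket
import time

# Assumption: CS3 (DSCP 24) is the marking to test; confirm against the SRND.
ASSUMED_DSCP = 24


def dscp_to_tos(dscp: int) -> int:
    """Return the IP TOS byte for a 6-bit DSCP value (DSCP is the top 6 bits)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def tcp_connect_rtt_ms(host: str, port: int, dscp: int, timeout: float = 2.0) -> float:
    """Time one TCP handshake in ms, with the TOS byte set from dscp."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Some platforms ignore IP_TOS or let the network re-mark it.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.connect((host, port))
        return (time.perf_counter() - start) * 1000.0


def compare_markings(host: str, port: int, count: int = 20) -> dict:
    """Interleave best-effort (DSCP 0) and marked probes so both see the same load."""
    results = {0: [], ASSUMED_DSCP: []}
    for _ in range(count):
        for dscp in results:
            results[dscp].append(tcp_connect_rtt_ms(host, port, dscp))
    return results
```

If the marked and unmarked samples look the same during a congested
period, that is at least consistent with the marking not being honored
along the path. A packet capture at the far end would still be needed to
see whether the DSCP value survives.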

Thanks!

On Tue, Nov 6, 2018 at 3:37 PM Ryan Huff <ryanhuff at outlook.com> wrote:

> Nick,
>
> Having network roots, I imagine you’ve tried / evaluated all of this
> already, but still worth mentioning.
>
> 1.) From the latent node, traceroute to all the other cluster nodes (since
> dbrep is more of a mesh nowadays). Is it taking the path you expect and/or
> the most optimal if more than one path exists?
>
> 2.) High NTP distance to a reference clock can also cause really weird
> behavior in CCM, as it correlates to dbreplication.
>
> Sent from my iPhone
>
> On Nov 6, 2018, at 15:54, Wes Sisk (wsisk) <wsisk at cisco.com> wrote:
>
> Nick,
>
> The features you describe are propagated by both SDL signaling and with a
> dependence on database replication.
>
> At casual observation it sounds like database traffic between nodes may
> not be prioritized and may be delayed or dropped.
>
> The 80 msec is especially important for near real-time convergence of the
> distributed processes. Concurrently database replication plays a critical
> role as every process reads its local database.
>
> Very casually:
> node1: "Hey node 2, RouteList5 changed"
> node2: "Okay, let me read the changes from my local database"
> node2: "I don't see any changes..."
>
> In the meantime, database replication is held up in the network...
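
The exchange above can be sketched as a toy model. This is not how CUCM
is implemented: the two dicts merely stand in for each node's local
database, and the class, function, and delay parameters are invented for
illustration. It only shows why a fast change notice combined with slow
replication leaves node2 reading stale data.

```python
# Toy model only: dicts stand in for per-node databases (hypothetical names).
class Node:
    def __init__(self, name):
        self.name = name
        self.local_db = {}

    def on_change_notice(self, key):
        """React to a change notice by reading the local database."""
        return self.local_db.get(key)


def simulate(signal_delay_ms, replication_delay_ms):
    """Return what node2 reads from its local database when the notice arrives."""
    node1, node2 = Node("node1"), Node("node2")
    node1.local_db["RouteList5"] = "new order"
    node2.local_db["RouteList5"] = "old order"

    # Process the two arrivals at node2 in time order.
    events = sorted([(signal_delay_ms, "notice"),
                     (replication_delay_ms, "replicate")])
    seen = None
    for _, kind in events:
        if kind == "replicate":
            node2.local_db["RouteList5"] = node1.local_db["RouteList5"]
        else:
            seen = node2.on_change_notice("RouteList5")
    return seen
```

With the notice arriving before the replicated row, for example
simulate(20, 150), node2 acts on the old value even though node1 already
holds the change.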
>
> -Wes
>
>
> On Nov 6, 2018, at 3:31 PM, Nick Barnett <nicksbarnett at gmail.com> wrote:
>
> We think it is happening frequently WITHOUT this command being run. Weird
> stuff happens... like deleting a speed dial and it never goes away... or
> changing the distribution order on a route list that automatically reverts
> back after a few seconds... or maybe the GUI shows it never reverted back,
> however it is clearly not performing the correct algorithm. I can duplicate
> the RTT issue by raising the packet size to 1200 and doing a repeat of 100
> packets. It WILL give me times over 80ms. BUT, the SDL traffic is supposed
> to be QoS'd in a certain way and I'm sure that the pings I'm doing are NOT
> being classified and queued properly. It is very frustrating that I know
> what I'm talking about (enough to discuss it with them, though it has been
> 7 years since I was 100% router jockey) and can't get them to pay attention
> to a probable network issue.
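
To hand the network team individual samples rather than a summary, the
ping test described above could be repeated from a Linux host and the
per-reply times pulled out of the output. This is a minimal sketch that
assumes Linux iputils ping output (lines containing "time=12.3 ms"); IOS
ping only prints min/avg/max, so it will not work on router output. The
function names are made up for this example.

```python
import re

BUDGET_MS = 80.0  # the round-trip limit discussed in this thread

# Matches the per-reply "time=12.3 ms" field printed by Linux iputils ping.
TIME_FIELD = re.compile(r"time[=<]([\d.]+)\s*ms")


def rtts_from_ping_output(text):
    """Return every per-reply round-trip time, in ms, in the order printed."""
    return [float(m.group(1)) for m in TIME_FIELD.finditer(text)]


def over_budget(rtts, budget_ms=BUDGET_MS):
    """Return (reply_number, rtt_ms) for each reply slower than the budget.

    reply_number counts replies received, so it drifts from icmp_seq
    whenever a reply is lost.
    """
    return [(i, rtt) for i, rtt in enumerate(rtts, start=1) if rtt > budget_ms]
```

For example, save the output of "ping -c 100 -s 1200 <node>" to a file,
read it in, and pass the text to rtts_from_ping_output. Note that -s sets
the ICMP payload size, so the packet on the wire is slightly larger.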
>
> I have an IP SLA running that shows average latency in the 20ms range. IP
> SLA is a red herring if you ask me... it only looks at an AVERAGE
> every 5 minutes and if there are no issues, of course it will look great.
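
The point about averages can be shown with a few lines of arithmetic. The
sketch below summarizes a list of RTT samples by mean, 99th percentile
(nearest-rank), maximum, and the number of samples over the 80 ms budget.
The sample values in the note after the code are synthetic, chosen only
to illustrate the effect; they are not measurements from this cluster.

```python
from statistics import mean

BUDGET_MS = 80.0  # the round-trip limit discussed in this thread


def summarize(rtts_ms, budget_ms=BUDGET_MS):
    """Summarize RTT samples so that short spikes stay visible."""
    if not rtts_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(rtts_ms)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), in integer math.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": mean(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
        "over_budget": sum(1 for rtt in ordered if rtt > budget_ms),
    }
```

Take one sample per second for five minutes, with 295 samples at 20 ms
and 5 at 150 ms. The mean comes out at about 22 ms, which looks healthy,
while the maximum and the 99th percentile are both 150 ms and five
samples are over budget.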
>
> Thanks,
> Nick
>
> On Tue, Nov 6, 2018 at 12:42 PM Ryan Huff <ryanhuff at outlook.com> wrote:
>
>> Are you able to correlate the out-of-bounds RTT only to when the
>> dbreplication stat command is run, or are there other times the RTT is OOB
>> that aren't related to querying the replication status?
>>
>> Thanks,
>>
>> -R
>> ------------------------------
>> *From:* cisco-voip <cisco-voip-bounces at puck.nether.net> on behalf of
>> Nick Barnett <nicksbarnett at gmail.com>
>> *Sent:* Tuesday, November 6, 2018 11:57 AM
>> *To:* Cisco VoIP Group
>> *Subject:* [cisco-voip] WAN Delays > 80ms for CUCM cluster?
>>
>> We all know the max latency is 80ms, but ours occasionally goes over. I'm
>> trying to track down why but the network team cannot find an issue. We are
>> able to reproduce the issue repeatedly by running "utils dbreplication
>> runtimestate." Whether this is causing the issue (I doubt it) or that
>> command just takes long enough to run that it will eventually find a time
>> that is > 80ms (my guess is yes)... I'm not 100% sure.
>>
>> We opened a case with TAC to find out what that command is actually
>> doing, but they won't divulge the info that our network team needs.
>>
>> My theory is that it's actually calling some shell script in redhat under
>> the CLI appliance layer. Has anyone investigated that? Do we know what this
>> command is actually doing? Specifically, I want to know where it's getting
>> those ping times... is it running a generic ping with generic datagram
>> data? Is it sending a 1497-byte packet of 0x0000 and then 0xFFFF? Basically, I'm
>> trying to give the network team something to go on because they are saying
>> it's not them. (Of course they could run a packet capture and tell me
>> (mostly) what it's doing, but it's hard to get their attention when they
>> don't think it's on their end).
>>
>> Thanks,
>> Nick
>>
>> P.S.  We have frequent DB replication issues... at least a few times per
>> quarter. This is so annoying and I'm pretty sure it's due to this latency,
>> but I can't get anyone to pay attention.
>>
> _______________________________________________
> cisco-voip mailing list
> cisco-voip at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-voip
>
>