[cisco-voip] CMR and the SD-WAN

Mon Apr 30 11:40:08 EDT 2018

I think Engineering Deathmatch should be more about troubleshooting than
configuration.  This is the type of stuff which separates the Engineers
from the Installers.

On Mon, Apr 30, 2018 at 10:34 AM Ryan Huff <ryanhuff at outlook.com> wrote:

> It’s funny what we’ll do for “strategic customers”. Lol, ah well there was
> Jack and Coke at the end of that rainbow for sure :).
>
> Sent from my iPhone
>
> On Apr 30, 2018, at 11:27, Anthony Holloway <
> avholloway+cisco-voip at gmail.com> wrote:
>
> Yes, what James said, thank you for sharing this info.  I think I would
> have given up at "counting f**king packet sequence numbers."
>
> On Mon, Apr 30, 2018 at 10:13 AM James Buchanan <james.buchanan2 at gmail.com>
> wrote:
>
>> Painful as this was, hats off to you for writing this up and sharing.
>> Much appreciated!
>>
>> On Mon, Apr 30, 2018 at 3:36 PM, Ryan Huff <ryanhuff at outlook.com> wrote:
>>
>>> So here is a *neat* little situation I ran into recently, and is worth
>>> sharing and reading; if this saves a life it was worth the crap I had to go
>>> through …..
>>>
>>>
>>>
>>> == The Scenario ==
>>>
>>>
>>>
>>>    - Expressway C/E 8.10.3 clustered over the WAN (2 Control Peers, 2 Edge
>>>    Peers)
>>>    - Customer-deployed and customer-managed SD-WAN solution in front of the
>>>    Edge cluster to the Internet (with two separate transport carriers). I
>>>    think it was Palos, but we’ll call it a white-box solution for our purposes
>>>    - Using MRA and B2B Expressway configs
>>>    - UAT for MRA and B2B is accepted and works great
>>>
>>>
>>>
>>> == The Problem ==
>>>
>>>
>>>
>>> The customer applies the zone/search rule config in Expressway for CMR
>>> and notices that randomly, during a presentation session in the CMR, the
>>> BFCP server (AKA, the WebEx meeting) will close the BFCP presentation to
>>> the endpoint coming from the customer’s Expressway; all other BFCP clients
>>> are still receiving the BFCP presentation. That’s right, it *appears*
>>> that WebEx *kicked* the BFCP participant coming from the customer’s
>>> Edge, but not because the BFCP server closed the whole session (all other
>>> participants remain)! Although the time into the presentation varied
>>> somewhat, it would always happen to the endpoint at some point, generally
>>> around the 2-minute mark.
>>>
>>>
>>>
>>> == The diagnosis ==
>>>
>>>
>>>
>>> Although random, a consistent’ish interval would seem to suggest a timer /
>>> re-INVITE of some flavor, but that turned out to be wrong. Sparing you
>>> all the gory tales of escalation and vendor bus underskirt sliding: the
>>> issue was, in fact, the SD-WAN solution itself.
>>>
>>>
>>>
>>> == The Explanation & The Fix ==
>>>
>>>
>>>
>>> What was happening is that every 120 seconds or so, the BFCP server
>>> (WebEx meeting) would send a UDP BFCP packet to all the BFCP presentation
>>> subscribers. According to the customer, the SD-WAN solution was
>>> *identifying* these packets (gotta love layer 7 capable firewalls 😊) and
>>> queueing them onto a physically different link than the one the stream
>>> was on, thus creating *physical asymmetry and added latency*. I
>>> specifically requested that all inspection capabilities be turned off for
>>> the traffic but I guess that isn’t the same as “identifying the traffic” ….
>>> Lol. In a TCP stream, this would likely be tolerated to a degree as packet
>>> loss, delay and/or jitter, and TCP would simply retransmit ….. but we are
>>> dealing with *UDP* here, no bueno.
>>>
>>>
>>>
>>> To resolve, the customer had to identify and classify the traffic and
>>> force an active/failover transmission through the SD-WAN solution for that
>>> traffic, rather than a “load balance” transmission behavior.
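
For anyone who wants to see the effect on paper, here is a minimal sketch of why steering one class of packet onto a second carrier reorders a UDP flow, and why pinning the flow to one link does not. Everything in it is an assumption for illustration: the link names, the latency figures, the 20 ms send interval and the "periodic" label are made up, and it models nothing about how any particular SD-WAN product classifies traffic.

```python
# Toy model: each packet is (sequence_number, kind). Numbers are hypothetical.
LATENCY_MS = {"carrier_a": 20, "carrier_b": 65}  # assumed one-way latencies

def arrival_order(packets, send_interval_ms, latency_ms, choose_link):
    """Return sequence numbers in the order the receiver would see them."""
    arrivals = []
    for i, (seq, kind) in enumerate(packets):
        sent_at = i * send_interval_ms
        link = choose_link(kind)
        arrivals.append((sent_at + latency_ms[link], seq))
    arrivals.sort()  # earliest arrival first
    return [seq for _, seq in arrivals]

def split_by_class(kind):
    # "Load balance" style behavior: the identified packet takes the other link.
    return "carrier_b" if kind == "periodic" else "carrier_a"

def pinned(kind):
    # Active/failover style behavior: everything stays on the active link.
    return "carrier_a"

packets = [
    (1, "stream"), (2, "stream"), (3, "periodic"),
    (4, "stream"), (5, "stream"), (6, "stream"),
]

split_order = arrival_order(packets, 20, LATENCY_MS, split_by_class)
pinned_order = arrival_order(packets, 20, LATENCY_MS, pinned)

print(split_order)   # [1, 2, 4, 5, 3, 6]
print(pinned_order)  # [1, 2, 3, 4, 5, 6]
```

With the split policy, packet 3 shows up after 4 and 5. With the pinned policy the order is preserved, which is the behavior the active/failover change was after.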
>>>
>>>
>>>
>>> == Sleuthing & The Closing ==
>>>
>>>
>>>
>>> In hindsight, it seems simple and makes perfect sense, right? However, when
>>> your only visibility into the network is the Expressway servers themselves,
>>> it can be *very* challenging to discover because at that point in the
>>> topology, everything looks like it is coming from and going to the VIP on
>>> the firewall pair. So how do you catch something like this when you can’t
>>> see everything? *PCAPs*. *Literally counting f**king packet sequence
>>> numbers for 6 hours and identifying a consistent pattern of packets coming
>>> out of order and being “lost”.*
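
The counting itself can be scripted. Below is a rough sketch, assuming you have already exported the sequence numbers for one flow from the capture into a list, in arrival order (how you export them depends on your tooling and is not shown). The function name and the report keys are made up for this example, and it does not handle sequence-number wraparound.

```python
def sequence_report(observed):
    """Flag late (out-of-order), missing and duplicate sequence numbers.

    `observed` is a list of integers in the order the packets arrived.
    """
    highest = None
    seen = set()
    late = []        # arrived after a higher number had already been seen
    duplicates = []  # seen more than once
    for seq in observed:
        if seq in seen:
            duplicates.append(seq)
            continue
        seen.add(seq)
        if highest is not None and seq < highest:
            late.append(seq)
        else:
            highest = seq
    if seen:
        expected = set(range(min(seen), max(seen) + 1))
        missing = sorted(expected - seen)  # never arrived within the capture
    else:
        missing = []
    return {"late": late, "missing": missing, "duplicates": duplicates}

report = sequence_report([1, 2, 4, 5, 3, 6, 8])
print(report)  # {'late': [3], 'missing': [7], 'duplicates': []}
```

Running this over captures from several failed sessions and comparing where the late and missing entries fall is one way to spot a repeating pattern without counting by hand.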
>>>
>>>
>>>
>>> -Ryan-
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> cisco-voip mailing list
>>> cisco-voip at puck.nether.net
>>> https://puck.nether.net/mailman/listinfo/cisco-voip
>>>
>>>
>>
>