[cisco-voip] CMR and the SD-WAN

Mon Apr 30 10:36:47 EDT 2018

Here is a neat little situation I ran into recently that is worth sharing; if this saves a life, it was worth the crap I had to go through …..

== The Scenario ==

  *   Expressway C/E 8.10.3, clustered over the WAN (2 Control Peers, 2 Edge Peers)
  *   Customer-deployed and customer-managed SD-WAN solution sitting between the Edge cluster and the Internet (with two separate transport carriers). I think it was Palos, but we’ll call it a whitebox’ed solution for our purposes
  *   Using MRA and B2B Expressway configs
  *   UAT for MRA and B2B is accepted and works great

== The Problem ==

The customer applies the zone/search rule config in Expressway for CMR and notices that, at random points during a presentation session in the CMR, the BFCP server (AKA the WebEx meeting) closes the BFCP presentation to the endpoint coming through the customer’s Expressway, while all other BFCP clients keep receiving it. That’s right: it looks like WebEx kicked only the BFCP participant coming from the customer’s Edge, and not because the BFCP server had closed the session as a whole (all other participants remained)! How far into the presentation it happened was random’ish, but it would always happen to that endpoint at some point, generally around the 2 minute’ish mark.

== The Diagnosis ==

Although the timing was random, the consistent’ish length would seem to suggest a timer or re-INVITE of some flavor. That turned out to be wrong. Sparing you all the gory tales of escalation and vendor bus underskirt sliding, the issue was in fact the SD-WAN solution itself.

== The Explanation & The Fix ==

What was happening is that every 120 seconds or so, the BFCP server (WebEx meeting) would send a UDP BFCP packet to all the BFCP presentation subscribers. According to the customer, their SD-WAN solution was identifying these packets (gotta love layer 7 capable firewalls 😊) and queueing them onto a physically different link than the one the stream was on, thus creating physical asymmetry, delay and latency. I specifically requested that all inspection capabilities be turned off for the traffic, but I guess that isn’t the same as “identifying the traffic” …. Lol. In a TCP stream, this would likely be tolerated to a degree: it would show up as packet loss, delay and/or jitter, and the sender would simply retransmit ….. but we are dealing with UDP here, no bueno.

To resolve, the customer had to identify and classify the traffic and force an active/failover transmission through the SD-WAN solution for that traffic, rather than a “load balance” transmission behavior.
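
For anyone who wants to see why moving a single packet to the other carrier is enough to reorder a UDP flow, here is a rough toy model in Python. Everything in it is made up for illustration: the 20 ms send interval, the 10 ms and 45 ms one-way delays, the link names "A" and "B", and the function name arrival_order are assumptions, not values from the customer’s network or from any SD-WAN product.

```python
def arrival_order(send_times_ms, link_delay_ms, pick_link):
    """Return packet indices in the order they would reach the receiver.

    send_times_ms: send time of each packet, indexed by sequence number
    link_delay_ms: one-way delay per link name (hypothetical values)
    pick_link:     function(seq) -> link name, standing in for the SD-WAN policy
    """
    arrivals = []
    for seq, sent_at in enumerate(send_times_ms):
        link = pick_link(seq)
        arrivals.append((sent_at + link_delay_ms[link], seq))
    arrivals.sort()  # the receiver sees packets in arrival-time order
    return [seq for _, seq in arrivals]


# Hypothetical flow: one packet every 20 ms over two carriers.
send_times = [i * 20 for i in range(6)]
delays = {"A": 10, "B": 45}

# Policy 1: the flow lives on link A, but packet 3 is "identified"
# and queued onto link B (the behavior described above).
steered = arrival_order(send_times, delays, lambda seq: "B" if seq == 3 else "A")

# Policy 2: the whole flow is pinned to link A (active/failover style).
pinned = arrival_order(send_times, delays, lambda seq: "A")

print(steered)  # [0, 1, 2, 4, 3, 5]
print(pinned)   # [0, 1, 2, 3, 4, 5]
```

With the flow pinned to one link, packets arrive in the order they were sent. With one packet steered onto the slower link, it shows up after its successor, which is the kind of pattern a UDP receiver has no transport-level way to repair.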

== Sleuthing & The Closing ==

In hindsight, it seems simple and makes perfect sense, right? However, when your only visibility into the network is the Expressway servers themselves, it can be very challenging to discover, because at that point in the topology everything looks like it is coming from and going to the VIP on the firewall pair. So how do you catch something like this when you can’t see everything? PCAPs. Literally counting f**king packet sequence numbers for 6 hours and identifying a consistent pattern of packets coming out of order and being “lost”.
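
The counting itself can be scripted once the sequence numbers are exported from the capture in arrival order. The following is only a sketch of that idea, not the procedure that was actually used: it assumes a 16-bit wrapping counter (as RTP uses), assumes you have already pulled the numbers out of the PCAP with whatever tool you prefer, and the name audit_sequence is hypothetical.

```python
def audit_sequence(seqs, modulus=1 << 16):
    """Scan sequence numbers in arrival order.

    Returns (late, missing):
      late    - numbers that arrived after a higher number had been seen
      missing - numbers that were skipped and never showed up
    Assumes a counter that wraps at `modulus` (16 bits by default).
    """
    late = []
    missing = set()
    expected = None
    for s in seqs:
        if expected is None:
            expected = (s + 1) % modulus
            continue
        ahead = (s - expected) % modulus
        if ahead == 0:
            # exactly the packet we were waiting for
            expected = (s + 1) % modulus
        elif ahead < modulus // 2:
            # jumped forward: everything skipped is provisionally missing
            for k in range(ahead):
                missing.add((expected + k) % modulus)
            expected = (s + 1) % modulus
        else:
            # behind the expected number: a late (or duplicate) arrival
            late.append(s)
            missing.discard(s)
    return late, sorted(missing)


print(audit_sequence([0, 1, 2, 4, 3, 5]))   # ([3], [])
print(audit_sequence([10, 11, 13, 14]))     # ([], [12])
```

Running it per capture and lining up the timestamps of the late entries is one way to spot a repeating pattern without counting by hand. Note that a duplicate packet would also be reported as late by this sketch.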

-Ryan-
