[cisco-voip] MRA DR / Resilience

Wed Jan 13 08:09:36 EST 2021

Hey all,

I'm testing this scenario now and trying to figure out which parts of the solution work, and which do not, in a DR "site failover" kind of scenario with regard to MRA.

I understand the documentation states there's no failover for voice and video, but I think that failover is different from the one I'm describing here.

I know I can take Expressway C and Expressway E nodes out of the cluster at will, and things will heal over time once the Jabber clients catch up.

I can take a Unity Connection guest down, and it should work, though the Jetty service certainly has load limits. I don't think I'm hitting those here.

I can take an IM&P node down, and, with the exception of pChat services (the DB was not deployed HA and the merge job just seems to fail, but that's another investigation), clients will eventually fail over and recover.

Today, we have half the C cluster, half the E cluster, and one of two CUC nodes down. All IMP nodes are up. One UCM subscriber is down, and things have been going poorly. Jabber customers keep getting punted from the client with "Your session has expired" at random. The Jabber log makes it look like the token has expired, but it doesn't provide enough debugging detail to know why. It's possible that the Expressway E is fronting this message, since I understand it sits between Jabber and the rest of the infrastructure for OAuth, and Jabber does not talk to the UCM/CUC directly.

When we did not have SSO, the worst thing we had to do was make sure that the Jabber client's device pool had an active UCM as the primary in the CMGroup, as clients wouldn't register properly without that. Those UCMs are up, though.

Does anyone know what might be going on here?

My best guess is that the Expressway isn't intelligent enough to mark a UCM (or a CUC server, for that matter) out of service when it is unreachable, and it is trying to refresh a customer's token against a server that isn't up. When this times out, instead of trying another server, it tells Jabber the refresh token is expired. If this is the case, there's no cluster resilience with Jabber: if any nodes are down, things are going to be intermittent.
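To make that guess concrete, here is a small Python sketch of the two behaviours I'm contrasting. It is purely a toy model of my hypothesis: the node names, the NodeUnreachable exception, and the refresh functions are all made up for illustration and are not taken from any Expressway code, API, or documentation.

```python
from typing import Callable, Sequence


class NodeUnreachable(Exception):
    """Raised by the toy refresh function when a node does not answer."""


def refresh_naive(nodes: Sequence[str], refresh: Callable[[str], str]) -> str:
    # Hypothesised behaviour: only the first node is tried, and a
    # timeout is reported to the client as an expired token.
    try:
        return refresh(nodes[0])
    except NodeUnreachable:
        return "expired"


def refresh_with_failover(nodes: Sequence[str], refresh: Callable[[str], str]) -> str:
    # Behaviour I would expect from a resilient cluster: skip nodes
    # that do not answer and only give up when none are left.
    for node in nodes:
        try:
            return refresh(node)
        except NodeUnreachable:
            continue
    return "unavailable"


def toy_refresh(node: str) -> str:
    # Hypothetical stand-in for a token refresh; "ucm-sub1" plays the
    # subscriber that is down in my test.
    if node == "ucm-sub1":
        raise NodeUnreachable(node)
    return "refreshed via " + node


nodes = ["ucm-sub1", "ucm-sub2"]
print(refresh_naive(nodes, toy_refresh))          # expired
print(refresh_with_failover(nodes, toy_refresh))  # refreshed via ucm-sub2
```

If the real behaviour is closer to the first function, and the node picked varies from one refresh to the next, that would line up with customers being kicked out only some of the time.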

Why does Jabber sometimes pop the dialog asking for a new session, and other times just kick the customer out of the client, requiring a new sign-in? I see a bug that suggests enabling the LegacyOAuthSignout parameter, but it doesn't explain what effect that's going to have on the client.

Basically, this is just a test, but I am trying to learn from it and would appreciate any thoughts/experiences. If it is the Expressway cluster, then there's no way around this as far as I can tell. Marking a UCM inactive with xAPI doesn't work; it just gets pushed back to active.

Any comments appreciated.

Best,

Adam Pawlowski
SUNYAB NCS
