[cisco-voip] MRA DR / Resilience

Pawlowski, Adam ajp26 at buffalo.edu
Wed Jan 13 09:04:32 EST 2021


Hi Nate, we’re still on X12.6.5 so I’ll have to scope this out.

It looks like, if I read that right, the Expressway will finally flag servers as inactive instead of … just not.

It’s unclear if this improves anything with Jabber’s behavior.

My customers have gifted my inbox with Jabber PRT logs this morning, and in reading through them, it looks like most of the issues are:


  *   Jabber trying to hit the CUC node that’s down for SSO auth, which results in a sign in failure
  *   Jabber trying to hit the UCM node that’s down for UDS, which results in a sign in failure

Both would be resolved if the down servers were marked inactive and not presented to the Jabber client, but the Jabber client also has to handle it better when it tries to reach something it cannot, instead of just bombing out. That’s probably a pipe dream with Jabber at this point.
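
To make the "mark it inactive and don't present it" idea concrete, here is a minimal sketch of what I mean. It is only an illustration and not how the Expressway actually does it: the hostnames are made up, the port (8443) is my assumption for the service being probed, and tcp_probe / active_nodes are hypothetical helper names.

```python
import socket

def tcp_probe(host, port=8443, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (assumed health check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def active_nodes(nodes, probe=tcp_probe):
    """Keep only the nodes that answer the probe, preserving their order."""
    return [node for node in nodes if probe(node)]

if __name__ == "__main__":
    # Hypothetical node names; only reachable ones would be handed to the client.
    print(active_nodes(["ucm-sub1.example.edu", "ucm-sub2.example.edu"]))
```

The probe is passed in as a parameter so the filtering can be exercised without touching a real server.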

Thanks again,

Adam

From: NateCCIE <nateccie at gmail.com>
Sent: Wednesday, January 13, 2021 8:56 AM
To: Pawlowski, Adam <ajp26 at buffalo.edu>
Cc: cisco-voip at puck.nether.net
Subject: Re: [cisco-voip] MRA DR / Resilience

SIP Registration Failover for Cisco Jabber - MRA Deployments

https://www.cisco.com/c/dam/en/us/td/docs/voice_ip_comm/expressway/release_note/Cisco-Expressway-Release-Note-X12-7.pdf#page16

This is new in X12.7


On Jan 13, 2021, at 6:10 AM, Pawlowski, Adam <ajp26 at buffalo.edu> wrote:

Hey all,

I’m playing with this scenario now and trying to figure out which parts of the solution work, and which do not, in a DR “site failover” kind of scenario with regard to MRA.

I understand the documentation says there’s no failover for voice and video, but I think that failover is different from the one I’m describing here.

I know I can take Expressway C and Expressway E nodes out of the cluster at will, and things will heal over time once the Jabber clients catch up.

I can take a Unity Connection guest down, and it should work, though the Jetty service certainly has load limits. I don’t think I’m hitting those here.

I can take an IM&P node down, and, with the exception of pChat services (the DB was not deployed HA and the merge job just seems to fail, but that’s another investigation), clients will eventually fail over and recover.

Today, we have half the C cluster, half the E cluster, and one of two CUC nodes down. All IM&P nodes are up. One UCM subscriber is down, and things have been going poorly. Jabber customers keep getting punted from the client at random with “Your session has expired”. The Jabber log suggests the token has expired but doesn’t provide enough debugging detail to know why. It’s possible that the Expressway E is fronting this message, since I understand it sits between Jabber and the rest of the infrastructure for OAuth, and Jabber does not talk to the UCM/CUC directly.

When we did not have SSO, the worst thing we had to do was make sure that the Jabber client’s device pool had an active UCM as the primary in the CMGroup, as clients wouldn’t register properly without that. But those UCMs are up.

Does anyone know what might be going on here?

My best guess is that the Expressway isn’t intelligent enough to mark a UCM (or a CUC server, for that matter) out of service when it is unreachable, and it is trying to refresh a customer’s token against a server that isn’t up. When this times out, instead of trying another server it tells Jabber the refresh token is expired. If this is the case, there’s no cluster resilience with Jabber: if any nodes are down, things are going to be intermittent.
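
To spell out the difference I’m guessing at, here is a toy model in Python. It is purely illustrative and says nothing about the Expressway’s real code: the node names, the NodeDown exception, and the refresh_* functions are all made up, and a "down" node is simulated by simply not being in the set of reachable nodes.

```python
class NodeDown(Exception):
    """Raised when a simulated node does not answer (stands in for a timeout)."""

def refresh_on(node, up_nodes):
    # Hypothetical stand-in for a token refresh request sent to one node.
    if node not in up_nodes:
        raise NodeDown(node)
    return f"new-token-from-{node}"

def refresh_naive(nodes, up_nodes):
    # Suspected behavior: only the first node is tried, and a timeout
    # is reported to the client as an expired token.
    try:
        return refresh_on(nodes[0], up_nodes)
    except NodeDown:
        return "token-expired"

def refresh_with_failover(nodes, up_nodes):
    # Hoped-for behavior: move on to the next node before giving up.
    for node in nodes:
        try:
            return refresh_on(node, up_nodes)
        except NodeDown:
            continue
    return "token-expired"

if __name__ == "__main__":
    nodes = ["ucm-sub1", "ucm-sub2"]   # hypothetical node names
    up = {"ucm-sub2"}                  # ucm-sub1 is down
    print(refresh_naive(nodes, up))
    print(refresh_with_failover(nodes, up))
```

With the first node down, the naive version reports an expired token even though a healthy node could have served the refresh, which would match the intermittent sign-outs.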

Why does Jabber sometimes pop the dialog asking for a new session, and sometimes just kick the customer out of the client, requiring a new sign-in? I see a bug that suggests enabling the LegacyOAuthSignout parameter, but it doesn’t explain what effect that’s going to have on the client.

Basically, this is just a test, but I am trying to learn from it and would appreciate any thoughts or experiences. If it is the Expressway cluster, then there’s no way around this as far as I can tell. Marking a UCM inactive with xAPI doesn’t work; it just gets pushed back to active.

Any comments appreciated.

Best,

Adam Pawlowski
SUNYAB NCS


_______________________________________________
cisco-voip mailing list
cisco-voip at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-voip