[Outages-discussion] Baicells Outage?

Thu Jan 4 11:50:30 EST 2018

AWS moved up their reboots as well.  Was scheduled for the 6th but I woke up this morning to all our EC2 instances being rebooted.

From: Outages-discussion [mailto:outages-discussion-bounces at outages.org] On Behalf Of Frank Bulk
Sent: Thursday, January 04, 2018 10:46 AM
To: outages-discussion at outages.org
Subject: Re: [Outages-discussion] Baicells Outage?

Moving over to -discussion.

The Intel flaw is real.
There’s nothing on Azure Status history indicating that there was a problem: https://azure.microsoft.com/en-us/status/history/
There were a few Azure issues overnight, but not a lot: http://downdetector.com/status/windows-azure.  Here are two notes from there:
Due to the CPU vulnerability issue Azure VMs are being restarted one after the other. But all of our Azure VMs are very slow during and after the reboot. Some of them cannot be accessed after reboot (only black screen through MSTSC). (Central Europe)
Same problem here. VMs are slow booting and are very slow in general. Diskspd tests gave me very low results (sometimes less than 100 iops)

So it’s not clear to me why BaiCells was so affected while it barely created a ripple for others.  The group (https://www.facebook.com/groups/baicellsoperatorsupportgroup/?ref=br_rs) is closed, so I can’t see what’s being discussed there.

Frank

From: Outages [mailto:outages-bounces at outages.org] On Behalf Of Zak Rupas via Outages
Sent: Thursday, January 04, 2018 10:34 AM
To: Outages at outages.org
Subject: [outages] Baicells Outage?

Good Morning Outages-

Is anyone feeling this issue currently? Any truth to the story?

From Baicells facebook group:

So this morning I feel a bit like the parent who woke up having discovered his child injured someone in a car crash. I'm not directly responsible, but I've some culpability -- or a lot of it -- due to the choices I've made as a parent.

Yes, it is true the discovery of the Intel x86 security flaw set the tech industry ablaze last night. Among the reactions, Microsoft abruptly shut down its Azure cloud servers with no advance notice. Yes, that specific action was beyond our control. But, Amazon is dancing, while people who rely upon Azure -- all of us -- are justifiably frustrated and angry.

On this occasion, redundancy across Azure servers AND across various LTE functions, like HSS and MME, did not help us.

Yes, we are going to investigate adding redundancy across multiple cloud providers such as Amazon....but let's be blunt...who's to say that step will still be enough? What needs completion without excuse or qualification are the other options we've discussed ad nauseam, the local EPC and Halo B.

Apologies from us aren't gonna cut it, nor will excuses, so I'm not going to waste your time offering them. I'm already this AM discussing with our executive team pushing as top priority the other EPC options. Here's what I will tell you.

This past Tuesday a new hire started at Baicells North America. Ronald Mao is ex-Huawei and ex-Motorola and has lived in the US since 1987. His entire career has been centered on product line management. He is our new PLM for, shall we say, major things. I've already asked Jesse to brief Ronald on the ongoing cloud issues, as well as where we are on the local EPC and Halo B. I will ask Ronald to send me an update each week, and I'll pass this on to the group via Facebook AND an email UNTIL THIS IS DONE.

One P.S. note, thank you Cameron and Rick. While it's small consolation to our customers, Rick and Cam have been up all night working with the team overseas, reporting to you, and in general trying to manage what's been frankly out of their hands.

Jesse Raasch<https://www.facebook.com/jesse.raasch> Cameron Kilton<https://www.facebook.com/ciaworks> Rick Harnish<https://www.facebook.com/rick.harnish.10> Savannah Lancaster<https://www.facebook.com/savannah.lancaster> Ronald Mao Minchul Ho<https://www.facebook.com/minchul.ho> Sonny May<https://www.facebook.com/sonny.may.94> Nitisha Potti<https://www.facebook.com/nitisha.potti> Boun Senekham<https://www.facebook.com/bsenekham>

Update: I spoke with Ronald this morning (he is in CA). He has his marching orders. I'll post updates from him until we close on the local EPC and Halo B.

So Cameron has been trying to post this, but it's getting rejected:

For those still having CPE attach issues. Please instruct your customers to have CPE powered off for at least 5 minutes.

Micah Deshotel That didn't work for me. The down ones stayed down. Power

System Alert: OMC is currently reporting offline. We are investigating as of 10:12PM EST. UPDATE: 11:07pm EST Azure is rebooting servers to apply a major patch. Most of our instances are back online. OMC should be restored shortly.

UPDATE 2: 1:20am EST. MME VMs also fell victim to the critical issue with Azure. VMs have since been restored. More information about the issue can be found here: https://www.geekwire.com/2018/cloud-vendors-secretly-scramble-patch-critical-flaw-intel-chips-performance-hits-expected/

We will continue to review our cloud infrastructure and investigate cross platform redundancy.

Cycling the UEs hasn't worked yet either...

Thanks
Zak Rupas
Forethought.net