[Outages-discussion] Azure Postmortem

Aaron D. Osgood AOsgood at Streamline-Solutions.net
Wed Sep 12 14:41:08 EDT 2018


Perhaps that is "Lawyer-Speak" for "The damned place caught fire"

 

 

Aaron D. Osgood

Streamline Communications L.L.C

274 E. Eau Gallie Blvd. #332
Indian Harbour Beach, FL 32937

TEL: 207-518-8455
MOBILE: 207-831-5829
GTalk: aaron.osgood
AOsgood at Streamline-Solutions.net
www.Streamline-Solutions.net



Introducing Efficiency to Business since 1986 

 

From: Outages-discussion [mailto:outages-discussion-bounces at outages.org] On
Behalf Of Steve Mikulasik
Sent: September 12, 2018 13:22
To: outages-discussion at outages.org
Subject: [Outages-discussion] Azure Postmortem

 

MS made a statement about what took them down; it sounds like they have some
facility upgrades to do: https://azure.microsoft.com/en-us/status/history/

 


Summary of impact: In the early morning of September 4, 2018, high-energy
storms hit southern Texas in the vicinity of Microsoft Azure's South Central
US region. Multiple Azure datacenters in the region saw voltage sags and
swells across the utility feeds. At 08:42 UTC, lightning caused electrical
activity on the utility supply, which produced significant voltage swells.
These swells triggered a portion of one Azure datacenter to transfer from
utility power to generator power. Additionally, these power swells shut down
the datacenter's mechanical cooling systems despite the surge suppressors in
place. Initially, the datacenter was able to maintain its operational
temperatures through a load-dependent thermal buffer designed into the
cooling system. However, once this thermal buffer was depleted, the
datacenter temperature exceeded safe operational thresholds, and an
automated shutdown of devices was initiated. This shutdown mechanism is
intended to preserve infrastructure and data integrity, but in this
instance, temperatures increased so quickly in parts of the datacenter that
some hardware was damaged before it could shut down. A significant number of
storage servers were damaged, as well as a small number of network devices
and power units.
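
As a rough illustration of the load-dependent thermal buffer and automated
shutdown described above, the sketch below simulates a room warming once
mechanical cooling is lost and triggers an ordered shutdown when the safe
threshold is crossed. All names, thresholds, and rates are invented for the
sketch; they are not Azure's actual systems or values.

    # Hypothetical sketch only: invented thresholds and device names.
    SAFE_TEMP_C = 35.0          # assumed safe operating threshold
    HEAT_RATE_C_PER_MIN = 2.5   # assumed warming rate with cooling offline

    def simulate_cooling_loss(start_temp_c, devices):
        """Warm the room minute by minute while the thermal buffer is
        consumed; once the safe threshold is exceeded, flush and power
        off devices to protect data integrity."""
        temp = start_temp_c
        minutes = 0
        while temp <= SAFE_TEMP_C:
            temp += HEAT_RATE_C_PER_MIN   # thermal buffer being depleted
            minutes += 1
        for name in devices:
            print(f"t+{minutes}min: flushing and powering off {name}")
        return minutes

    if __name__ == "__main__":
        simulate_cooling_loss(24.0, ["storage-01", "storage-02", "net-sw-01"])

The buffer only buys time: the faster the temperature rises, the sooner the
automated shutdown fires, which is why a rapid rise can damage hardware
before the shutdown completes.
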
While storms were still active in the area, onsite teams took a series of
actions to prevent further damage - including transferring the rest of the
datacenter to generators thereby stabilizing the power supply. To initiate
the recovery of infrastructure, the first step was to recover the Azure
Software Load Balancers (SLBs) for storage scale units. SLB services are
critical in the Azure networking stack, managing the routing of both
customer and platform service traffic. The second step was to recover the
storage servers and the data on these servers. This involved replacing
failed infrastructure components, migrating customer data from the damaged
servers to healthy servers, and validating that none of the recovered data
was corrupted. This process took time due to the number of servers damaged,
and the need to work carefully to maintain customer data integrity above all
else. The decision was made to work towards recovery of data and not fail
over to another datacenter, since a failover would have resulted in limited
data loss due to the asynchronous nature of geo-replication.
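
To make the geo-replication trade-off concrete, here is a toy sketch of why
a failover under asynchronous replication forfeits the most recent writes:
the secondary lags the primary, so anything committed after the last
replicated entry would be lost. The log entries, lag value, and function
name are invented for illustration and are not Azure's.

    # Toy example: invented write log and replication lag.
    from datetime import datetime, timedelta

    def writes_lost_on_failover(primary_log, secondary_applied_through):
        """Writes committed on the primary after the last entry the
        asynchronous secondary has applied would be lost on failover."""
        return [w for w in primary_log
                if w["committed_at"] > secondary_applied_through]

    now = datetime(2018, 9, 4, 8, 42)
    primary_log = [
        {"key": "blob-123", "committed_at": now - timedelta(seconds=90)},
        {"key": "blob-124", "committed_at": now - timedelta(seconds=20)},
        {"key": "blob-125", "committed_at": now - timedelta(seconds=5)},
    ]
    # Assume the secondary has applied writes only up to 60 seconds ago.
    secondary_applied_through = now - timedelta(seconds=60)
    print(writes_lost_on_failover(primary_log, secondary_applied_through))
    # -> the two most recent writes would be forfeited, which is why
    #    recovering data in place was preferred over failing over.
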
Despite onsite redundancies, there are scenarios in which a datacenter
cooling failure can impact customer workloads in the affected datacenter.
Unfortunately, this particular set of issues also caused a cascading impact
to services outside of the region, as described below.

 

 


