[Outages-discussion] Azure Postmortem
Matt Hoppes
mattlists at rivervalleyinternet.net
Wed Sep 12 13:46:02 EDT 2018
The only thing that stands out as odd to me is:
"but in this instance, temperatures increased so quickly in parts of the
datacenter that some hardware was damaged before it could shut down."
How long does it take to shut down a server? I wouldn't think it should
take more than a minute or two, and it seems it would be better to just
power off than risk hardware damage if the temperature gets too extreme.
I don't understand that part.
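
What I'd have expected is something along these lines: start a graceful
shutdown early, and if the temperature keeps climbing anyway, just cut
power rather than wait. Here's a rough sketch of that kind of watchdog in
Python - the thresholds, poll interval, sensor path, and the ipmitool
fallback are all my own assumptions, nothing taken from the Azure writeup:

#!/usr/bin/env python3
"""Toy thermal watchdog: graceful shutdown first, hard power-off as a last resort."""
import subprocess
import time

SENSOR = "/sys/class/thermal/thermal_zone0/temp"  # assumed Linux sysfs sensor, millidegrees C
SOFT_LIMIT_C = 45.0   # start an orderly OS shutdown
HARD_LIMIT_C = 55.0   # stop waiting and cut power via the BMC
POLL_SECONDS = 5

def read_temp_c() -> float:
    with open(SENSOR) as f:
        return int(f.read().strip()) / 1000.0

def main() -> None:
    graceful_started = False
    while True:
        temp = read_temp_c()
        if temp >= HARD_LIMIT_C:
            # Too hot to keep waiting for services to stop: force power off
            # through the BMC, which does not depend on the OS responding.
            subprocess.run(["ipmitool", "chassis", "power", "off"], check=False)
            return
        if temp >= SOFT_LIMIT_C and not graceful_started:
            # Still time for an orderly shutdown: flush writes, stop services.
            subprocess.run(["systemctl", "poweroff"], check=False)
            graceful_started = True
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()

The whole question is whether the graceful path can finish before the room
outruns it; the hard cutoff is there so a slow shutdown can't turn into
cooked hardware.
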
On 9/12/18 1:22 PM, Steve Mikulasik wrote:
> MS made a statement about what took them down, sounds like they have
> some facility upgrades to do
> https://azure.microsoft.com/en-us/status/history/
>
> *Summary of impact:* In the early morning of September 4, 2018, high
> energy storms hit southern Texas in the vicinity of Microsoft Azure’s
> South Central US region. Multiple Azure datacenters in the region saw
> voltage sags and swells across the utility feeds. At 08:42 UTC,
> lightning caused electrical activity on the utility supply, which caused
> significant voltage swells. These swells triggered a portion of one
> Azure datacenter to transfer from utility power to generator power.
> Additionally, these power swells shut down the datacenter’s mechanical
> cooling systems despite the surge suppressors in place. Initially,
> the datacenter was able to maintain its operational temperatures through
> a load dependent thermal buffer that was designed within the cooling
> system. However, once this thermal buffer was depleted the datacenter
> temperature exceeded safe operational thresholds, and an automated
> shutdown of devices was initiated. This shutdown mechanism is intended
> to preserve infrastructure and data integrity, but in this instance,
> temperatures increased so quickly in parts of the datacenter that some
> hardware was damaged before it could shut down. A significant number of
> storage servers were damaged, as well as a small number of network
> devices and power units.
> While storms were still active in the area, onsite teams took a series
> of actions to prevent further damage – including transferring the rest
> of the datacenter to generators, thereby stabilizing the power supply. To
> initiate the recovery of infrastructure, the first step was to recover
> the Azure Software Load Balancers (SLBs) for storage scale units. SLB
> services are critical in the Azure networking stack, managing the
> routing of both customer and platform service traffic. The second step
> was to recover the storage servers and the data on these servers. This
> involved replacing failed infrastructure components, migrating customer
> data from the damaged servers to healthy servers, and validating that
> none of the recovered data was corrupted. This process took time due to
> the number of servers damaged, and the need to work carefully to
> maintain customer data integrity above all else. The decision was made
> to work towards recovery of data and not fail over to another
> datacenter, since a failover would have resulted in limited data loss
> due to the asynchronous nature of geo-replication.
> Despite onsite redundancies, there are scenarios in which a datacenter
> cooling failure can impact customer workloads in the affected
> datacenter. Unfortunately, this particular set of issues also caused a
> cascading impact to services outside of the region, as described below.
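
On the decision not to fail over: the reason a failover implies data loss
is that asynchronous geo-replication always leaves the secondary some
number of writes behind the primary. A toy model of that tail, with
made-up numbers that say nothing about how Azure Storage actually
replicates between regions:

#!/usr/bin/env python3
"""Toy model of the data-loss tail under asynchronous geo-replication."""
import random

REPLICATION_LAG_S = (0.5, 5.0)   # assumed per-write replication delay, seconds
FAILOVER_AT_S = 60.0             # moment the primary region is declared lost
WRITE_INTERVAL_S = 0.01          # one acknowledged write every 10 ms

def main() -> None:
    random.seed(1)
    acked = 0        # writes acknowledged to clients by the primary
    replicated = 0   # writes already applied on the secondary at failover time
    t = 0.0
    while t < FAILOVER_AT_S:
        t += WRITE_INTERVAL_S
        acked += 1
        lag = random.uniform(*REPLICATION_LAG_S)
        if t + lag <= FAILOVER_AT_S:   # this write reached the secondary in time
            replicated += 1
    print(f"acknowledged on primary: {acked}")
    print(f"present on secondary:    {replicated}")
    print(f"lost by failing over:    {acked - replicated}")

if __name__ == "__main__":
    main()

That unreplicated tail is the "limited data loss" the statement refers to;
recovering in place avoids it, at the cost of a much longer outage.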