[Outages-discussion] Azure Postmortem

Matt Hoppes mattlists at rivervalleyinternet.net
Wed Sep 12 13:46:02 EDT 2018


The only thing that stands out as odd to me is:

"but in this instance, temperatures increased so quickly in parts of the 
datacenter that some hardware was damaged before it could shut down."

How long does it take to shut down a server?  It shouldn't take more than 
a minute or two, I wouldn't think - and it seems it would be better to 
just power off than risk hardware damage if the temperature got too extreme.

I don't understand that part.
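For what it's worth, a graceful shutdown of a loaded storage node can take several minutes while caches flush and services stop, and with mechanical cooling lost a dense room can climb past safe limits in only a few minutes, so the window really is tight. Below is a minimal sketch of the trade-off in question - try a clean shutdown first, cut power if it doesn't finish in time - assuming a Linux node reachable over SSH and a BMC reachable with ipmitool; the hostnames, credentials, and the 45 C threshold are placeholders, not anything from the Azure postmortem.

#!/usr/bin/env python3
"""Out-of-band thermal watchdog sketch: ask a node to shut down cleanly,
then cut its power through the BMC if it is still on after a short deadline."""
import subprocess
import time

NODE = "storage-node-01.example.net"    # placeholder hostname of the monitored server
BMC = ["ipmitool", "-I", "lanplus", "-H", "bmc-01.example.net",
       "-U", "admin", "-P", "secret"]   # placeholder BMC address and credentials
TEMP_LIMIT_C = 45.0                     # assumed inlet-temperature limit, not a real Azure value
GRACEFUL_DEADLINE_S = 120               # how long to wait for a clean OS shutdown


def ipmi(*args: str) -> str:
    """Run one ipmitool subcommand against the node's BMC and return its stdout."""
    return subprocess.run(BMC + list(args), capture_output=True, text=True, check=False).stdout


def inlet_temp_c() -> float:
    """Highest reading from `ipmitool sdr type temperature` (lines end in 'NN degrees C')."""
    temps = []
    for line in ipmi("sdr", "type", "temperature").splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 5 and fields[4].endswith("degrees C"):
            temps.append(float(fields[4].split()[0]))
    return max(temps, default=0.0)


def chassis_is_on() -> bool:
    return "Power is on" in ipmi("chassis", "power", "status")


def main() -> None:
    # Wait until the inlet temperature crosses the safe-operation limit.
    while inlet_temp_c() < TEMP_LIMIT_C:
        time.sleep(10)

    # Step 1: graceful shutdown, so caches flush and services stop cleanly.
    subprocess.run(["ssh", f"root@{NODE}", "systemctl poweroff"], check=False)

    # Step 2: if the clean shutdown has not finished within the deadline,
    # cut power anyway: this risks in-flight writes but protects the hardware.
    deadline = time.time() + GRACEFUL_DEADLINE_S
    while time.time() < deadline and chassis_is_on():
        time.sleep(5)
    if chassis_is_on():
        ipmi("chassis", "power", "off")


if __name__ == "__main__":
    main()

Presumably the BMC itself, or a rack-level controller, would do this kind of enforcement in practice, since the in-band OS may be exactly the thing that hangs or overheats.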

On 9/12/18 1:22 PM, Steve Mikulasik wrote:
> MS made a statement about what took them down, sounds like they have 
> some facility upgrades to do 
> https://azure.microsoft.com/en-us/status/history/
> 
> *Summary of impact:* In the early morning of September 4, 2018, high 
> energy storms hit southern Texas in the vicinity of Microsoft Azure’s 
> South Central US region. Multiple Azure datacenters in the region saw 
> voltage sags and swells across the utility feeds. At 08:42 UTC, 
> lightning caused electrical activity on the utility supply, which caused 
> significant voltage swells.  These swells triggered a portion of one 
> Azure datacenter to transfer from utility power to generator power. 
> Additionally, these power swells shut down the datacenter’s mechanical 
> cooling systems despite surge suppressors being in place. Initially, 
> the datacenter was able to maintain its operational temperatures through 
> a load-dependent thermal buffer designed within the cooling 
> system. However, once this thermal buffer was depleted, the datacenter 
> temperature exceeded safe operational thresholds, and an automated 
> shutdown of devices was initiated. This shutdown mechanism is intended 
> to preserve infrastructure and data integrity, but in this instance, 
> temperatures increased so quickly in parts of the datacenter that some 
> hardware was damaged before it could shut down. A significant number of 
> storage servers were damaged, as well as a small number of network 
> devices and power units.
> While storms were still active in the area, onsite teams took a series 
> of actions to prevent further damage, including transferring the rest 
> of the datacenter to generators, thereby stabilizing the power supply. To 
> initiate the recovery of infrastructure, the first step was to recover 
> the Azure Software Load Balancers (SLBs) for storage scale units. SLB 
> services are critical in the Azure networking stack, managing the 
> routing of both customer and platform service traffic. The second step 
> was to recover the storage servers and the data on these servers. This 
> involved replacing failed infrastructure components, migrating customer 
> data from the damaged servers to healthy servers, and validating that 
> none of the recovered data was corrupted. This process took time due to 
> the number of servers damaged, and the need to work carefully to 
> maintain customer data integrity above all else. The decision was made 
> to work towards recovery of data and not fail over to another 
> datacenter, since a failover would have resulted in limited data loss 
> due to the asynchronous nature of geo-replication.
> Despite onsite redundancies, there are scenarios in which a datacenter 
> cooling failure can impact customer workloads in the affected 
> datacenter. Unfortunately, this particular set of issues also caused a 
> cascading impact to services outside of the region, as described below.
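
On the choice not to fail over: with asynchronous geo-replication the paired region is always some number of acknowledged writes behind the primary, so promoting it turns that replication lag into permanent data loss. A toy sketch of that gap, with made-up names and numbers rather than anything Azure-specific:

#!/usr/bin/env python3
"""Toy model of why failing over an asynchronously replicated store loses data:
the secondary only ever holds a prefix of the primary's acknowledged write log."""
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list = field(default_factory=list)   # ordered, durable writes


primary = Replica("primary-region")
secondary = Replica("paired-region")

# Writes are acknowledged to customers as soon as the primary commits them...
for i in range(100):
    primary.log.append(f"write-{i}")

# ...while replication ships them to the paired region in the background, with lag.
REPLICATION_LAG_WRITES = 7                     # illustrative lag, not a real Azure figure
secondary.log = primary.log[:-REPLICATION_LAG_WRITES]

# Failing over means promoting the secondary: everything past its log tail is gone.
lost = primary.log[len(secondary.log):]
print(f"failover would discard {len(lost)} acknowledged writes, e.g. {lost[:3]}")

# Recovering the primary in place (the option Microsoft chose) keeps all 100 writes,
# at the cost of a longer outage while damaged hardware is replaced and data validated.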

