[Outages-discussion] Azure Postmortem

Charles Sprickman spork at bway.net
Wed Sep 12 13:59:19 EDT 2018


> On Sep 12, 2018, at 1:48 PM, Steve Mikulasik <Steve.Mikulasik at civeo.com> wrote:
> 
> My bet is that the servers turn themselves off, rather than it being done pod by pod. Anything that's late to detect the high temperature doesn't turn off before it gets cooked. They may also have long shutdown times, depending on whether the host has to wait for all of its VMs to shut down before it can.

Also, all my experience is in old-timey datacenters from before these really high-density servers came out, but I imagine that when you have a single rack pulling the equivalent of, say, 25 hair dryers, things get hot way more quickly than you’d think - especially if the air handlers aren’t running, and even worse if these are closed cabinets.
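
Just to put some made-up numbers on that (these are my own guesses, nothing from the postmortem): 25 hair dryers at roughly 1.5 kW each is on the order of 35-40 kW in one rack, and air has almost no thermal mass. Quick back-of-the-envelope in Python:

RACK_POWER_W  = 25 * 1500          # ~25 hair dryers at ~1.5 kW each (my guess)
AIR_VOLUME_M3 = 2.0                # rough free-air volume in a closed cabinet (guess)
AIR_DENSITY   = 1.2                # kg/m^3 for room-temperature air
AIR_CP        = 1005.0             # J/(kg*K), specific heat of air

air_mass_kg   = AIR_VOLUME_M3 * AIR_DENSITY
heat_capacity = air_mass_kg * AIR_CP             # joules per kelvin
rise_per_sec  = RACK_POWER_W / heat_capacity     # K/s if all the heat stayed in the air

print(f"~{rise_per_sec:.0f} K/s rise in cabinet air temperature with no airflow")

Obviously the servers' own sheet metal and heatsinks soak up most of that in practice, but it gives a feel for why the "thermal buffer" in the cooling system only buys you minutes, not hours.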

I can see an engineer wanting to delay the shutdown - I don’t think Windows yet has a CoW filesystem that can survive a hard power-off during heavy disk activity without requiring a lengthy check/repair on the next boot.
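
If I were the one writing that emergency logic, I'd want "graceful but bounded": ask the guests to shut down cleanly, but put a hard deadline on it and pull the plug when the deadline passes. Rough sketch in Python against a plain libvirt/KVM host - purely illustrative, I have no idea what Azure actually runs:

#!/usr/bin/env python3
# Sketch of a "graceful but bounded" emergency shutdown for a VM host.
# Assumes a Linux host managed via libvirt's virsh CLI - illustrative only.

import subprocess
import time

GRACE_SECONDS = 120   # how long we're willing to wait for guests to drain

def running_guests():
    out = subprocess.run(["virsh", "list", "--name"],
                         capture_output=True, text=True, check=True).stdout
    return [name for name in out.splitlines() if name.strip()]

def main():
    # Ask every guest to shut down cleanly (ACPI shutdown request).
    for guest in running_guests():
        subprocess.run(["virsh", "shutdown", guest], check=False)

    deadline = time.monotonic() + GRACE_SECONDS
    while running_guests() and time.monotonic() < deadline:
        time.sleep(5)

    # Past the deadline: protecting the hardware beats a clean unmount.
    for guest in running_guests():
        subprocess.run(["virsh", "destroy", guest], check=False)   # hard power-off of the VM

    subprocess.run(["systemctl", "poweroff"], check=False)         # then the host itself

if __name__ == "__main__":
    main()

The whole argument is really about how big that grace period can safely be - going by the back-of-the-envelope above, not very.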

Charles

>  
> From: Matt Hoppes <mattlists at rivervalleyinternet.net> 
> Sent: Wednesday, September 12, 2018 11:46 AM
> To: Steve Mikulasik <Steve.Mikulasik at civeo.com>; outages-discussion at outages.org
> Subject: Re: [Outages-discussion] Azure Postmortem
>  
> The only thing that stands out as odd to me is:
> 
> "but in this instance, temperatures increased so quickly in parts of the 
> datacenter that some hardware was damaged before it could shut down."
> 
> How long does it take to shut down a server? I wouldn't think it should 
> take more than a minute or two - and it seems it would be better to just 
> power off than risk hardware damage if the temperature got too extreme.
> 
> I don't understand that part.
> 
> On 9/12/18 1:22 PM, Steve Mikulasik wrote:
> > MS made a statement about what took them down; sounds like they have 
> > some facility upgrades to do: 
> > https://azure.microsoft.com/en-us/status/history/
> > 
> > *Summary of impact:* In the early morning of September 4, 2018, high 
> > energy storms hit southern Texas in the vicinity of Microsoft Azure’s 
> > South Central US region. Multiple Azure datacenters in the region saw 
> > voltage sags and swells across the utility feeds. At 08:42 UTC, 
> > lightning caused electrical activity on the utility supply, which caused 
> > significant voltage swells.  These swells triggered a portion of one 
> > Azure datacenter to transfer from utility power to generator power. 
> > Additionally, these power swells shut down the datacenter’s mechanical 
> > cooling systems despite having surge suppressors in place. Initially, 
> > the datacenter was able to maintain its operational temperatures through 
> > a load-dependent thermal buffer that was designed within the cooling 
> > system. However, once this thermal buffer was depleted, the datacenter 
> > temperature exceeded safe operational thresholds, and an automated 
> > shutdown of devices was initiated. This shutdown mechanism is intended 
> > to preserve infrastructure and data integrity, but in this instance, 
> > temperatures increased so quickly in parts of the datacenter that some 
> > hardware was damaged before it could shut down. A significant number of 
> > storage servers were damaged, as well as a small number of network 
> > devices and power units.
> > While storms were still active in the area, onsite teams took a series 
> > of actions to prevent further damage – including transferring the rest 
> > of the datacenter to generators, thereby stabilizing the power supply. To 
> > initiate the recovery of infrastructure, the first step was to recover 
> > the Azure Software Load Balancers (SLBs) for storage scale units. SLB 
> > services are critical in the Azure networking stack, managing the 
> > routing of both customer and platform service traffic. The second step 
> > was to recover the storage servers and the data on these servers. This 
> > involved replacing failed infrastructure components, migrating customer 
> > data from the damaged servers to healthy servers, and validating that 
> > none of the recovered data was corrupted. This process took time due to 
> > the number of servers damaged, and the need to work carefully to 
> > maintain customer data integrity above all else. The decision was made 
> > to work towards recovery of data and not fail over to another 
> > datacenter, since a failover would have resulted in limited data loss 
> > due to the asynchronous nature of geo-replication.
> > Despite onsite redundancies, there are scenarios in which a datacenter 
> > cooling failure can impact customer workloads in the affected 
> > datacenter. Unfortunately, this particular set of issues also caused a 
> > cascading impact to services outside of the region, as described below.
> > 
> > 
> > 
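
PS - on the bit above about choosing recovery over failover: with asynchronous geo-replication the paired region is always a little behind the primary, so failing over means throwing away whatever writes hadn't been shipped yet. Tiny toy model (numbers invented) of that trade-off:

# Toy model of the failover-vs-recover trade-off with async geo-replication.
# All numbers are invented; the point is just that the secondary trails the primary.

REPLICATION_LAG = 250                            # acknowledged writes still in flight

primary_log   = list(range(1, 10_001))           # writes acknowledged to customers
secondary_log = primary_log[:-REPLICATION_LAG]   # paired region trails by the lag

# Disaster takes out the primary here.
# Option A (what they chose): repair in place - nothing lost, but a long outage.
# Option B: fail over to the secondary - back up quickly, but the tail is gone.
lost_on_failover = len(primary_log) - len(secondary_log)
print(f"{lost_on_failover} acknowledged writes would be discarded by failing over")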
