<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Sep 12, 2018, at 1:48 PM, Steve Mikulasik <<a href="mailto:Steve.Mikulasik@civeo.com" class="">Steve.Mikulasik@civeo.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="WordSection1" style="page: WordSection1; font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif; line-height: 16.959999084472656px;" class=""><span style="font-size: 11pt; line-height: 15.546667098999023px; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class="">My bet is that the servers turn themselves off, rather than it be pod by pod. Things late to detect the temp as high don’t turn off before they get cooked. Also they may have long shutdown times depending if the host has to wait for all the VMs to shutdown before it can.</span></div></div></div></blockquote><div><br class=""></div><div>Also, all my experience is in old-timey datacenters before all these really high-density servers came out, but I imagine that when you have a single rack pulling the equivalent of, say, 25 hair dryers, things get hot way more quickly than you’d think. 
Especially if the air handlers aren’t running, and even worse if these are those closed cabinets.</div><div><br class=""></div><div>I can see an engineer wanting to delay the shutdown - I don’t think Windows yet has a CoW filesystem that can survive a hard power-off during heavy disk activity without it requiring a lengthy check/repair on the next boot.</div><div><br class=""></div><div>Charles</div><br class=""><blockquote type="cite" class=""><div class=""><div class="WordSection1" style="page: WordSection1; font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif; line-height: 16.959999084472656px;" class=""><span style="font-size: 11pt; line-height: 15.546667098999023px; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class=""><o:p class=""></o:p></span></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125);" class=""><o:p class=""> </o:p></span></div><div class=""><div style="border-style: solid none none; border-top-width: 1pt; border-top-color: rgb(225, 225, 225); padding: 3pt 0in 0in;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><b class=""><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class="">From:</span></b><span style="font-size: 11pt; font-family: Calibri, sans-serif;" class=""><span class="Apple-converted-space"> </span>Matt Hoppes <<a href="mailto:mattlists@rivervalleyinternet.net" class="">mattlists@rivervalleyinternet.net</a>><span class="Apple-converted-space"> </span><br class=""><b class="">Sent:</b><span 
class="Apple-converted-space"> </span>Wednesday, September 12, 2018 11:46 AM<br class=""><b class="">To:</b><span class="Apple-converted-space"> </span>Steve Mikulasik <<a href="mailto:Steve.Mikulasik@civeo.com" class="">Steve.Mikulasik@civeo.com</a>>; <a href="mailto:outages-discussion@outages.org" class="">outages-discussion@outages.org</a><br class=""><b class="">Subject:</b><span class="Apple-converted-space"> </span>Re: [Outages-discussion] Azure Postmortem<o:p class=""></o:p></span></div></div></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><o:p class=""> </o:p></div><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class="">The only thing that stands out as odd to me is:<br class=""><br class="">"but in this instance, temperatures increased so quickly in parts of the<span class="Apple-converted-space"> </span><br class="">datacenter that some hardware was damaged before it could shut down."<br class=""><br class="">How long does it take to shutdown a server? 
It shouldn't take more than<span class="Apple-converted-space"> </span><br class="">a minute or two I wouldn't think - and it seems it would be better to<span class="Apple-converted-space"> </span><br class="">just power off than risk hardware damage if the temperature got too extreme.<br class=""><br class="">I don't understand that part.<br class=""><br class="">On 9/12/18 1:22 PM, Steve Mikulasik wrote:<br class="">> MS made a statement about what took them down, sounds like they have<span class="Apple-converted-space"> </span><br class="">> some facility upgrades to do<span class="Apple-converted-space"> </span><br class="">><span class="Apple-converted-space"> </span><a href="https://azure.microsoft.com/en-us/status/history/" style="color: purple; text-decoration: underline;" class="">https://azure.microsoft.com/en-us/status/history/</a><br class="">><span class="Apple-converted-space"> </span><br class="">> *Summary of impact:* In the early morning of September 4, 2018, high<span class="Apple-converted-space"> </span><br class="">> energy storms hit southern Texas in the vicinity of Microsoft Azure’s<span class="Apple-converted-space"> </span><br class="">> South Central US region. Multiple Azure datacenters in the region saw<span class="Apple-converted-space"> </span><br class="">> voltage sags and swells across the utility feeds. At 08:42 UTC,<span class="Apple-converted-space"> </span><br class="">> lightning caused electrical activity on the utility supply, which caused<span class="Apple-converted-space"> </span><br class="">> significant voltage swells.  
These swells triggered a portion of one<span class="Apple-converted-space"> </span><br class="">> Azure datacenter to transfer from utility power to generator power.<span class="Apple-converted-space"> </span><br class="">> Additionally, these power swells shutdown the datacenter’s mechanical<span class="Apple-converted-space"> </span><br class="">> cooling systems despite having surge suppressors in place. Initially,<span class="Apple-converted-space"> </span><br class="">> the datacenter was able to maintain its operational temperatures through<span class="Apple-converted-space"> </span><br class="">> a load dependent thermal buffer that was designed within the cooling<span class="Apple-converted-space"> </span><br class="">> system. However, once this thermal buffer was depleted the datacenter<span class="Apple-converted-space"> </span><br class="">> temperature exceeded safe operational thresholds, and an automated<span class="Apple-converted-space"> </span><br class="">> shutdown of devices was initiated. This shutdown mechanism is intended<span class="Apple-converted-space"> </span><br class="">> to preserve infrastructure and data integrity, but in this instance,<span class="Apple-converted-space"> </span><br class="">> temperatures increased so quickly in parts of the datacenter that some<span class="Apple-converted-space"> </span><br class="">> hardware was damaged before it could shut down. A significant number of<span class="Apple-converted-space"> </span><br class="">> storage servers were damaged, as well as a small number of network<span class="Apple-converted-space"> </span><br class="">> devices and power units.<br class="">> While storms were still active in the area, onsite teams took a series<span class="Apple-converted-space"> </span><br class="">> of actions to prevent further damage – including transferring the rest<span class="Apple-converted-space"> </span><br class="">> of the datacenter to generators thereby stabilizing the power supply. 
To<span class="Apple-converted-space"> </span><br class="">> initiate the recovery of infrastructure, the first step was to recover<span class="Apple-converted-space"> </span><br class="">> the Azure Software Load Balancers (SLBs) for storage scale units. SLB<span class="Apple-converted-space"> </span><br class="">> services are critical in the Azure networking stack, managing the<span class="Apple-converted-space"> </span><br class="">> routing of both customer and platform service traffic. The second step<span class="Apple-converted-space"> </span><br class="">> was to recover the storage servers and the data on these servers. This<span class="Apple-converted-space"> </span><br class="">> involved replacing failed infrastructure components, migrating customer<span class="Apple-converted-space"> </span><br class="">> data from the damaged servers to healthy servers, and validating that<span class="Apple-converted-space"> </span><br class="">> none of the recovered data was corrupted. This process took time due to<span class="Apple-converted-space"> </span><br class="">> the number of servers damaged, and the need to work carefully to<span class="Apple-converted-space"> </span><br class="">> maintain customer data integrity above all else. The decision was made<span class="Apple-converted-space"> </span><br class="">> to work towards recovery of data and not fail over to another<span class="Apple-converted-space"> </span><br class="">> datacenter, since a fail over would have resulted in limited data loss<span class="Apple-converted-space"> </span><br class="">> due to the asynchronous nature of geo replication.<br class="">> Despite onsite redundancies, there are scenarios in which a datacenter<span class="Apple-converted-space"> </span><br class="">> cooling failure can impact customer workloads in the affected<span class="Apple-converted-space"> </span><br class="">> datacenter. 
Unfortunately, this particular set of issues also caused a<span class="Apple-converted-space"> </span><br class="">> cascading impact to services outside of the region, as described below.<br class="">><span class="Apple-converted-space"> </span><br class="">><span class="Apple-converted-space"> </span><br class="">><span class="Apple-converted-space"> </span><br class="">> _______________________________________________<br class="">> Outages-discussion mailing list<br class="">><span class="Apple-converted-space"> </span><a href="mailto:Outages-discussion@outages.org" style="color: purple; text-decoration: underline;" class="">Outages-discussion@outages.org</a><br class="">><span class="Apple-converted-space"> </span><a href="https://puck.nether.net/mailman/listinfo/outages-discussion" style="color: purple; text-decoration: underline;" class="">https://puck.nether.net/mailman/listinfo/outages-discussion</a><br class="">><span class="Apple-converted-space"> </span><o:p class=""></o:p></div></div></div></blockquote></div><br class=""></body></html>