<div dir="auto">I also read this blog post and had a very different reaction to it, which I think can be summed up as "it's weird to make your post-mortem so much about the things that another company did wrong." Yeah, Flexential undoubtedly made some mistakes/poor decisions, but ultimately those details have no bearing on Cloudflare's issues. They could have said "our data center had a power outage" and that would have been enough information to provide context for the Cloudflare parts of the story.<div dir="auto"><br></div><div dir="auto">I suspect (and I'm just guessing here) that part of the thought process on Cloudflare's part was to draw attention away from their highly visible failure and toward someone else's failure. But regardless of the reasoning, it just seems unwise to publicly throw your vendor under the bus like that... unless you are actively trying not to do business with them anymore.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 5, 2023, 3:24 AM Chapman, Brad (NBCUniversal) via Outages-discussion &lt;<a href="mailto:outages-discussion@outages.org">outages-discussion@outages.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="auto">
<div dir="ltr">
<div>
<blockquote type="cite"><i>Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. </i></blockquote>
<br>
</div>
<div>Off to a good start, then...</div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time... we haven't gotten a clear answer why they ran utility power and generator power.</i></blockquote>
<br>
</div>
<div>Yeah, there's a reason the power company tells homeowners to not improvise by backfeeding their house from a generator using a "suicide cord" when the linemen are working outside. You're supposed to install a cutover switch, or at least turn off your
house main circuit breaker.</div>
<div><br>
<blockquote type="cite"><i>Some of what follows is informed speculation based on the most likely series of events as well as what individual Flexential employees have shared with us unofficially.</i></blockquote>
</div>
<div><br>
</div>
<div>Oh boy, this is about to get spicy...</div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>One possible reason they may have left the utility line running is because Flexential was part of a program with PGE called DSG ... [which] allows the local utility to run a data center's generators to help supply additional power
to the grid. In exchange, the power company helps maintain the generators and supplies fuel. We have been unable to locate any record of Flexential informing us about the DSG program. We've asked if DSG was active at the time and have not received an answer. </i></blockquote>
<br>
</div>
<div>You can't ask what you don't know, but it seems like power generation is one of those important things that should be told to your single largest customer who is <b>leasing 10% of your entire facility</b>.</div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>At approximately 11:40 UTC, there was a ground fault on a PGE transformer at PDX-04... [and] ground faults with high voltage (12,470 volt) power lines are very bad.</i></blockquote>
</div>
<div><br>
</div>
<div>That's underselling it a bit.</div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>Fortunately ... PDX-04 also contains a bank of UPS batteries... [that] are supposedly sufficient to power the facility for approximately 10 minutes... In reality, the batteries started to fail after only 4 minutes ... and it took
Flexential far longer than 10 minutes to get the generators restored.</i></blockquote>
</div>
<div><br>
</div>
<div>Correct me if I'm wrong, but aren't UPS batteries supposed to be exercised with deep-cycling on a regular basis? It sounds like they were extremely worn out when they were needed most.</div>
<div><br>
</div>
<div>
<div>
<blockquote type="cite"><i>While we haven't gotten official confirmation, we have been told by employees that [the generators] needed to be physically accessed and manually restarted because of the way the ground fault had tripped circuits. Second, Flexential's
access control system was not powered by the battery backups, so it was offline. </i></blockquote>
</div>
<div><br>
</div>
<div>That sounds objectively dumber than what happened at the Meta/Facebook datacenter outage a while ago, where the doors and badge readers were still online, but the badges couldn't be evaluated via the network due to the BGP crash, and the credentials weren't
cached locally either. </div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>And third, the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.</i></blockquote>
</div>
</div>
<div><br>
</div>
<div>:picard-facepalm:</div>
<div><br>
</div>
<div>
<blockquote type="cite"><i>Throughout this, Flexential never informed Cloudflare that there was any issue at the facility. [We] attempted to contact Flexential and dispatched our local team to physically travel to the facility.</i></blockquote>
<br>
</div>
<div>Adele: "Hello from the outsiiiiide..."</div>
<div><br>
</div>
<blockquote type="cite"><i>"We have a number of questions that we need answered from Flexential."</i></blockquote>
<div><br>
</div>
<div>Understatement of the year. They must be seething. </div>
<div><br>
</div>
<div>Cloudflare's report here is fairly even-handed and appears to have been fact-checked as well as possible under the circumstances, with corroborated statements from anonymous employees.</div>
<div><br>
</div>
<div>Having read the technical stack in the document and their plans to beef up disaster recovery, I have the utmost respect for Cloudflare for quickly acknowledging and apologizing for the fact that they didn't require <b><u>new</u></b> services to
be fully capable of active, redundant operation in the event of catastrophic service loss at their
<b><u>primary</u></b> datacenter, a site which they believed to be reliable and indefatigable. </div>
<div><br>
</div>
<div>They had planned many disaster exercises to combat the loss of PDX-04, but none that covered a complete loss of power lasting more than 10 minutes, or even 4 with shoddy batteries. </div>
<div><br>
</div>
<div>To quote Ricky Ricardo, Flexential has some 'splainin' to do.</div>
<div><br>
</div>
<div>-Brad</div>
<div><br>
</div>
<div> </div>
<div><br>
<div dir="ltr"><br>
<blockquote type="cite">On Nov 4, 2023, at 10:30 PM, Bryan Fields via Outages &lt;<a href="mailto:outages@outages.org" target="_blank" rel="noreferrer">outages@outages.org</a>&gt; wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr"><span>On 11/3/23 5:27 PM, Martin Hannigan via Outages wrote:</span><br>
<blockquote type="cite"><span>Maybe there are questions?</span><br>
</blockquote>
<span></span><br>
<span><a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/" target="_blank" rel="noreferrer">https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/</a>
</span><br>
<span>That has some info.</span><br>
<span>-- </span><br>
<span>Bryan Fields</span><br>
<span></span><br>
<span>727-409-1194 - Voice</span><br>
<span><a href="http://bryanfields.net" target="_blank" rel="noreferrer">http://bryanfields.net</a>
</span><br>
<span>_______________________________________________</span><br>
<span>Outages mailing list</span><br>
<span><a href="mailto:Outages@outages.org" target="_blank" rel="noreferrer">Outages@outages.org</a></span><br>
<span><a href="https://puck.nether.net/mailman/listinfo/outages" target="_blank" rel="noreferrer">https://puck.nether.net/mailman/listinfo/outages</a>
</span><br>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote></div>