[Outages-discussion] Re: FB Outage AAR I - Engineering Posts Pabulum

Chapman, Brad (NBCUniversal) Brad.Chapman at nbcuni.com
Tue Oct 5 16:00:30 EDT 2021


Two things I gleaned from this article, reading between the lines…

First, whoever wrote this audit tool wasn’t expecting someone to make such a flagrant error.  This is a great example of hubris at work:

“This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”

Labeling a process oversight as a “bug” is disingenuous and misleading.

Second, the rumor about using an angle grinder to open server cages was probably true, at least in one of the datacenters.  Note the careful wording of “the hardware and routers are designed to be difficult to modify.”  I wonder why that would be?  In a dire emergency, a simple text message of “We have access to the routers” would have satisfied the incident management team.  They would not care how it happened.

“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.”


On Oct 5, 2021, at 12:44 PM, Anthony Hoppe <anthony at vofr.net> wrote:

If that's the case, it's a bit of an oversight.  If you're depending on DNS for OOB access, you'd want OOB DNS servers available too, heh...

Maybe they can go back to the golden days of sticking a modem & POTS line on the console port of their routers/switches.  Or get super fancy and deploy terminal servers at each datacenter to conserve on phone lines :-D.



----- Original Message -----
From: "George Metz" <george.metz at gmail.com>
To: "Ross Tajvar" <ross at tajvar.io>
Cc: "Outages List" <outages-discussion at outages.org>
Sent: Tuesday, October 5, 2021 12:12:23 PM
Subject: Re: [Outages-discussion] FB Outage AAR I - Engineering Posts Pabulum

If I had to guess (because I was wondering about that too), it was
because they didn't have the IPs of their out-of-band stuff available
and were expecting their DNS to be able to answer... and the DNS
servers functionally shut themselves off.

On Tue, Oct 5, 2021 at 2:14 PM Ross Tajvar <ross at tajvar.io> wrote:

"Our primary and out-of-band network access was down"

Sounds like someone doesn't know what "out-of-band" means.

On Tue, Oct 5, 2021, 2:04 PM Jay R. Ashworth <jra at baylink.com> wrote:

This doesn't say anything we don't already know, except where it conflicts
with things we already know.  But it's fun to watch, ain't it?  ;-)

 https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

Cheers,
-- jra

--
Jay R. Ashworth                  Baylink                       jra at baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274
_______________________________________________
Outages-discussion mailing list
Outages-discussion at outages.org
https://puck.nether.net/mailman/listinfo/outages-discussion
