[Outages-discussion] S3 Outages Postmortem

Michael Christian mfletcherchristian at yahoo.com
Thu Mar 2 02:44:37 EST 2017


The outage was abrupt, but the recovery came in stages: read traffic first, followed by write traffic ~1.5 hours later. That makes me suspect a power problem or automation gone awry. We always blame the network team, but that rings hollow to me here.

On strategy, I am fully behind prioritizing read traffic recovery over write traffic. That priority is evolving over time, but it still holds for most use cases.

For those saying "who cares," you may not appreciate how many tightly integrated systems are out there these days. This took down a huge number of correlated services, and it shouldn't have. We need looser coupling.

- Mike Christian



> On Mar 1, 2017, at 11:25 AM, Chapman, Brad (NBCUniversal) <Brad.Chapman at nbcuni.com> wrote:
> 
> “…lots of services affected…”
>  
> Well, that was pretty obvious from the dashboard yesterday:
>  
> https://i.imgur.com/xTec0Bn.png
>  
> -Brad
>  
> From: Outages-discussion [mailto:outages-discussion-bounces at outages.org] On Behalf Of Kevin Blackham
> Sent: Wednesday, March 1, 2017 11:17 AM
> To: Bob Strecansky <bob at mailchimp.com>
> Cc: outages-discussion at outages.org
> Subject: Re: [Outages-discussion] S3 Outages Postmortem
>  
> I have some insights, but I'm under NDA. This was big enough I expect some public disclosure (my words).
>  
> I can tell you we observed lots of services affected, not just S3. EBS was jacking up IO all over the place, and many machines didn't even ping. SES was quite broken, as was autoscaling. One might conclude it was a network problem.
>  
> On Mar 1, 2017 12:09, "Bob Strecansky" <bob at mailchimp.com> wrote:
> Has anyone heard anything about why S3 was down for 5 hours yesterday? Usually Amazon doesn't post postmortems, and I'm curious as to what happened.
>  
> Thanks,
>  
> Bob Strecansky
> --
> Thanks,
> 
> -B
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion
> 