[Outages-discussion] Amazon AWS Monday Outage preliminary postmortem

Jay Ashworth jra at baylink.com
Tue Oct 23 23:12:11 EDT 2012


Here's what they've posted to their status page, for those who wouldn't 
have thought to look there for it (I wouldn't have, except that I still had
it open in a tab tonight).

Note the All Clear was at +25h.

========
22nd Oct 10:38 AM PDT  We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.

22nd Oct 11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.

22nd Oct 11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.

22nd Oct 12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.

22nd Oct 1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. EC2 instances and EBS volumes outside of this availability zone are operating normally. Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery. Customers receiving this error can retry failed requests.

22nd Oct 2:20 PM PDT We've now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We're continuing to work on restoring availability and performance for the volumes that are still degraded. We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.

22nd Oct 3:48 PM PDT We are continuing to work to restore the remaining affected EBS volumes and the instances that are attached to them. We have been able to increase the rate of recovery in the last thirty minutes and hope to have the majority of the remaining volumes recovered shortly.

22nd Oct 4:42 PM PDT We have restored the ability to launch new EC2 instances in the affected Availability Zone. At this point, customers should be able to launch instances in any Availability Zone in the US-EAST-1 region. We are continuing to restore impaired volumes and their attached instances.

22nd Oct 5:44 PM PDT Performance for almost all affected volumes has recovered. We are continuing to work on restoring IO for the remainder of volumes. While almost all instances and volumes have recovered, many of the volumes affected by this event will undergo an additional re-mirroring. During this volume re-mirroring, customers may notice increased volume IO latency.

22nd Oct 6:33 PM PDT We are seeing elevated error rates on APIs related to describing and associating EIP addresses. We are working to resolve these errors. In addition, ELB is experiencing elevated latencies recovering affected load balancers and making changes to existing load balancers. These delays are a result of the EIP related API errors and will improve when that issue is resolved.

22nd Oct 7:36 PM PDT EIP related API calls are fully recovered. This has allowed ELB to continue to recover affected ELBs and we expect ELB to recover more quickly now.

22nd Oct 10:54 PM PDT ELB has now completed recovery of nearly all affected load balancers. We will continue to work to restore IO for the remainder of volumes and will reach out via email to affected customers that own those volumes should action be required on their part. Volumes affected earlier in the day are continuing to re-mirror (which we expect will take several more hours) and while this process continues, customers may notice increased volume IO latency.

23rd Oct 01:46 AM PDT We continue to work to restore IO for the remainder of affected volumes. We will reach out via email to any customers that own volumes for which action is required on their part. Volumes affected earlier in the day are continuing to re-mirror. While this process continues, customers may notice increased volume IO latency.

23rd Oct 4:21 AM PDT We are continuing to work on restoring IO for the remainder of affected volumes. This will take effect over the next few hours. While this process continues, customers may notice increased volume IO latency. The re-mirroring will proceed through the rest of today.

23rd Oct 6:33 AM PDT The remainder of the affected ELB load balancers have been recovered and the service is operating normally.

We have restored IO for the majority of EBS volumes. A small number of volumes will require customer action to restore IO. We are in the process of contacting these customers directly with instructions on how to return their volumes to service.

Volumes affected during this event are continuing to re-mirror (which we expect will continue through the remainder of the day). While this process continues, customers may notice increased volume IO latency.

23rd Oct 11:08 AM PDT The service is now operating normally. We will post back here with an update once we have details on the root cause analysis.
==========
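
A practical aside on the 1:02 PM item above: when the EC2 API sheds load by
handing back ResourceLimitExceeded, the client-side answer really is just
"retry, with backoff". A rough sketch of that using boto (the Python SDK of
the day); the AMI ID, instance type, and retry count are placeholders of
mine, not anything AWS specified:

import time

import boto.ec2
from boto.exception import EC2ResponseError

conn = boto.ec2.connect_to_region('us-east-1')

def run_instances_with_retry(ami_id, max_attempts=5):
    """Call run_instances, backing off and retrying while the API is
    shedding load (ResourceLimitExceeded), as AWS suggested."""
    for attempt in range(max_attempts):
        try:
            # Placeholder instance type; use whatever you actually run.
            return conn.run_instances(ami_id, instance_type='m1.small')
        except EC2ResponseError as e:
            if e.error_code != 'ResourceLimitExceeded':
                raise  # a real error, not load shedding
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... before retrying
    raise RuntimeError('still throttled after %d attempts' % max_attempts)

# e.g. reservation = run_instances_with_retry('ami-xxxxxxxx')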

While they haven't yet posted an RCA that I've seen, it is interesting
to note that even though past outages had a "data plane issue leaks over
into the control plane" cascade problem, one they specifically targeted
for a fix, it appears to have happened again in this outage.

Just a good reminder, I guess, that "being in the cloud" *does not* 
relieve a system architect of the obligation to do their own fault-
tolerance design and implementation, accounting for as many failures
as are cost-effective, *after making that determination in conjunction
with layers 9 and 10* (money, and lawyers).

https://en.wikipedia.org/wiki/Layer_8
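
To make the fault-tolerance point concrete: on Monday the difference mostly
came down to whether your capacity was pinned to the one impaired AZ or
spread across several. A rough boto sketch of the "spread it out" half of
that design; the AMI ID and fleet size are placeholders:

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Ask EC2 which zones are currently reporting 'available' rather than
# hard-coding a single AZ into the deploy scripts.
zones = [z.name for z in conn.get_all_zones() if z.state == 'available']

DESIRED = 4           # placeholder fleet size
AMI = 'ami-xxxxxxxx'  # placeholder image

# Round-robin the instances across the healthy zones, so losing one AZ
# (as happened Monday) degrades capacity instead of taking you down.
for i in range(DESIRED):
    conn.run_instances(AMI, instance_type='m1.small',
                       placement=zones[i % len(zones)])

Not a substitute for the layers 9 and 10 conversation, of course; it just
turns the single-AZ failure mode into a capacity problem instead of an
outage.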

Cheers,
-- jra
-- 
Jay R. Ashworth                  Baylink                       jra at baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 12

