[Outages-discussion] Another AWS-East outage?

Jeremy Chadwick jdc at koitsu.org
Wed Dec 22 10:57:46 EST 2021


AWS's Status Page still has remnants of two outages, which I've copied
below.

For EC2 in us-east-1, "API Error Rates": started 12/22 and is still
ongoing (power outage in an AZ).  This one looks specific to AZ ID
USE1-AZ4.  Which AZ *name* that maps to varies on a per-customer basis
(e.g. us-east-1a for Customer X is not necessarily the same physical
zone as us-east-1a for Customer Y).  Refer to AWS Resource Access
Manager for the mapping for your account, or to the document
"Availability Zone IDs for your AWS resources" (there's also a quick
CLI check after the status updates below):

* 4:35 AM PST We are investigating increased EC2 launch failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.
* 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
* 5:18 AM PST We continue to make progress in restoring power to the affected data center within the affected Availability Zone (USE1-AZ4) in the US-EAST-1 Region. We have now restored power to the majority of instances and networking devices within the affected data center and are starting to see some early signs of recovery. Customers experiencing connectivity or instance availability issues within the affected Availability Zone, should start to see some recovery as power is restored to the affected data center. RunInstances API error rates are returning to normal levels and we are working to recover affected EC2 instances and EBS volumes. While we would expect continued improvement over the coming hour, we would still recommend failing away from the Availability Zone if you are able to do so to mitigate this issue.
* 5:39 AM PST We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. Network connectivity within the affected Availability Zone has also returned to normal levels. While all services are starting to see meaningful recovery, services which were hosting endpoints within the affected data center - such as single-AZ RDS databases, ElastiCache, etc. - would have seen impact during the event, but are starting to see recovery now. Given the level of recovery, if you have not yet failed away from the affected Availability Zone, you should be starting to see recovery at this stage.
* 6:13 AM PST We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. We continue to make progress in recovering the remaining EC2 instances and EBS volumes within the affected Availability Zone. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. We have a small number of affected EBS volumes that are still experiencing degraded IO performance that we are working to recover. The majority of AWS services have also recovered, but services which host endpoints within the customer’s VPCs - such as single-AZ RDS databases, ElasticCache, Redshift, etc. - continue to see some impact as we work towards full recovery.
* 6:51 AM PST We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. For the remaining EC2 instances, we are experiencing some network connectivity issues, which is slowing down full recovery. We believe we understand why this is the case and are working on a resolution. Once resolved, we expect to see faster recovery for the remaining EC2 instances and EBS volumes. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. Note that restarting an instance at this stage will not help as a restart does not change the underlying hardware. We have a small number of affected EBS volumes that are still experiencing degraded IO performance that we are working to recover. The majority of AWS services have also recovered, but services which host endpoints within the customer’s VPCs - such as single-AZ RDS databases, ElasticCache, Redshift, etc. - continue to see some impact as we work towards full recovery.
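
If you're trying to figure out which of your AZ names maps to
USE1-AZ4, or what you have running there, a quick CLI check (a rough
sketch, assuming the AWS CLI is configured for the affected account;
"us-east-1c" in the second command is just a placeholder for whatever
name the first command shows for your account):

  # Show the AZ-name -> AZ-ID mapping for this account in us-east-1
  aws ec2 describe-availability-zones --region us-east-1 \
      --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output table

  # List instances in whichever zone name maps to use1-az4 for you
  # (us-east-1c is only an example -- substitute your own mapping)
  aws ec2 describe-instances --region us-east-1 \
      --filters Name=availability-zone,Values=us-east-1c \
      --query 'Reservations[].Instances[].InstanceId' --output text

  # Spot EBS volumes whose status is not "ok" (e.g. impaired I/O)
  aws ec2 describe-volume-status --region us-east-1 \
      --query "VolumeStatuses[?VolumeStatus.Status!='ok'].[VolumeId,VolumeStatus.Status]" \
      --output table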

And for Elastic Beanstalk, also filed under us-east-1 (though the
updates mention multiple Regions), "Console Application Upload
Errors": started 12/21, spanned into 12/22, and *is* resolved (a
sketch of the CLI workaround follows the updates):

* 11:33 PM PST We are investigating an issue where customers are unable to upload and deploy new application versions through the Elastic Beanstalk console in multiple Regions. Customers who need to update or deploy a new application version should do so using the AWS CLI. Existing applications are not impacted by this issue
* Dec 22, 12:34 AM PST We continue to investigate an issue where customers are unable to upload and deploy new application versions through the Elastic Beanstalk console in multiple Regions. We are determining the root causes and working through steps to mitigate the issue. Customers who need to update or deploy a new application version should do so using the AWS CLI while we work towards resolving the issue. Existing applications are not impacted by this issue.
* Dec 22, 1:20 AM PST We have identified the root cause and prepared a fix to address the issue that prevents customers from uploading new application versions through the Elastic Beanstalk console in multiple Regions. The service team is testing this fix and preparing for deployment to the Regions that are affected by this issue. We expect to see full recovery by 3:00 AM PST and will continue to keep you updated if this ETA changes. Customers who need to update or deploy a new application version should do so using the AWS CLI until the issue is fully resolved.
* Dec 22, 3:21 AM PST Between December 21, 2021 at 6:37 PM and December 22, 2021 at 03:17 AM PST, customers were unable to upload their code through the Elastic Beanstalk console due to a Content Security Policy (CSP) error. Customers were impacted when they attempted to upload a new application version for existing environments or upload their code when creating a new environment in multiple regions. The issue has been resolved and the service is operating normally.
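
(The CLI workaround AWS mentions boils down to roughly the following;
a sketch only -- the application name, environment name, bucket, key,
and version label are all made-up placeholders:)

  # Upload the application bundle to S3
  aws s3 cp app-v42.zip s3://my-deploy-bucket/app-v42.zip

  # Register it as a new application version
  aws elasticbeanstalk create-application-version \
      --application-name my-app --version-label v42 \
      --source-bundle S3Bucket=my-deploy-bucket,S3Key=app-v42.zip

  # Point the environment at the new version
  aws elasticbeanstalk update-environment \
      --environment-name my-env --version-label v42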

-- 
| Jeremy Chadwick                              jdc_at_koitsu.org |
| UNIX Systems Administrator                      PGP 0x2A389531 |
| Making life hard for others since 1977.                        |

On Wed, Dec 22, 2021 at 08:55:24AM -0600, Andy Ringsmuth wrote:
> https://www.wsj.com/articles/amazon-web-services-suffers-another-outage-11640182421
> 
> Not sure how big it was but must not have been too horrible as there wasn’t any mention of it on the outages list.
> 
> ----
> Andy Ringsmuth
> 5609 Harding Drive
> Lincoln, NE 68521-5831
> (402) 304-0083
> andy at andyring.com
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion

