Amazon Web Services explains outage and will make it easier to track future ones
Adam Selipsky, CEO of Amazon Web Services, delivers a keynote address at the AWS re:Invent conference in Las Vegas on November 30, 2021.
Noah Berger | Getty Images
Amazon Web Services on Friday published an explanation of the hours-long outage earlier this week that disrupted its retail business and third-party online services. The company also announced that it will revise its status page.
Problems in Amazon’s large US-East-1 region of data centers in Virginia began at 10:30 a.m. ET on Tuesday, the company said.
“An automated activity to scale the capacity of one of the AWS services hosted on the AWS main network caused unexpected behavior in a large number of clients on the internal network,” the company wrote in a post on its website. As a result, devices connecting an internal Amazon network and the AWS network became overloaded.
Several AWS tools were affected, including the widely used EC2 service, which provides virtual server capacity. AWS engineers worked for the next several hours to resolve the issues and restore the services. The EventBridge service, which helps software developers build applications that take action in response to certain activity, didn’t fully recover until 9:40 p.m. ET.
Downtime can undermine the perception that cloud infrastructure is reliable and ready to handle the migration of applications from physical data centers. It can also have a significant impact on businesses. AWS has millions of customers and is the leader in the market.
AWS apologized for the impact the outage had on its customers.
Popular websites and heavily used services went offline, including Disney+, Netflix, and Ticketmaster. Roomba vacuums, Amazon’s Ring security cameras, and other internet-connected devices such as smart litter boxes and app-connected ceiling fans were also taken down by the outage.
Amazon’s own retail operations were brought to a standstill in some pockets of the U.S. Internal apps used by Amazon’s warehouse and delivery staff rely on AWS, so for most of Tuesday staff were unable to scan packages or access delivery routes. Third-party sellers were also unable to access a website used to manage customer orders.
During the outage, AWS tried to keep customers updated, but the company had trouble updating its status page, known as the Service Health Dashboard.
“Because the impact on services during this event was all due to a single cause, we chose to provide updates through a global banner on the Service Health Dashboard,” said AWS.
In addition, customers were unable to create support cases for seven hours during the interruption.
AWS said it is now taking action to address both of these issues.
“We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand the impact on the service, and a new support system architecture that is actively running in multiple AWS Regions to ensure there are no delays in communicating with customers,” AWS said.
It’s not the first time AWS has changed the way issues are reported.
In 2017, an outage in the popular AWS S3 storage service prevented engineers from displaying the correct color to indicate uptime on the Service Health Dashboard. Amazon posted banners and turned to Twitter to share new information.
“We changed the SHD management console to run in multiple AWS Regions,” Amazon said in a message about the episode.