March 17, 2017
On February 28th, Amazon Web Services (AWS), Amazon’s public cloud offering, suffered a major outage lasting about four hours. Though brief, the outage resulted in a $150 million loss for Amazon’s S&P 500 clients. It was a huge disaster that left thousands of web users without access to data housed in AWS. The cause? A mistyped command on a keyboard.
Disaster recovery is the practice of planning how to restore a business’s IT systems after they have been rendered unusable by just such a disaster. The misleading part of that definition is the word “disaster,” because people often assume it refers exclusively to natural events. Hurricanes, tornadoes, floods, and fires can absolutely damage a data center enough to take it offline, but they are exceedingly rare. Many “disasters” are actually caused by human error. No cataclysmic catastrophe is needed. As AWS proved, all it takes is a mistyped command on the keyboard.
The Cause of the AWS Outage
In an official summary, Amazon explained the cause of the outage. Essentially, the Amazon Simple Storage Service (S3) team was debugging an error in their Northern Virginia regional data center. The error was causing a few of their systems to run more slowly than expected. The process seemed rather routine, and the engineers had an established playbook guiding them through the troubleshooting. At this stage, the maintenance was being performed on only a small number of servers directly related to the error at hand.
In the process of powering down those servers to perform the maintenance, one of the engineers entered an incorrect command and took down more servers than intended. Those extra servers supported a fundamental subsystem that kept that data center operational. It all happened in an instant. The chain reaction caused by the incorrect command forced the team to restart several systems, which rendered a number of AWS services completely unavailable for a large region of the country.
Restarting these systems would not normally take four hours, but S3 has experienced significant growth in the last few years. In that time, neither the index subsystem nor the placement subsystem had been restarted as a whole. Amazon needed to perform a number of safety checks and metadata validations to ensure these subsystems would work properly once they were brought back online. The whole process took about four hours.
The Role of Disaster Recovery
A strong disaster recovery plan could not have prevented this disaster, but what helped Amazon overcome the crisis was a clear plan for what to do. Their engineers understood the processes needed to restart these systems; it simply took several hours to complete. Without this understanding, panic and pressure would have been sure to create confusion, exacerbate the problem, and delay the recovery.
Though this disaster was unpreventable and unexpected, it goes to show that human error can be just as destructive as anything nature can create.