When Amazon’s S3 cloud service suffered an outage at the end of February, it took down large parts of the internet that rely on Amazon’s platform. The culprit was human error: An Amazon employee debugging the S3 billing system made a typo, as Amazon explained.
The employee had intended to take a few servers offline, but too many servers were taken down, creating a cascading failure in which subsystems critical to S3’s operations went down. The subsystems then needed to be fully restarted, a process that took down many internet services with it.
Businesses increasingly rely on the cloud for critical functions. Those that are prepared also have disaster recovery solutions in place for external threats such as natural disasters or hacking. But how can businesses avoid crucial cloud services being taken offline, and how can they mitigate a disaster that happens inside their cloud service provider?
Amazon says it modified its tools to remove capacity more slowly “and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.” The firm says this “will prevent an incorrect input from triggering a similar event in the future.” Amazon is also auditing its other operational tools to ensure they have similar safety checks, and it will make changes to improve the recovery time of key S3 subsystems.
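The two safeguards Amazon describes, a minimum-capacity floor and slower removal, can be illustrated with a minimal sketch. This is not Amazon's tooling: the Subsystem class, the plan_removal function and the max_per_step parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subsystem:
    # Hypothetical model of a subsystem, not Amazon's actual data structure.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises ValueError if the removal would take the subsystem below its
    minimum required capacity, so a mistyped number is rejected up front.
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise ValueError(
            f"refusing to remove {requested} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )

    # Remove capacity slowly: split the request into small batches.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

In this sketch the check runs before any server is touched, so an oversized request fails as a whole instead of being partly carried out.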
The first thing a business needs to do is have a disaster recovery plan in place. Businesses need to make sure their mission-critical IT can be replicated and brought back online in the event of a disaster. To do so, companies need to catalog and prioritize their data and IT systems as well as form a business continuity team to help inform IT of exactly what needs to be up and running and what’s most important. They also need to back up data and invest in offsite IT resources, and should make recovery automatic. Finally, they need to document all of this.
Aidan Finn, who writes about Microsoft virtualization for the Petri IT Knowledgebase and is the technical sales lead for MicroWarehouse Ltd., a Microsoft value-added distributor in Ireland, says on Petri IT that businesses should not assume they get “disaster recovery by default” from their cloud vendor.
“Everything in the cloud is a utility, and every utility has a price,” he says. “If you want it, you need to pay for it and deploy it, and this includes a scenario in which a data center burns down and you need to recover. If you didn’t design in and deploy a disaster recovery solution, you’re as cooked as the servers in the smoky data center.”
Although the Amazon outage was caused by an honest mistake, no company wants to give users unfettered control to take down critical systems.
According to a paper by Jon Mark Allen that was published by the SANS Institute, a company specializing in information security and cybersecurity training, companies can and should implement critical security controls, even when operating in the cloud.
These controls include creating a complete inventory of authorized devices. Another critical access control that should be put in place revolves around administrative privileges; as Allen writes: “employees have access to do their job, but no more — or less — access than is actually needed.”
Companies can use identity and access management tools, often built into cloud platforms, to create administrative accounts that can be granted granular permissions across the entire cloud infrastructure, Allen notes. Additionally, Allen points out that Amazon’s “Best Practices” document “recommends the root account credentials be stored away safely, and general use accounts created for each system administrator or service that requires access.”
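The principle Allen describes can be sketched as a default-deny permission check. The role names, action strings and is_allowed function below are assumptions made for illustration; they are not the API of any cloud provider's identity and access management service.

```python
# Hypothetical roles and actions, invented for this example.
ROLE_PERMISSIONS = {
    "billing-debugger": {"billing:ReadLogs", "billing:RestartWorker"},
    "storage-operator": {"storage:ReadMetrics", "storage:RemoveCapacity"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the action was explicitly granted to the role.

    Unknown roles and unlisted actions are denied by default.
    """
    return action in ROLE_PERMISSIONS.get(role, frozenset())
```

Under these assumed roles, an administrator debugging billing could restart a billing worker but could not remove storage capacity, because that action was never granted to the role.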
In addition to creating access controls that limit which services users can reach and what changes they can make to cloud systems, it’s also wise for businesses to work with multiple cloud providers. That way, if one provider experiences a failure, the company’s services that depend on that platform don’t go offline as well.
However, before deciding to go with multiple cloud providers, organizations can spread applications and workloads across multiple availability zones or sets within regions, as a Networks Asia report notes. Users can also spread their apps across multiple regions.
“The ultimate protection would be to deploy the application across multiple providers, for example using Microsoft Azure, Google Cloud Platform or some internal or hosted infrastructure resource as a backup,” the report notes.
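The layered approach the report describes (zones first, then regions, then a second provider) can be sketched as an ordered failover list. The endpoint labels and the pick_endpoint function are hypothetical, and the health check is passed in as a plain function because a real one would depend on each provider's own tooling.

```python
from typing import Callable, Optional, Sequence

# Hypothetical deployment targets, ordered from most to least preferred.
ENDPOINTS = [
    "provider-a/region-1/zone-a",
    "provider-a/region-1/zone-b",  # another zone in the same region
    "provider-a/region-2/zone-a",  # another region
    "provider-b/region-1/zone-a",  # backup provider
]


def pick_endpoint(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy endpoint in preference order, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

If every endpoint under provider-a fails its health check, this sketch falls through to provider-b, which is the case the report calls the ultimate protection.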
Identifying how much a business depends on the cloud to function could help determine which steps are needed. If a company relies heavily on the cloud, it will want to make its operations as fault tolerant as possible.
Finn notes that using multiple cloud service providers (CSPs) “isn’t a VM replication solution; it’s a data replication solution, so I would have to run networks, storage, and virtual machines in the secondary cloud.”
Finn adds that using multiple CSPs requires companies to “manage multiple vendor contracts, maybe including a third ‘witness’ cloud vendor to redirect clients between the primary and secondary clouds.” It also entails more complex infrastructure and higher costs, he says.