Weathering The Storm In Amazon’s Cloud

[This post originally appeared on the Crowd Fusion website, which has been replaced by ceros.com.]

As a company that works with several enterprise customers in Amazon’s cloud, Crowd Fusion would like to remind everyone what life outside of the cloud can be like. Remember in 2007, when a Texas pickup truck rammed into a Rackspace data center and took a large part of the internet offline for three hours? That was the result of one truck.

We are much happier living in the cloud than we were back when we ran traditional servers. The cloud offers multiple benefits that are attractive to both us and our customers, including the ability to launch a multitude of servers when traffic starts to heat up for a breaking news story.

However, we are also intimately aware that to live in the cloud, you have to be more diligent about architecting and planning for failure. There aren’t any trucks that are going to run into your data centers, but there are days like Thursday, April 21, 2011, when Amazon Web Services experienced escalating problems in one of their regions.

The Morning

We experienced EBS failures in the early morning on Thursday. As luck would have it, they affected only 4 of our 28 EBS-based instances, and none of those instances were single points of failure. As a result, none of our clients (like TMZ, Tecca, and News Corp’s The Daily) experienced any downtime during the early part of Amazon’s issues. We run multiple accounts, and Amazon rotates the availability zone names per account, so it’s unclear how many of those instances were in the one affected availability zone. We suffered temporarily degraded performance until we removed those instances from our application’s connection pool.
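Pulling a bad database host out of rotation is, at bottom, a health-check problem. The sketch below shows the general idea, not our actual code: it assumes a pymysql-based check, and the pool hostnames, credentials, and the `usable_replicas` helper are all hypothetical.

```python
import pymysql

# Illustrative read pool; these hostnames and credentials are placeholders,
# not our actual topology.
READ_POOL = [
    "db-slave-1.example.internal",
    "db-slave-2.example.internal",
    "db-slave-3.example.internal",
]

def is_healthy(host, timeout=2):
    """Return True if the replica answers a trivial query within the timeout."""
    try:
        conn = pymysql.connect(host=host, user="monitor", password="secret",
                               connect_timeout=timeout)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
            return True
        finally:
            conn.close()
    except pymysql.MySQLError:
        return False

def usable_replicas(pool=READ_POOL):
    """Filter the pool down to replicas that still respond, so degraded
    EBS-backed hosts drop out of rotation instead of stalling requests."""
    return [host for host in pool if is_healthy(host)]
```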

At roughly 10:30am EDT, a representative from Amazon Gold Support indicated that we would be unable to provision new instances in any US East availability zone because the EBS API queues were saturated with requests. We asked whether we could expect our currently running EBS instances to fail, or whether we were simply unable to use the EBS API to create, restore, and back up volumes. Our rep answered, “Currently running instances are not affected. This only affects the ability to restore and launch.” We were still 100% up, just on less hardware, so we accepted temporarily higher utilization while Amazon worked through the EBS API issues. Amazon asked us to disable all of our EBS API calls in order to help alleviate their queue problem.
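One straightforward way to honor that kind of request is to put a kill switch in front of every snapshot and volume call. Here is a minimal sketch using today’s boto3 client rather than the tooling we ran in 2011; the `EBS_API_DISABLED` environment variable and the `snapshot_volume` helper are illustrative assumptions, not our production code.

```python
import os
import boto3

# Hypothetical kill switch: set EBS_API_DISABLED=1 to skip every snapshot call,
# which is the kind of pause Amazon asked for while their request queues drained.
EBS_API_DISABLED = os.environ.get("EBS_API_DISABLED") == "1"

def snapshot_volume(volume_id, description, region="us-east-1"):
    """Create an EBS snapshot unless the kill switch is on."""
    if EBS_API_DISABLED:
        print(f"EBS API calls disabled; skipping snapshot of {volume_id}")
        return None
    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_snapshot(VolumeId=volume_id, Description=description)
```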

The advice we received from Gold Support was corroborated by an Amazon Health Status update at 11:54am EDT:

We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

Contingency Planning

In the past, we have seen Amazon EBS volumes suffer degraded performance, often during EBS snapshot operations. At those times we have had to disable EBS snapshots, and sometimes pull MySQL slave databases from our application’s connection pool or promote a MySQL slave to master. After the second time this happened, we decided it was prudent to keep a replicating MySQL slave in the US West region as a contingency plan.
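A warm standby like that is just classic asynchronous MySQL replication pointed across regions. The following is a minimal sketch of the statements involved, driven from Python with pymysql; the endpoints, credentials, and binlog coordinates are placeholders, and in practice the standby must first be seeded from a consistent backup of the master.

```python
import pymysql

# Placeholder endpoints and credentials; the standby must be seeded from a
# consistent backup of the US East master before replication is started.
US_EAST_MASTER = "master.us-east-1.example.internal"
US_WEST_STANDBY = "standby.us-west-1.example.internal"

def attach_cross_region_slave(binlog_file, binlog_pos):
    """Point the US West standby at the US East master using classic
    asynchronous MySQL replication."""
    conn = pymysql.connect(host=US_WEST_STANDBY, user="repl_admin",
                           password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, "
                "MASTER_PASSWORD=%s, MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
                (US_EAST_MASTER, "repl", "repl_password",
                 binlog_file, int(binlog_pos)),
            )
            cur.execute("START SLAVE")
    finally:
        conn.close()
```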

We had a disaster recovery process in place for spinning up our entire infrastructure in Amazon’s US West region in under an hour.
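The shape of that process is roughly “launch a batch of instances per role from pre-built images in the other region.” The sketch below illustrates the idea with the modern boto3 API rather than the 2011-era tools we actually used; the region, AMI IDs, roles, counts, and instance types are all made up for illustration.

```python
import boto3

# Illustrative role map; real AMI IDs, counts, and instance types would come
# from configuration kept mirrored in the US West region.
WEST_REGION = "us-west-1"
ROLES = {
    "web": {"ami": "ami-11111111", "count": 4, "type": "m1.large"},
    "app": {"ami": "ami-22222222", "count": 4, "type": "m1.large"},
}

def launch_west_coast_stack():
    """Launch one batch of instances per role in the US West region."""
    ec2 = boto3.resource("ec2", region_name=WEST_REGION)
    launched = []
    for role, spec in ROLES.items():
        launched.extend(ec2.create_instances(
            ImageId=spec["ami"],
            MinCount=spec["count"],
            MaxCount=spec["count"],
            InstanceType=spec["type"],
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "Role", "Value": role}],
            }],
        ))
    return launched
```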

Loss of Master Database

At 12:45pm EDT, one of our customers reported problems posting in their CMS. At 12:50pm EDT, their MySQL master database went to 100% CPU I/O wait and was unavailable. For a customer whose business is publishing stories ahead of the competition, this outage was mission critical.

Normally, our immediate response would be to promote a slave and continue to operate in the US East region. But because there was no indication from Amazon that these issues weren’t spreading across all EBS volumes in all US East availability zones, we decided the best course of action was to fail over to the US West region. Less than 45 minutes later, our customer was back online. Roughly an hour after that, things were stable enough that they were able to resume posting breaking news.
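For context, promoting a slave comes down to a few MySQL statements plus repointing the application at the new master. A rough sketch, with a placeholder host and credentials:

```python
import pymysql

def promote_standby_to_master(host="standby.us-west-1.example.internal"):
    """Stop replication on the standby and make it writable; the application's
    configuration is then repointed at this host as the new master."""
    conn = pymysql.connect(host=host, user="repl_admin", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("STOP SLAVE")                # stop applying events from the old master
            cur.execute("RESET SLAVE")               # discard the old replication coordinates
            cur.execute("SET GLOBAL read_only = 0")  # allow writes from the CMS
    finally:
        conn.close()
```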

Amazon Recovery Call

The next day, on Friday at 3:30pm EDT, our client’s master database in the US East region finally recovered. It had been at 100% CPU I/O wait for over 26.5 hours. That’s 25 hours after we had our customer back up and running on the West Coast. Two hours after the database recovered, we received a follow-up phone call from Amazon support informing us that our EBS volume had recovered.

To confirm that their recovery process had worked, the representative asked whether our volume had been restored to the state it was in before it was lost. It had. We had not lost any data.

When we informed the Amazon representative that we had failed over to the West Coast and no longer needed this instance, he urged us to decommission all the US East instances we were not using in order to free capacity in that region.

He was impressed that we had successfully failed over to the US West region when so many others were still down and said: “You were one of the very few to have a West coast contingency plan and recover quickly. Bravo.”

Plan for Failure

When designing large-scale web applications, if you are not designing for failure in every piece of infrastructure, it’s not a matter of if you’ll fail; it’s a matter of when. This is not specific to the cloud, but the cloud makes planning for failure even more essential.

As much as we hate to admit it, the cloud is simply more susceptible to failure than dedicated hardware. There are many reasons for this, but the most important one is complexity: there are just more moving parts. There are layers of virtualization, there are multiple tenants, and there are APIs built by the cloud providers to control hardware resources programmatically. But the advantages that come with this complexity, more flexibility and more cost efficiency, far outweigh the drawbacks.

We were downright lucky we had a US West contingency plan. It is an expensive endeavor to have multiple mirrored instances running on the more expensive coast just in case everything goes down. We hoped we’d never have to actually use it, but in hindsight, it was the best possible solution to the situation, and a solution we will continue to use in the future. And like most companies affected by this outage, we already have many planned improvements to our application’s tolerance for failure.

Amazon EBS

Amazon’s EBS technology continues to be the best AWS cloud solution for MySQL database storage. For larger instances, the performance difference compared to an instance’s local drive is substantial. The EBS snapshot feature allows us to create backups without putting additional strain on our instances, and restoring from a snapshot is faster than rebuilding a database from a conventional dump.
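Restoring from a snapshot means cutting a fresh volume from the snapshot and attaching it to an instance, rather than copying the data back yourself. A hedged boto3 sketch of that flow, with every ID and the device name as placeholders:

```python
import boto3

def restore_volume_from_snapshot(snapshot_id, instance_id, az, device="/dev/sdf"):
    """Create a new EBS volume from a snapshot and attach it to an instance.
    All IDs and the device name are placeholders for illustration."""
    ec2 = boto3.client("ec2", region_name="us-east-1")
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=az)
    # Wait until the new volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(VolumeId=volume["VolumeId"],
                      InstanceId=instance_id, Device=device)
    return volume["VolumeId"]
```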

The major disadvantages of using EBS are periodic performance degradation and reliability concerns. Like all things in the cloud, as long as you plan for the failure of an EBS volume, it is still the best possible option.

It’s been argued that the sites that didn’t go down don’t rely on EBS, but relying on any one piece to be 100% available is still a single point of failure you have to plan for. During one of our EBS failures, we were actually able to restore functionality within minutes by restoring data to the local disk instead of EBS.
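That kind of local-disk restore is nothing exotic: pull a recent logical backup out of S3 onto the instance’s own (non-EBS) disk and load it into MySQL. A rough sketch, with the bucket, key, paths, and database name all as placeholders:

```python
import subprocess
import boto3

def restore_to_local_disk(bucket="backups-placeholder", key="db/latest.sql.gz",
                          local_path="/mnt/restore/latest.sql.gz"):
    """Download a recent logical backup from S3 to the instance's local
    (non-EBS) disk and load it into MySQL. All names are placeholders."""
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    # Decompress and pipe the dump into a MySQL server backed by local storage.
    subprocess.run(f"gunzip -c {local_path} | mysql -u root cmsdb",
                   shell=True, check=True)
```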

We’re Sticking With Amazon

Amazon is the only cloud provider that allows us to fail over to another region without extensive effort, and their growing geographical coverage gives us even more failover options.

Amazon is also leading the pack with more cloud services like SQS, CloudFormation, S3, SimpleDB, SNS, EMR, RDS, etc. As other cloud providers advance their features and options, the choice won’t be as easy.

Update: Amazon finally issued their apology with a summary of the outage details.

Published by Brian Alvey

I build software that makes creative people more powerful.
