How to work around Amazon EC2 outages

Today’s Amazon EC2 outages (which at the time of writing are still ongoing) have meant downtime for lots of their customers, including household names like Quora, Foursquare and reddit. The problem is with their Elastic Compute Cloud (EC2) service in one of the availability zones in the US East (N. Virginia) region.

Problems like this are often localised to one availability zone (datacentre), which gives you a number of ways of working around them.

Elastic IP Addresses

By using an Elastic IP Address you can bring up a new instance in another availability zone and then bind the Elastic IP to it. This would likely need some manual intervention on your part, and you’d need to make sure that you had a recent enough EBS backup or snapshot to resume from.
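
As a rough illustration, here’s a minimal sketch of re-pointing an existing Elastic IP at a replacement instance launched in a healthy zone, using boto3 (the AWS SDK for Python); the allocation and instance IDs are placeholders:

    # Hypothetical sketch: re-point an Elastic IP at a replacement instance
    # in a healthy availability zone. IDs below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ALLOCATION_ID = "eipalloc-0123456789abcdef0"   # your existing Elastic IP
    REPLACEMENT_INSTANCE = "i-0123456789abcdef0"   # instance launched in a healthy AZ

    # Detach the address from the failed instance (if still associated)
    # and bind it to the replacement.
    ec2.associate_address(
        AllocationId=ALLOCATION_ID,
        InstanceId=REPLACEMENT_INSTANCE,
        AllowReassociation=True,
    )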

Elastic Load Balancing

Using an Elastic Load Balancer you can spread the load between servers in multiple availability zones. This could allow you to have, for example, one web server in each US East zone, so the loss of one zone like today’s should be handled transparently. This is easy to implement for a simple website, but to get full redundancy of backend data (in an RDBMS, etc.) you’d need to set up appropriate data replication there too. In theory this approach should make a zone failure completely transparent to your users.
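
A hedged sketch of what this might look like with boto3, assuming a classic ELB and placeholder instance IDs, one per zone:

    # Hypothetical sketch: a classic Elastic Load Balancer spanning several
    # availability zones in US East, with one web server registered per zone.
    # The name, instance IDs and zones are placeholders.
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    elb.create_load_balancer(
        LoadBalancerName="www-multi-az",
        Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                    "InstanceProtocol": "HTTP", "InstancePort": 80}],
        AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
    )

    # One web server per zone; if a zone fails, the ELB stops routing to it.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="www-multi-az",
        Instances=[{"InstanceId": "i-aaaa111122223333a"},
                   {"InstanceId": "i-bbbb111122223333b"},
                   {"InstanceId": "i-cccc111122223333c"}],
    )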

Low DNS TTLs

If you’re not willing to pay for Elastic IPs or Elastic Load Balancing then you could manually redirect traffic in the event of an outage to a new AWS instance, or for that matter to another hosting provider entirely. The key is to keep your DNS TTLs low ahead of time so the change propagates quickly. Read more about DNS TTLs here: Using DNS TTL to control migrations
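
If you happen to use Route 53 for DNS, the manual redirect could be scripted along these lines (a sketch only; the zone ID, hostname and IP address are placeholders):

    # Hypothetical sketch: with a low TTL (say 60 seconds) already in place,
    # an outage response can be as simple as re-pointing the A record at a
    # standby instance via Route 53.
    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789ABCDEFGHIJ",
        ChangeBatch={
            "Comment": "Fail over to standby instance during EC2 outage",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 60,                      # keep this low *before* the outage
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )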

Disaster Recovery and Backups

You need to decide what level of Disaster Recovery you require. It’s usually a trade-off between the cost of the downtime to your business and the cost of implementing the recovery plan. You could decide that in the event of a rare outage it’s acceptable to just display a “sorry, we’re having problems” page served from an instance that you only bring up when problems occur. If your requirement is to bring up a full copy of the site in a new zone, here are some suggestions as to how you could do this.

Amazon Elastic Block Store (EBS) supports snapshotting, and snapshots are persisted to Amazon S3 and available across all zones in the region. This is a great way of keeping backups if you can live with resuming from slightly older snapshotted data. All you need to do is bring up a new instance in one of the fully functioning zones and attach an EBS volume created from the snapshot.
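
A minimal sketch of that restore step with boto3 (the snapshot, instance and device names are placeholders):

    # Hypothetical sketch: restore a snapshot into a healthy availability zone
    # and attach the resulting volume to a freshly launched instance.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create a new volume in a working zone from an existing snapshot.
    volume = ec2.create_volume(
        SnapshotId="snap-0123456789abcdef0",
        AvailabilityZone="us-east-1b",          # a zone that is not affected
    )

    # Wait until the volume is ready, then attach it to the replacement instance.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",
        Device="/dev/sdf",
    )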

If resuming from snapshotted data isn’t acceptable then you’d need to look at implementing your own data replication. Almost all of the commonly used RDBMS and NoSQL systems support replication, and setting up replicas is fairly standard operationally.
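
As one small example, a hedged sketch of a health check that verifies a MySQL replica in another zone is keeping up before you rely on it (the host, credentials and threshold are placeholders; assumes the mysql-connector-python package):

    # Hypothetical sketch: check replication lag on a MySQL replica in a
    # different availability zone before failing over to it.
    import mysql.connector

    MAX_LAG_SECONDS = 30

    conn = mysql.connector.connect(
        host="replica.us-east-1b.internal",   # replica in a different AZ
        user="monitor",
        password="secret",
    )
    cursor = conn.cursor(dictionary=True)
    cursor.execute("SHOW SLAVE STATUS")
    status = cursor.fetchone()

    lag = status["Seconds_Behind_Master"] if status else None
    if lag is None or lag > MAX_LAG_SECONDS:
        print(f"WARNING: replica is lagging or broken (lag={lag})")
    else:
        print(f"Replica healthy, {lag}s behind the master")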

15 Responses to How to work around Amazon EC2 outages

  1. Pingback: What to Do When Your Cloud is Down « SmoothSpan Blog

  2. WR says:

    How can you bring up an instance “only in the event of problems” without suffering the DNS problem that your site will still be directed to the host that is down?

  3. A few of these options are good in principle, but are not necessarily informed by the reality of operational experience with the more-common failure modes of AWS at a medium to larger scale (~50-100+ instances).

    The author recommends using EBS volumes to provide for backups and snapshots. However, Amazon’s EBS system is one of the more failure-prone components of the AWS infrastructure, and lies at the heart of this morning’s outage [1]. Any steps you can take to reduce your dependence upon a service that is both critical to operation and failure-prone will limit the surface of your vulnerability to such outages. While the snapshotting ability of EBS is nice, waking up to a buzzing pager to find that half of the EBS volumes in your cluster have dropped out, hosing each of the striped RAID arrays you’ve set up to achieve reasonable IO throughput, is not. Instead, consider using the ephemeral drives of your EC2 instances, switching to a non-snapshot-based backup strategy, and replicating data to other instances and AZs to improve resilience (see the sketch below).
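
    A minimal sketch of the kind of non-snapshot backup described above: archive data held on an instance’s ephemeral disk and push it to S3, which is replicated across zones within the region (the paths and bucket name here are placeholders):

        # Hypothetical sketch: tar up data on the ephemeral drive and upload
        # it to S3 so the backup survives the loss of this instance or its AZ.
        import subprocess
        import time

        import boto3

        archive = f"/tmp/data-backup-{int(time.time())}.tar.gz"

        # Archive the data directory held on the ephemeral drive.
        subprocess.run(["tar", "czf", archive, "/mnt/ephemeral/data"], check=True)

        # Upload to S3, which is independent of any single availability zone.
        s3 = boto3.client("s3")
        s3.upload_file(archive, "example-backups-bucket", archive.lstrip("/"))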

    The author also recommends Elastic Load Balancers to distribute load across services in multiple availability zones. Load balancing across availability zones is excellent advice in principle, but still succumbs to the problem above in the case of EBS unavailability: ELB instances are also backed by Amazon’s EBS infrastructure. ELBs can be excellent day-to-day and provide some great monitoring and introspection. However, having a quick chef script to spin up an Nginx or HAProxy balancer and flipping DNS could save your bacon in the event of an outage that also affected ELBs, like today’s.

    With each service provider incident, you learn more about your availability, dependencies, and assumptions, along with what must improve. Proportional investment following each incident should reduce the impact of subsequent provider issues. Naming and shaming providers in angry Twitter posts will not solve your problem, and it most certainly won’t solve your users’ problem. Owning your availability by taking concrete steps following each outage to analyze what went down and why, mitigating your exposure to these factors, and measuring your progress during the next incident will. It is exciting to see these investments pay off.

    Some of these steps:

    – *Painfully* thorough monitoring of every subsystem of every component of your infrastructure. When you get paged, it’s good to know *exactly* what’s having issues rather than checking each manually in blind suspicion.

    – Threshold-based alerting.

    – Keeping failover for all systems as automated, quick, and transparent as is reasonably possible.

    – Spreading your systems across multiple availability zones and regions, with the ideal goal of being able to lose an entire AZ/region without a complete production outage.

    – Team operational reviews and incident analysis that expose the root cause of an issue, but also spider out across your system’s dependencies to preemptively identify other components which are vulnerable to the same sort of problem.

    [1] See the response from AWS in the first reply here: https://forums.aws.amazon.com/thread.jspa?messageID=239106&tstart=0

  4. JB says:

    The other challenge with today’s outage (it is affecting us and many of our clients) is that Amazon’s RDS (MySQL database service) is down right now as well as EC2. We use an ELB across zones, but everything hits the same database, so this doesn’t help us. We are looking at other database options though, like http://www.xeround.com, which looks promising.

    Jeremy

  5. George says:

    or maybe old school 🙂 dnsmadeeasy.com and a backup server somewhere

  6. mike mainguy says:

    I like the practical ideas, but I have to agree with a previous poster. In addition, software should be designed for the cloud, or rather, designed for failure. Traditionally, we write software that gets all tangled up with “where” it’s deployed, and moving the software to a different place becomes difficult if not impossible.

    The RDBMS is a perfect example. Too often, I see shops that have massive redundancy all over the place, but then everything hinges on a single (even if HA) RDBMS. Typically, this over-reliance on a single point of failure is unnecessary and actually more expensive than using an alternative (NoSQL, for example).

  7. Pingback: Mestvork&knuppels uit de stal… de ‘cloud’ is down! « JANWIERSMA.COM

  8. Pingback: Survive AWS Judgement Day – Hank Lin

  9. RightScale customers as a whole haven’t been affected by the EC2 outage. As a best practice, we have our customers architect across multiple Amazon regions and/or public resource pools from other providers. This way, you have servers outside the failing region that you can fail over to. You can even automate that workflow to make the system self-healing.

  10. Pingback: Rainy-Day Roundup (AWS cloud Fail) – Opinionated

  11. Pingback: Amazon に起こった大規模ダウンタイムを分析する – Data Center Knowledge « Agile Cat — in the cloud with openness

  12. Pingback: Computed·By

  13. Ian says:

    This is a great guide to configuring AWS for availability zone outages, but AWS have also suffered entire REGIONAL outages which affect all availability zones in a specific region (e.g. August 8, 2011 in the US East region). Here’s Amazon’s update on that situation for information:

    “We wanted to provide more detail on the internet connectivity event that occurred from 7:25 PM PDT to 7:55 PM PDT on August 8th in our US East Region. The event affected connectivity between three different Availability Zones and the internet.
    The issue happened in the networks that connect our Availability Zones to the internet. All Availability Zones must have network connectivity to the internet and to each other (to enable a customer’s resources in one Availability Zone to communicate with resources in other Availability Zones). Our border and Availability Zone networks use standard routing protocols both to isolate themselves from potential failure in other Availability Zones and to assure continued connection to other Availability Zones or the internet in the face of a failure in any portion of our network. To prevent network issues in one Availability Zone from impacting any of our other Availability Zones, we use a network routing architecture we refer to as north/south. Northern routers are at the border, facing toward the internet. Southern routers are part of individual Availability Zones. To prevent Availability Zones from being able to impact each other’s routes to the internet, we use standard routing protocols to prevent southern routers in one Availability Zone from advertising internet routes to any other southern router in another Availability Zone. Southern routers are also prohibited from telling northern routers what routes to use. This causes routes to only propagate from north to south.
    The event began when a southern router inside one of our Availability Zones briefly stopped exchanging route information with all adjacent devices, going into an incommunicative state. Upon re-establishing its health, the router began advertising an unusable route to other southern routers in other Availability Zones, deviating from its configuration and bypassing the standard protocol restriction on how routes are allowed to flow. The bad default internet route was picked up and used by the routers in other Availability Zones. Internet traffic from multiple Availability Zones in US East was immediately not routable out to the internet through the border. We resolved the problem by removing the router from service.
    We immediately identified that there were no human accesses or automated changes applied to that router. We also have hundreds of thousands of hours of operating experience with this particular router software and hardware configuration. As a result, we had a strong hypothesis that there was an unusual router software failure causing the router to violate the routing protocol. As might have been expected (given the long successful experience we’ve had with this configuration), reproducing the software failure was difficult. Late Wednesday night, working closely with the supplier of this router, we were able to reproduce the behavior and locate the software bug. It confirmed our hypothesis of protocol violation. We’ve developed a mitigation that can both prevent transmittal of a bad internet route and prevent another router from incorporating that route and using it. We’ve tested the mitigation thoroughly and are carefully deploying it throughout our network following our normal change and promotion procedures.
    We apologize for any impact this event may have caused our customers. We build with the tenet that even the most serious network failure should not impact more than one Availability Zone beyond the extremely short convergence times required by the network routing protocols. We will continue to work on eliminating any issues that could jeopardize that important tenet.”
