April 21, 2011
Today’s Amazon EC2 outages (which at the time of writing are still ongoing) have meant downtime for lots of their customers, including household names like Quora, FourSquare and reddit. The problem is with their Elastic Compute Cloud (EC2) service in one of the availability zones in their US East (N. Virginia) region.
Problems like this are often localised to one availability zone (think of it as a separate datacentre), which gives you a number of ways of working round them.
Elastic IP Addresses
By using an Elastic IP Address you can bring up a new instance in another availability zone and then bind the Elastic IP to it. You’d likely have to do some of this by hand, and you’d need to make sure that you had a decent enough backup on EBS, or a snapshot, to resume from.
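As a rough sketch of that manual step, here is how it might look in Python using boto3's EC2 client (my choice of tooling, nothing Amazon requires). The AMI ID, instance type, zone name and allocation ID are placeholders for your own values. The function takes the client as a parameter so you can try it against a stub first.

```python
def fail_over_elastic_ip(ec2, ami_id, instance_type, healthy_zone, allocation_id):
    """Launch a replacement instance in a healthy zone and move the Elastic IP to it.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    All IDs passed in are placeholders supplied by the caller.
    """
    # Start one replacement instance, pinned to a zone that is still working.
    reservation = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": healthy_zone},
    )
    instance_id = reservation["Instances"][0]["InstanceId"]

    # Wait until the instance is running; association generally fails
    # while the instance is still pending.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Re-point the Elastic IP, taking it away from the unreachable instance.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,
    )
    return instance_id
```

The new instance still needs your data on it, so in practice the AMI would either be pre-baked or followed by a restore from EBS.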
Elastic Load Balancing
Using an Elastic Load Balancer you can spread the load between servers in multiple availability zones. You could have, for example, one web server in each US East zone, and the loss of one zone like today’s should in theory be completely transparent to your users. This is easy to implement for a simple website, but for full redundancy of backend data (in an RDBMS, etc.) you’d need to set up appropriate data replication there too.
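To make the idea concrete, here is a toy model in Python of what the load balancer is doing for you: it only hands requests to servers whose zone is healthy. This is a simplified illustration, not how ELB is actually implemented, and the class, server and zone names are all made up.

```python
from itertools import cycle


class ToyLoadBalancer:
    """Round-robin over backends, skipping any whose zone is marked unhealthy."""

    def __init__(self, backends):
        # backends maps a server name to the availability zone it runs in.
        self.backends = dict(backends)
        self.unhealthy_zones = set()
        self._order = cycle(sorted(self.backends))

    def mark_zone_down(self, zone):
        self.unhealthy_zones.add(zone)

    def route(self):
        # Try each backend at most once per request.
        for _ in range(len(self.backends)):
            server = next(self._order)
            if self.backends[server] not in self.unhealthy_zones:
                return server
        raise RuntimeError("no healthy backends left")


lb = ToyLoadBalancer({"web-a": "us-east-1a", "web-b": "us-east-1b"})
lb.mark_zone_down("us-east-1a")
print([lb.route() for _ in range(3)])  # every request lands on web-b
```

Note that the surviving server takes all of the traffic, so each zone needs enough spare capacity to cope on its own.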
Low DNS TTLs
If you’re not willing to pay for Elastic IPs or Elastic Load Balancing then you could keep the TTL on your DNS records low and, in the event of an outage, manually repoint them at a new AWS instance or at another ISP entirely. The low TTL means resolvers pick up the change quickly rather than caching the old address for hours. Read more about DNS TTLs here: Using DNS TTL to control migrations
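If your DNS happens to be hosted on Amazon Route 53 (an assumption on my part; any provider with an API or a control panel will do), the manual redirect could look roughly like this with boto3. The hosted zone ID, record name and IP address are placeholders, and the 60-second TTL is just one example of "low".

```python
def point_record_at(route53, hosted_zone_id, record_name, new_ip, ttl=60):
    """Repoint an A record at a replacement server, keeping the TTL low.

    `route53` is assumed to be a boto3 Route 53 client,
    e.g. boto3.client("route53"). All identifiers are placeholders.
    """
    change_batch = {
        "Comment": "manual failover",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,  # seconds resolvers may cache the answer
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )
```

The TTL that matters during an outage is the one that was on the record before things broke, so it has to be lowered in advance.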
Disaster Recovery and Backups
You need to decide what level of Disaster Recovery you require. It’s usually a trade-off between the cost of the downtime to your business and the cost of implementing the recovery. You could decide that in the event of a rare outage it’s acceptable to just display a “sorry, we’re having problems” page served from an instance that you only bring up when things go wrong. If your requirement is to bring up a full copy of the site in a new zone, here are some suggestions as to how you could do this.
Amazon Elastic Block Store (EBS) supports snapshots, which are persisted to Amazon S3 and can be restored in any zone in that region. This is a great way of keeping backups if you can live with resuming from slightly older data. All you need to do is bring up a new instance in one of the fully functioning zones and attach an EBS volume created from the snapshot.
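A minimal sketch of that restore, again assuming Python with a boto3 EC2 client, might look like this. The volume ID, zone, instance ID and device name are placeholders. It ignores pagination, so it assumes the volume has only a modest number of snapshots.

```python
def restore_latest_snapshot(ec2, source_volume_id, healthy_zone, instance_id,
                            device="/dev/sdf"):
    """Create a volume from the newest completed snapshot and attach it.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    snapshots = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [source_volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )["Snapshots"]
    if not snapshots:
        raise LookupError("no completed snapshot for " + source_volume_id)

    # The newest snapshot is the one that was started most recently.
    latest = max(snapshots, key=lambda s: s["StartTime"])

    # The new volume must live in the same zone as the instance it attaches to.
    volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"], AvailabilityZone=healthy_zone
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

    ec2.attach_volume(
        VolumeId=volume["VolumeId"], InstanceId=instance_id, Device=device
    )
    return volume["VolumeId"]
```

How much data you lose is bounded by how often you snapshot, so the snapshot schedule is really the recovery point you are signing up for.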
If resuming from snapshotted data isn’t acceptable then you’d need to look at implementing your own replication of data. Almost all of the commonly used RDBMS and NoSQL systems support replication, and setting up replicas is a fairly standard operational task.