How to work around Amazon EC2 outages

Today’s Amazon EC2 outages (still ongoing at the time of writing) have meant downtime for lots of AWS customers, including household names like Quora, FourSquare and reddit. The problem is with the Elastic Compute Cloud (EC2) service in one of the availability zones in Amazon’s US East (N. Virginia) region.

Often problems like this are localised to one availability zone (datacentre) which gives you a number of ways of working round the problem.

Elastic IP Addresses

By using an Elastic IP Address you can bring up a new instance in another availability zone and then bind the Elastic IP to it. This would likely need some manual intervention from you, and you’d need to make sure that you had a recent enough backup on EBS, or a snapshot, to resume from.
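The re-binding step could be scripted along these lines. This is only a minimal sketch: it assumes `ec2` is a boto3 EC2 client (or any object with the same `associate_address` method), and the function name `move_elastic_ip` and its arguments are hypothetical names chosen for illustration.

```python
def move_elastic_ip(ec2, allocation_id, instance_id):
    """Re-associate an Elastic IP with a replacement instance.

    `ec2` is assumed to be a boto3 EC2 client. `allocation_id` identifies
    the Elastic IP and `instance_id` is the new instance you have already
    started in a healthy availability zone.
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        # Allow the address to be taken over even though it is still
        # associated with the instance in the failed zone.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

With boto3 installed and credentials configured, the client would typically come from `boto3.client("ec2", region_name="us-east-1")`. Passing the client in as an argument keeps the function easy to exercise against a stand-in object before you need it in anger.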

Elastic Load Balancing

Using an Elastic Load Balancer you can spread the load between servers in multiple availability zones. You could have, for example, one web server in each US East zone, so that the loss of one zone, like today’s, should in theory be completely transparent to your users. This would be easy to implement for a simple website, but for full redundancy of backend data (in an RDBMS, etc.) you’d need to set up appropriate data replication there too.

Low DNS TTLs

If you’re not willing to pay for Elastic IPs or Elastic Load Balancing then you could keep the TTL on your DNS records low and, in the event of an outage, manually repoint them at a new AWS instance, or at another ISP for that matter. Read more about DNS TTLs here: Using DNS TTL to control migrations
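To see why the TTL matters, here is a small sketch of the arithmetic. It assumes resolvers honour the TTL, in which case a resolver that cached the old record just before you changed it can keep serving it for up to one full TTL. The function name and the example timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def latest_stale_answer(record_changed_at, ttl_seconds):
    """Latest moment a TTL-respecting resolver may still serve the old record."""
    return record_changed_at + timedelta(seconds=ttl_seconds)


# Arbitrary example timestamp for the moment the DNS record was changed.
changed = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)

# Compare a one-minute, five-minute and one-day TTL.
for ttl in (60, 300, 86400):
    print(ttl, latest_stale_answer(changed, ttl).isoformat())
```

The TTL has to be lowered before the outage happens: a record that was handed out with a one-day TTL stays cached for up to a day no matter what you change afterwards.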

Disaster Recovery and Backups

You need to decide what level of Disaster Recovery you require. It’s usually a trade-off between the cost of the downtime to your business and the cost of implementing it. You could decide that in the event of a rare outage it’s acceptable to just display a “sorry, we’re having problems” page served from an instance that you only bring up when things go wrong. If your requirement is to bring up a full copy of the site in a new zone, here are some suggestions as to how you could do this.

Amazon Elastic Block Store (EBS) supports snapshots, which are persisted to Amazon S3 and are available across all zones in that region. This would be a great way of keeping backups if you can live with resuming from slightly older snapshotted data. All you need to do is bring up the new instance in one of the fully-functioning zones and attach an EBS volume derived from the snapshot.
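A rough sketch of that restore step follows. It assumes `ec2` is a boto3 EC2 client and that the replacement instance is already running in the same zone as the new volume (EBS volumes can only be attached within their own zone). The function name and the default device name `/dev/sdf` are assumptions for illustration.

```python
def restore_volume_from_snapshot(ec2, snapshot_id, zone, instance_id,
                                 device="/dev/sdf"):
    """Create a volume from a snapshot in a healthy zone and attach it.

    `ec2` is assumed to be a boto3 EC2 client. `zone` must be the
    availability zone that `instance_id` is running in.
    """
    # Create a fresh EBS volume in the healthy zone from the snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]

    # Block until the volume is ready; attaching earlier would fail.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the restored volume to the replacement instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

You would still need to mount the filesystem on the instance itself, and it’s worth rehearsing the whole procedure before an outage forces you to run it for real.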

If using snapshotted data isn’t acceptable then you’d need to look at implementing your own replication of data. Almost all of the commonly used RDBMS/NoSQL applications support replication and setting up replicas is fairly standard operationally.

Bug fixing: Five tricks we can learn from doctors

I had a bit of a health scare this week and a trip to A&E (ER). All’s OK now, but the trip made me realise some of the similarities between the “bug fixing” the great doctors and nurses were attempting on me and how a good engineer will address a problem. Most of these concepts work in any field of engineering, but I’m going to focus on IT Operations more specifically.

#1 Symptoms and Cause

It’s important to remember the difference between symptoms and cause. Treating backache with painkillers will be useful in the short term, but you’ve got to identify what’s causing the pain: posture, your desk chair, etc.

Making sure you understand what the root cause of the issue is should be your ultimate goal. In the short term treating the symptoms might be best to get your system back up and running quickly.

#2 Monitoring

Both trend monitoring and threshold monitoring are amazingly important when it comes to identifying and resolving issues. This is why patients are so often hooked up to pulse, ECG and blood pressure monitors, and why key readings are recorded regularly.

In engineering perhaps the CPU usage of the server you’re working on looks high: Is it normally this high? Is the trend that it’s increasing/decreasing?

Be sure to use tools like Cacti, Ganglia or Nagios and graph everything that’s service or business critical. This could include technical data like CPU usage, connection counts and cache hit rates, as well as business data like user logins, registrations and eCommerce basket value. I’d argue that having a little too much data is far better than having too little.
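The difference between the two kinds of monitoring can be shown in a few lines. This is an illustrative sketch only, not how any of the tools above work internally: the threshold check looks at the latest reading, while the trend check fits a least-squares line through recent readings taken at equal intervals. The function names and the sample figures are invented.

```python
from statistics import mean


def exceeds_threshold(samples, limit):
    """Threshold monitoring: is the most recent reading above the limit?"""
    return samples[-1] > limit


def trend_slope(samples):
    """Trend monitoring: least-squares slope, in units per sample interval.

    A positive value means the metric is rising, a negative one falling.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical CPU usage percentages sampled at regular intervals.
cpu = [35, 38, 44, 51, 60]
print(exceeds_threshold(cpu, 80), trend_slope(cpu))
```

Here the latest reading is comfortably under an 80% threshold, so a threshold-only check stays quiet, while the positive slope shows usage climbing steadily. That is exactly the “is it normally this high?” question that a graph answers at a glance.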

#3 Triage

When you’re presented with multiple problems you’ve got to identify which of them is most critical. Allow users to assign a priority, or assign one yourself in triage. Perhaps use a defect matrix to assign this according to how many users are affected, whether it’s on a production site and whether there’s a workaround or not.

This way you treat the most business critical problems first and not the ones that are most interesting!
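A defect matrix of this kind can be as simple as a handful of rules. The sketch below is one possible shape: the `Bug` fields mirror the three criteria mentioned above, while the priority levels and the 100-user cut-off are arbitrary assumptions that you’d tune for your own business.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    users_affected: int
    on_production: bool
    has_workaround: bool


def priority(bug):
    """Return 1 (most urgent) to 4 (least urgent). Thresholds are illustrative."""
    if not bug.on_production:
        return 4
    widespread = bug.users_affected >= 100  # arbitrary cut-off
    if widespread and not bug.has_workaround:
        return 1
    if widespread or not bug.has_workaround:
        return 2
    return 3


def triage(bugs):
    """Order bugs so the most business-critical come first."""
    return sorted(bugs, key=priority)


# Hypothetical queue of reported problems.
queue = [
    Bug("Typo on staging banner", 3, False, True),
    Bug("Checkout fails for all users", 5000, True, False),
    Bug("Slow report export", 20, True, True),
]
for bug in triage(queue):
    print(priority(bug), bug.title)
```

Writing the rules down, even this crudely, makes the ordering something the team can agree on in advance rather than argue about mid-incident.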

#4 Case history

Doctors will talk with you about when the problem first started and ask related questions which might help with their diagnosis. Good bug reports are often critical for you to be able to fully understand and replicate the bug. It’s important that the reporter of the bug either understands this through training or is forced to give detailed info by the reporting process. PHP’s Report a Bug page is a reasonably good example of the latter.

If you can, keep some kind of history of changes and problems relating to a device or system; it can be really valuable. An easily searchable bug/ticketing system is somewhere close to self-documenting, and I’d strongly recommend version control of all server configuration files.

#5 Double-checking

If you’re getting nowhere with a diagnosis of a problem, get a second opinion. If, after gaining a second opinion, you’re no closer to identifying the problem, then it could be worth the second engineer going through the same steps of diagnosis that you did rather than just taking your word for it. Sometimes a second set of eyes will spot something subtle that was easy to miss.

Happy bug fixing!