How to work around Amazon EC2 outages

Today’s Amazon EC2 outages (which at the time of writing are still ongoing) have meant downtime for lots of their customers including household names like Quora, FourSquare and reddit. The problem is with their Elastic Compute Cloud (EC2) service in one of the availability zones in their Eastern US N.Virginia region.

Often problems like this are localised to one availability zone (datacentre) which gives you a number of ways of working round the problem.

Elastic IP Addresses

By using an Elastic IP Address you can bring up new instance in another availability zone and then bind the Elastic IP to it. There’d likely be some manual intervention from you to do this and you’d need to make sure that you had a decent enough backup on EBS or a snapshot to resume from.

Elastic Load Balancing

Using an Elastic Load Balancer you can spread the load between servers in multiple availability zones. This could allow you to have e.g. one web server in each Eastern US zone and the loss of one zone like today should be handled transparently. This would be easy to implement with a simple website but to create full redundancy of backend data (in RDBMS, etc) you’d need to setup appropriate data replication there too. In theory this approach should allow a zone failure to be completely transparent to your users.


If you’re not willing to pay for Elastic IPs or Elastic Load Balancing then you could manually redirect traffic in the event of an outage to a new AWS instance or to another ISP for all it matters. Read more about DNS TTLs here: Using DNS TTL to control migrations

Disaster Recovery and Backups

You need to decide what level of Disaster Recovery you require. It’s usually a trade-off between the cost of the downtime to your business and the cost of implementing it. You could decide that in the event of a rare outage it’s acceptable to just display a “sorry we’re having problems page” served from an instance that you only bring up in the event of problems. If your requirement is to bring up a full copy of the site in a new zone here are suggestions as to how you could do this.

Amazon Elastic Block Store (EBS) supports snapshotting which is persisted to Amazon S3 across all zones in that region. This would be a great way of keeping backups if you can live with resuming from some slightly older snapshotted data. All you need to do is bring up the new instance in one of the fully-functioning zones and attach an EBS volume derived from the snapshot.

If using snapshotted data isn’t acceptable then you’d need to look at implementing your own replication of data. Almost all of the commonly used RDBMS/NoSQL applications support replication and setting up replicas is fairly standard operationally.

Big datasets for full-text search benchmarking

A few times recently I’ve looked for large datasets to experiment/benchmark against and I usually manage to come up blank.

I managed to spend longer than usual on this problem yesterday and came up with some which I’ll share with you.

Project Gutenberg

This project hosts the content of over 33,000 books. You can download the data as one book per file and there are full instructions for downloading/mirroring here. It seems that they’ve blocked Amazon AWS IP ranges from mirroring content from their site which is a shame.

The Westbury Lab USENET Corpus

The contents of some USENET groups have been anonymised and cleaned up to form a set containing >28M documents and >25B words. Each week’s data is stored as a single text file and each post simply delimited which makes parsing a breeze. You can easily load this data into a MySQL database with a command similar to:


It’s also available as part of Amazon’s AWS Public Datasets offering as EBS snapshot snap-­c1d156aa in the US West AWS region. Using AWS is a really quick way of getting hold of this dataset without the need to wait for any downloading to complete.

I found this a really nice dataset with each document at around 5kB, it seemed to be a sensible size for benchmarking email body text, blog/publishing posts, etc

Wikipedia Text
Wikipedia provide huge database dumps. It seems that there’s an AWS Snapshot snap-­8041f2e9 which contains this data too but it’s a couple of years old. There’s also a “WEX” extract (snap-­1781757e on AWS) created by the Freebase team which is provided as XML markup ready to be easily imported into a db table, one row per article.

In doing this research I came across a couple of v.interesting projects that extract/compile metadata from various sources including Wikipedia. They’re Freebase and DBpedia. I hope to play with some of their datasets and write a post on that in the future.

MySQL Server’s built-in profiling support

MySQL’s SHOW PROFILES command and its profiling support is something that I can’t believe I hadn’t spotted before today.

It allows you to enable profiling for a session and then record performance information about the queries executed. It shows details of the different stages in the query execution (as usually displayed in the thread state output of SHOW PROCESSLIST) and how long each of these stages took.

I’ll demonstrate using an example. First within our session we need to enable profiling, you should only do this in sessions that you want to profile as there’s some overhead in performing/recording the profiling information:

mysql> SET profiling=1;
Query OK, 0 rows affected (0.00 sec)

Now let’s run a couple of regular SELECT queries

mysql> SELECT COUNT(*) FROM myTable WHERE extra LIKE '%zkddj%';
| COUNT(*) |
|        0 | 
1 row in set (0.32 sec)

mysql> SELECT COUNT(id) FROM myTable;
| COUNT(id) |
|    513635 | 
1 row in set (0.00 sec)

Followed up with some stuff that we know’s going to execute a bit slower:

Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO foo SELECT * FROM myTable;
Query OK, 513635 rows affected (33.53 sec)
Records: 513635  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.06 sec)

Now we’ve run the queries let’s look at their summary with SHOW PROFILES

| Query_ID | Duration    | Query                                                             |
|        1 |  0.33174700 | SELECT COUNT(*) FROM myTable WHERE extra LIKE '%zkddj%'           | 
|        2 |  0.00036600 | SELECT COUNT(id) FROM myTable                                     | 
|        3 |  0.00087700 | CREATE TEMPORARY TABLE foo LIKE myTable                           | 
|        4 | 33.52952000 | INSERT INTO foo SELECT * FROM myTable                             | 
|        5 |  0.06431200 | DROP TEMPORARY TABLE foo                                          | 
5 rows in set (0.00 sec)

It’s not like any of those numbers are a surprise as we saw them from the client but it’s a handy record of the execution times and could easily be queried within an application just before the connection to a db is closed or e.g. at the end of the web request.

We can also dig deeper into each of the commands, let’s look at the first query we ran:

| Status                         | Duration |
| starting                       | 0.000033 | 
| checking query cache for query | 0.000073 | 
| Opening tables                 | 0.000013 | 
| System lock                    | 0.000007 | 
| Table lock                     | 0.000035 | 
| init                           | 0.000032 | 
| optimizing                     | 0.000014 | 
| statistics                     | 0.000016 | 
| preparing                      | 0.000014 | 
| executing                      | 0.000009 | 
| Sending data                   | 0.331296 | 
| end                            | 0.000016 | 
| end                            | 0.000003 | 
| query end                      | 0.000005 | 
| storing result in query cache  | 0.000105 | 
| freeing items                  | 0.000012 | 
| closing tables                 | 0.000007 | 
| logging slow query             | 0.000003 | 
| logging slow query             | 0.000048 | 
| cleaning up                    | 0.000006 | 
20 rows in set (0.00 sec)

Looks like almost all of the time there was spent executing the query, definitely one worth investigating further with EXPLAIN

Now let’s look at the slow INSERT sub-select we ran to see what took the time. I’ve enabled CPU profiling here too.

| Status               | Duration  | CPU_user  | CPU_system |
| starting             |  0.000069 |  0.000000 |   0.000000 | 
| checking permissions |  0.000010 |  0.000000 |   0.000000 | 
| Opening tables       |  0.000217 |  0.000000 |   0.000000 | 
| System lock          |  0.000006 |  0.000000 |   0.000000 | 
| Table lock           |  0.000014 |  0.000000 |   0.000000 | 
| init                 |  0.000041 |  0.000000 |   0.000000 | 
| optimizing           |  0.000007 |  0.000000 |   0.000000 | 
| statistics           |  0.000014 |  0.000000 |   0.000000 | 
| preparing            |  0.000013 |  0.000000 |   0.000000 | 
| executing            |  0.000006 |  0.000000 |   0.000000 | 
| Sending data         |  4.326303 |  3.544221 |   0.324020 | 
| Creating index       |  0.000029 |  0.000000 |   0.000000 | 
| Repair by sorting    | 29.202254 | 17.133071 |  11.616726 | 
| Saving state         |  0.000040 |  0.000000 |   0.000000 | 
| Creating index       |  0.000007 |  0.000000 |   0.000000 | 
| Sending data         |  0.000389 |  0.000000 |   0.000000 | 
| end                  |  0.000009 |  0.000000 |   0.000000 | 
| end                  |  0.000012 |  0.000000 |   0.000000 | 
| query end            |  0.000006 |  0.000000 |   0.000000 | 
| freeing items        |  0.000015 |  0.000000 |   0.000000 | 
| closing tables       |  0.000007 |  0.000000 |   0.000000 | 
| logging slow query   |  0.000005 |  0.000000 |   0.000000 | 
| logging slow query   |  0.000040 |  0.000000 |   0.000000 | 
| cleaning up          |  0.000007 |  0.000000 |   0.000000 | 
24 rows in set (0.00 sec)

It seems that building the indexes for the new table were what took the time. The General Thread States page of the MySQL Documentation is a useful reference. Interestingly we can see the “logging slow query” state here too, something that sails by too quickly to ever see when looking at SHOW PROCESSLIST output.

This profiling support doesn’t fulfil the same role as MySQL’s EXPLAIN command and is only useful in some places but if you were to look at implementing profiling or instrumentation for your app this could be really handy.

You can find the full documentation for MySQL’s profiling support at SHOW PROFILES Syntax it appears to be supported from at least MySQL 5.0 and it’s worth noting is only available in the MySQL Community (non-Enterprise) builds.

Bug fixing: Five tricks we can learn from doctors

I had a bit of a health scare this week and a trip to A+E (ER). All’s OK now but the trip made me realise some of the similarities of the “bug fixing” the great doctors/nurses were attempting on me and how a good engineer will address a problem. Most of these concepts work in any field of engineering but I’m going to focus down on IT Operations more specifically.

#1 Symptoms and Cause

It’s important to remember the difference between symptoms and cause. Treating backache with pain killers will be useful in the short term but you’ve got to identify what’s causing the pain: posture, your desk chair, etc

Making sure you understand what the root cause of the issue is should be your ultimate goal. In the short term treating the symptoms might be best to get your system back up and running quickly.

#2 Monitoring

Both trend monitoring and threshold monitoring are amazingly important when it comes to identifying and resolving issues. This is why patients are so often hooked up to pulse, ECG, blood pressure monitors and why key readings are recorded regularly.

In engineering perhaps the CPU usage of the server you’re working on looks high: Is it normally this high? Is the trend that it’s increasing/decreasing?

Be sure to use tools like Cacti, Ganglia or Nagios and graph everything that’s service or business critical. This could include technical data like CPU usage, connection counts, cache hit rates as well as business data like: user logins, registrations, eCommerce basket value. I’d argue that having a little too much data is far better than having too little.

#3 Triage

When you’re presented with multiple problems you’ve got to identify which of them is more critical? Allow users to assign priority or assign one yourself in triage. Perhaps use a defect matrix to assign this according to how many users are effected, whether it’s on a production site, whether there’s a workaround or not.

This way you treat the most business critical problems first and not the ones that are most interesting!

#4 Case history

Doctors will talk with you about when this problem first started and ask related questions which might be of use with their diagnosis. Good bug reports are often critical for you to be able to fully understand and replicate the bug. It’s important that the reporter of the bug understands this through training or are forced to give detailed info in the reporting process. PHP’s Report a Bugpage is a reasonably good example of the latter.

If you can keeping some kind of history of changes/problems relating to a device or system can be really valuable. A well searchable bug/ticketing system is somewhere close to self-documenting and I’d strongly recommend version control of all server configuration files.

#5 Double-checking

If you’re getting nowhere with a diagnosis of a problem get a second opinion. If, after gaining a second opinion you’re no closer to identifying the problem then it could be worth the second engineer going through the same steps of diagnosis that you did and not just taking your word for it. Sometimes a second set of eyes will spot something subtle that was easy to miss.

Happy bug fixing!

Using DNS TTL to control migrations

Often when you’re moving services from one piece of hardware/location to another it will involve a DNS change. From my experience the DNS change is usually the final change that’s used to move the traffic.

DNS entries can have TTLs. TTL means “time to live” and is the expiry time of the record. For a normal running website you could expect a TTL of 86400 (seconds) or one day. This means that once a DNS server or other DNS client has requested the record it’ll hold onto a cached copy for up to a day before re-requesting it.

If you were to leave your TTL at 86400 and change the DNS entry to point to your new server it could take up to a day for the changeover to happen.

Let’s consider two common use cases:

You want the DNS switchover to happen quickly

Assuming you have a TTL of 86400 make sure you reduce the TTL at least before you want to perform the migration.

I would normally change the TTL to 3600 (1 hour) the day before the planned migration. Then at least one hour before the migration time reduce the TTL down to 600 secs (10 mins). Then at least 10 mins before down to 60 (1 min).

Then make the change and your traffic should flip over to the new IP address pretty quickly and if anything goes wrong you can change the entry back and all traffic should fail back within a minute.

After you’re confident all is working well from the new host you should increase the TTL back up to it’s normal setting (in steps if you want).

You want to move the traffic gradually over to a new service

If you’re not too fussed about which server the traffic hits (e.g. serving static content from file-synced servers) then you might want the traffic to move over gradually. This is a nice approach if you want more of a “soft launch” and don’t want to risk something bad happening to 100% of your traffic if there are problems on the new hardware.

In this case a larger TTL might be desirable. I’d probably go for an hour but it really depends on the situation.

You’d follow similar steps to the ones described above slowly reducing down the TTL till it’s at the value you want. Now modify the DNS record to point to your service but as you do this set a low TTL on the new record.

The low TTL on the new record won’t effect the speed that the new record rolls out but it does mean that if you need to fail back then entries should be re-cached quickly.

When everything’s failed over nicely increase the TTL again.

Why not keep my TTL low all the time then?

You can do but it’s generally not accepted as good practice. It’ll generate much more DNS traffic to your authoritative DNS servers too as other resolvers will need to re-cache entries.


These might catch you out:

  • Lots of OSes/browsers will cache a DNS entry for a minimum of 30 minutes so TTLs less than this might not be respected
  • Some caching name servers ignore the published TTL and will apply their own minimum (this is out of RFC and really frustrating)

MERGE table gotcha with PRIMARY KEY

I know that MERGE isn’t everyone’s favourite storage engine but we use them quite extensively and ran into a problem the other day that I thought I’d share.

A common usage pattern is to CREATE TEMPORARY TABLE LIKE an existing MyISAM table and then ALTER TABLE to turn it into a MERGE table and specify it’s UNION. Like this:


It’s a pattern we’ve been using for ages and seems to work well in most cases.

In one application we keep aggregated data in daily tables (inspired by this article). Within the application we work out which of the daily tables we need and build them into a temporary merge table before querying it. There’s more to it than that (there’s a pyramid of daily/monthly/yearly tables) but they’re not relevant right now.

You can replicate our usage pattern (and the bug we ran into) with this SQL:

CREATE TABLE clicks_20101004 (
   user_id INT,
   total_clicks INT,
   PRIMARY KEY (user_id)
INSERT INTO clicks_20101004 VALUES (1,10), (2,10);

CREATE TABLE clicks_20101005 (
   user_id INT,
   total_clicks INT,
   PRIMARY KEY (user_id)
INSERT INTO clicks_20101005 VALUES (1,10), (2,10), (3,10);

CREATE TABLE merge_demo LIKE clicks_20101005;
ALTER TABLE merge_demo ENGINE=MERGE, UNION=(clicks_20101004, clicks_20101005);

The table was created fine and querying the data it looks like everything’s happy:

mysql> SELECT * FROM merge_demo;
| user_id | total_clicks |
|       1 |           10 | 
|       2 |           10 | 
|       1 |           10 | 
|       2 |           10 | 
|       3 |           10 | 
5 rows in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo GROUP BY user_id;
| user_id | SUM(total_clicks) |
|       1 |                20 | 
|       2 |                20 | 
|       3 |                10 | 
3 rows in set (0.00 sec)

All OK so far. Let’s try another query:

mysql> SELECT * FROM merge_demo WHERE user_id=1;
| user_id | total_clicks |
|       1 |           10 | 
1 row in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo WHERE user_id=2 GROUP BY user_id;
| user_id | SUM(total_clicks) |
|       2 |                10 | 
1 row in set (0.00 sec)

Neither of these give the result we’d expect and it looks like it has something to do with the WHERE clause. The problem is the PRIMARY key that’s been defined in the parent MERGE. The documentation does talk about this and the inability for MERGE tables to enforce PRIMARY keys across their underlying tables.

If we want to still take advantage of the index in the user_id column we can fix it but replacing the PRIMARY key with a normal index like this:

mysql> ALTER TABLE merge_demo DROP PRIMARY KEY, ADD KEY (user_id);
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Trying our problem queries again they seem to work fine:

mysql> SELECT * FROM merge_demo WHERE user_id=1;
| user_id | total_clicks |
|       1 |           10 | 
|       1 |           10 | 
2 rows in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo WHERE user_id=2 GROUP BY user_id;
| user_id | SUM(total_clicks) |
|       2 |                20 | 
1 row in set (0.00 sec)

In terms of a generic fix in your application for this type of usage pattern I’d suggest you read in the output of SHOW CREATE TABLE a; and string substitute out the PRIMARY KEY, you could even tie in the ENGINE and UNION too and reduce the requirement for any ALTER TABLE to be performed.

Using lsof to get information about open files

lsof allows you to get a list of all open files and the processes that have opened them. It’s a useful tool to have in the sysadmin or linux hack’s toolbox.

Here are some example situations which should show you how useful it can be.

Which process is using this file?

This can be handy for cases like finding out which process/user is logging to a given log file, who is editing a given file, etc

[root@www ~]# lsof /var/log/squid/cache.log
squid   18671 squid    5u   REG  253,0 3073 784030 /var/log/squid/cache.log

or the lazier:

[root@www ~]# lsof | grep /var/log/squid/cache.log
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log

Which process is still using that directory?

Useful for finding that pesky process which has an open file handle a filesystem you’re trying to unmount.

[root@www ~]# lsof +D /var/log/squid
squid   18671 squid    5u   REG  253,0 3073 784030 /var/log/squid/cache.log
squid   18671 squid    8w   REG  253,0 4041 784031 /var/log/squid/store.log

or, again, the lazy option:

[root@www ~]# lsof | grep /var/log/squid
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log
squid     18671     squid    8w      REG              253,0     4041     784031 /var/log/squid/store.log

Where does process X log to?

You can use the command along with some grep messing to do:

[root@www ~]# lsof | grep squid | grep log
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log
squid     18671     squid    8w      REG              253,0     4041     784031 /var/log/squid/store.log

Why hasn’t disk space been freed up after that delete?

[root@www ~]# lsof | grep deleted
httpd      1519      root   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9515    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9516    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9517    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)