Bug fixing: Five tricks we can learn from doctors

I had a bit of a health scare this week and a trip to A+E (ER). All’s OK now, but the trip made me realise some of the similarities between the “bug fixing” the great doctors and nurses were attempting on me and the way a good engineer addresses a problem. Most of these concepts work in any field of engineering, but I’m going to focus more specifically on IT operations.

#1 Symptoms and Cause

It’s important to remember the difference between symptoms and cause. Treating backache with painkillers will be useful in the short term, but you’ve got to identify what’s causing the pain: posture, your desk chair, etc.

Understanding the root cause of the issue should be your ultimate goal. In the short term, though, treating the symptoms might be the best way to get your system back up and running quickly.

#2 Monitoring

Both trend monitoring and threshold monitoring are amazingly important when it comes to identifying and resolving issues. This is why patients are so often hooked up to pulse, ECG and blood pressure monitors, and why key readings are recorded regularly.

In engineering, perhaps the CPU usage of the server you’re working on looks high: is it normally this high? Is it trending up or down?

Be sure to use tools like Cacti, Ganglia or Nagios and graph everything that’s service- or business-critical. This could include technical data like CPU usage, connection counts and cache hit rates, as well as business data like user logins, registrations and eCommerce basket value. I’d argue that having a little too much data is far better than having too little.
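As a concrete example, a basic Nagios service check for CPU load might look something like this (just a sketch: the web01 host name, the generic-service template and the check_nrpe command are assumptions about your particular setup):

define service {
    use                 generic-service       ; assumed base template
    host_name           web01                 ; hypothetical web server
    service_description CPU Load
    check_command       check_nrpe!check_load ; remote check via NRPE
}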

#3 Triage

When you’re presented with multiple problems you’ve got to identify which of them is most critical. Allow users to assign a priority, or assign one yourself in triage. Perhaps use a defect matrix to assign it according to how many users are affected, whether it’s on a production site, and whether or not there’s a workaround.

This way you treat the most business-critical problems first and not the ones that are most interesting!

#4 Case history

Doctors will talk with you about when the problem first started and ask related questions which might be of use in their diagnosis. Good bug reports are often critical for you to be able to fully understand and replicate the bug. It’s important that the reporter of the bug understands this through training, or is forced to give detailed info in the reporting process. PHP’s Report a Bug page is a reasonably good example of the latter.

If you can, keeping some kind of history of changes/problems relating to a device or system can be really valuable. A well-searchable bug/ticketing system is somewhere close to self-documenting, and I’d strongly recommend version control of all server configuration files.
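Even something as simple as this gives you a searchable history of configuration changes (a minimal sketch; tools like etckeeper automate the same idea):

# one-off setup (as root): put /etc under version control
cd /etc
git init
git add .
git commit -m "Initial import of server config"

# then after every change (the message is just an example; new files need a git add first)
git commit -am "Increase Apache MaxClients"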

#5 Double-checking

If you’re getting nowhere with a diagnosis of a problem, get a second opinion. If, after gaining a second opinion, you’re still no closer to identifying the problem, it could be worth the second engineer going through the same steps of diagnosis that you did rather than just taking your word for it. Sometimes a second set of eyes will spot something subtle that was easy to miss.

Happy bug fixing!

Using DNS TTL to control migrations

Often when you’re moving services from one piece of hardware or location to another, it will involve a DNS change. In my experience the DNS change is usually the final change that’s used to move the traffic.

DNS entries can have TTLs. TTL means “time to live” and is the expiry time of the record. For a normal running website you could expect a TTL of 86400 seconds, or one day. This means that once a DNS server or other DNS client has requested the record, it’ll hold onto a cached copy for up to a day before re-requesting it.
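You can see the TTL in action with dig: query your local resolver and the second column of the answer shows the number of seconds left before the cached copy expires, counting down on repeated queries (www.example.com and 192.0.2.10 are placeholders here):

$ dig www.example.com A

;; ANSWER SECTION:
www.example.com.    86400    IN    A    192.0.2.10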

If you were to leave your TTL at 86400 and change the DNS entry to point to your new server, it could take up to a day for the changeover to happen.

Let’s consider two common use cases:

You want the DNS switchover to happen quickly

Assuming you have a TTL of 86400, make sure you reduce the TTL at least a day before you want to perform the migration (existing cached copies can live for up to the old TTL).

I would normally change the TTL to 3600 (1 hour) the day before the planned migration. Then, at least one hour before the migration, reduce the TTL down to 600 seconds (10 mins). Then, at least 10 mins before, reduce it down to 60 (1 min).

Then make the change, and your traffic should flip over to the new IP address pretty quickly. If anything goes wrong you can change the entry back, and all traffic should fail back within a minute.
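In a BIND-style zone file the sequence might look something like this (a sketch: the names and RFC 5737 example addresses are placeholders, and remember to bump the zone’s serial number on each change):

; day before the migration: shorten the TTL on the old record
www    3600    IN    A    192.0.2.10

; at cutover: point at the new server, keeping the TTL short
www    60      IN    A    198.51.100.20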

After you’re confident all is working well from the new host you should increase the TTL back up to its normal setting (in steps if you want).

You want to move the traffic gradually over to a new service

If you’re not too fussed about which server the traffic hits (e.g. serving static content from file-synced servers) then you might want the traffic to move over gradually. This is a nice approach if you want more of a “soft launch” and don’t want to risk something bad happening to 100% of your traffic if there are problems on the new hardware.

In this case a larger TTL might be desirable. I’d probably go for an hour but it really depends on the situation.

You’d follow similar steps to the ones described above, slowly reducing the TTL until it’s at the value you want. Now modify the DNS record to point to your new service, but as you do this set a low TTL on the new record.

The low TTL on the new record won’t affect the speed that the new record rolls out, but it does mean that if you need to fail back then entries should be re-cached quickly.
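Again in BIND-style syntax, a gradual move might look something like this (same placeholder names and addresses as above):

; before: an hour-long TTL, so cached copies of the old record drain away slowly
www    3600    IN    A    192.0.2.10

; after: the new address, with a short TTL so that failing back would propagate quickly
www    600     IN    A    198.51.100.20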

When everything’s failed over nicely, increase the TTL again.

Why not keep my TTL low all the time then?

You can, but it’s generally not considered good practice. It’ll also generate much more DNS traffic to your authoritative DNS servers, as other resolvers will need to re-cache entries much more often.

Exceptions

These might catch you out:

  • Lots of OSes/browsers will cache a DNS entry for a minimum of 30 minutes, so TTLs lower than this might not be respected
  • Some caching name servers ignore the published TTL and apply their own minimum (this violates the RFC and is really frustrating)

MERGE table gotcha with PRIMARY KEY

I know that MERGE isn’t everyone’s favourite storage engine, but we use MERGE tables quite extensively and ran into a problem the other day that I thought I’d share.

A common usage pattern is to CREATE TEMPORARY TABLE LIKE an existing MyISAM table and then ALTER TABLE to turn it into a MERGE table and specify its UNION. Like this:

CREATE TABLE tmpMerge LIKE a;
ALTER TABLE tmpMerge ENGINE=MERGE, UNION=(a,b);

It’s a pattern we’ve been using for ages and seems to work well in most cases.

In one application we keep aggregated data in daily tables (inspired by this article). Within the application we work out which of the daily tables we need and build them into a temporary merge table before querying it. There’s more to it than that (there’s a pyramid of daily/monthly/yearly tables) but that’s not relevant right now.

You can replicate our usage pattern (and the bug we ran into) with this SQL:

CREATE TABLE clicks_20101004 (
   user_id INT,
   total_clicks INT,
   PRIMARY KEY (user_id)
) ENGINE=MyISAM;
INSERT INTO clicks_20101004 VALUES (1,10), (2,10);

CREATE TABLE clicks_20101005 (
   user_id INT,
   total_clicks INT,
   PRIMARY KEY (user_id)
) ENGINE=MyISAM;
INSERT INTO clicks_20101005 VALUES (1,10), (2,10), (3,10);

CREATE TABLE merge_demo LIKE clicks_20101005;
ALTER TABLE merge_demo ENGINE=MERGE, UNION=(clicks_20101004, clicks_20101005);

The table is created fine and, querying the data, it looks like everything’s happy:

mysql> SELECT * FROM merge_demo;
+---------+--------------+
| user_id | total_clicks |
+---------+--------------+
|       1 |           10 | 
|       2 |           10 | 
|       1 |           10 | 
|       2 |           10 | 
|       3 |           10 | 
+---------+--------------+
5 rows in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo GROUP BY user_id;
+---------+-------------------+
| user_id | SUM(total_clicks) |
+---------+-------------------+
|       1 |                20 | 
|       2 |                20 | 
|       3 |                10 | 
+---------+-------------------+
3 rows in set (0.00 sec)

All OK so far. Let’s try another query:

mysql> SELECT * FROM merge_demo WHERE user_id=1;
+---------+--------------+
| user_id | total_clicks |
+---------+--------------+
|       1 |           10 | 
+---------+--------------+
1 row in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo WHERE user_id=2 GROUP BY user_id;
+---------+-------------------+
| user_id | SUM(total_clicks) |
+---------+-------------------+
|       2 |                10 | 
+---------+-------------------+
1 row in set (0.00 sec)

Neither of these gives the result we’d expect, and it looks like it has something to do with the WHERE clause. The problem is the PRIMARY KEY that’s been defined on the parent MERGE table: a MERGE table can’t enforce uniqueness across its underlying tables, so when MySQL uses what it believes is a unique index to satisfy the WHERE clause, it stops as soon as it finds one matching row. The documentation does talk about this limitation.

If we still want to take advantage of the index on the user_id column, we can fix it by replacing the PRIMARY KEY with a normal index like this:

mysql> ALTER TABLE merge_demo DROP PRIMARY KEY, ADD KEY (user_id);
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Trying our problem queries again, they now work fine:

mysql> SELECT * FROM merge_demo WHERE user_id=1;
+---------+--------------+
| user_id | total_clicks |
+---------+--------------+
|       1 |           10 | 
|       1 |           10 | 
+---------+--------------+
2 rows in set (0.00 sec)

mysql> SELECT user_id, SUM(total_clicks) FROM merge_demo WHERE user_id=2 GROUP BY user_id;
+---------+-------------------+
| user_id | SUM(total_clicks) |
+---------+-------------------+
|       2 |                20 | 
+---------+-------------------+
1 row in set (0.00 sec)

In terms of a generic fix in your application for this type of usage pattern, I’d suggest reading in the output of SHOW CREATE TABLE a; and string-substituting out the PRIMARY KEY. You could even splice in the ENGINE and UNION at the same time and remove the need for any ALTER TABLE to be performed, so you end up issuing a single statement like the one below.
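For the demo tables above, the statement your application ends up building could look something like this (a sketch based on the example schema; note the plain KEY where the PRIMARY KEY used to be):

CREATE TABLE merge_demo (
   user_id INT,
   total_clicks INT,
   KEY (user_id)
) ENGINE=MERGE UNION=(clicks_20101004, clicks_20101005);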