What makes a good MySQL index? Part 1: Column Size

We all know that it’s good practice for queries to use indexes but it’s not that clear-cut, there are good indexes and poor indexes. By making indexes more efficient you stand to increase query speeds and the places that MySQL can use the indexes effectively.

Column Size

The size of the column or columns you’re indexing is relevant because the less data the database server has to search through or index the faster it’ll be and the less storage you’ll use on disk. In this first post we’ll just be concentrating on column size optimizations.

There’s a full list of all data types and how much storage they require in the MySQL documentation: Data Type Storage Requirements

Common places where the wrong data type could be specified include:

Numeric types

This is probably the biggest offender and most mis-understood. Use the table on the Numeric Types page in the MySQL documentation to work out which the smallest field type is. For example if you were storing a person’s age TINYINT UNSIGNED with a range of 0-255 would make sense instead of wasting 3 more bytes by using and INT field.

Date and Time Types

If you’re storing a DATE (3 bytes) don’t store it in a DATETIME (8 bytes) column.

Many times the data you’re storing in a DATETIME (8 bytes) field could fit in a TIMESTAMP (4 bytes) field though they don’t behave in exactly the same way. Check the documentation for details.

Enumerated values

You might be storing string values that have a really low entropy. Let’s say you have an office field in your database that could either be “London” or “Paris”. It makes more sense to keep it in an ENUM(“London”, “Paris”) (1 byte) than a VARCHAR()

IP addresses

An IPv4 address is just 4 bytes of data but to make it more human readable we usually talk about it in four integers of 0-255. A good way to store this data would be in an INT UNSIGNED and to use the handy INET_ATON() and INET_NTOA() functions to convert the INT data to/from a human readable IP address.

MD5, SHA1 or other hashes

As with IP addresses the hexadecimal representation of the hash we’re used to is just to make it more human readable. To illustrate this an MD5 value could either be stored in a VARCHAR(32) or a much more efficient BINARY(16), an SHA1 will fit into a BINARY(20). You can use the functions HEX() and UNHEX() to cast to/from the hex/binary format.

PROCEDURE ANALYSE

PROCEDURE ANALYSE is built into MySQL and can help you a lot with choosing the best data type if the table/data exists already.

Let’s use wp_options a table which is shipped as part of WordPress as an example. Here’s the table structure:

CREATE TABLE `wp_options` (
`option_id` bigint(20) unsigned NOT NULL auto_increment,
`blog_id` int(11) NOT NULL default '0',
`option_name` varchar(64) NOT NULL default '',
`option_value` longtext NOT NULL,
`autoload` varchar(20) NOT NULL default 'yes',
PRIMARY KEY  (`option_id`,`blog_id`,`option_name`),
KEY `option_name` (`option_name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

and here’s how we’d analyse selected columns in there:

mysql> SELECT option_id, blog_id, autoload FROM wp_options PROCEDURE ANALYSE() \G
*************************** 1. row ***************************
Field_name: foo.wp_options.option_id
Min_value: 1
Max_value: 3860
Min_length: 1
Max_length: 4
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 1169.5359
Std: 1416.1368
Optimal_fieldtype: SMALLINT(4) UNSIGNED NOT NULL
*************************** 2. row ***************************
Field_name: foo.wp_options.blog_id
Min_value: 0
Max_value: 0
Min_length: 1
Max_length: 1
Empties_or_zeros: 181
Nulls: 0
Avg_value_or_avg_length: 0.0000
Std: 0.0000
Optimal_fieldtype: ENUM('0') NOT NULL
*************************** 3. row ***************************
Field_name: foo.wp_options.autoload
Min_value: no
Max_value: yes
Min_length: 2
Max_length: 3
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 2.7514
Std: NULL
Optimal_fieldtype: ENUM('no','yes') NOT NULL
3 rows in set (0.00 sec)

It looks like there were some good recommendations from it for the option_id and autoload columns but that the blog_id field it got wrong. Presumably there’s just one blog with id=0 in there at the moment so it’s understandable.

3 Responses to What makes a good MySQL index? Part 1: Column Size

  1. Mike Pearce says:

    Great post. I’ve found that slow queries should be run through EXPLAIN. Gives you the order that MySQL will use the tables (if joins), columns and you can then index accordingly. I’ve never seen PROCEDURE ANALYSE() though (Nice to see it’s the UK spelling ANALYSE())

    Nothing like getting the coltypes right first though. 😉

  2. James Cohen says:

    EXPLAIN is definitely your friend and probably worthy of a post on here.

    MySQL has a bit of inconsistency with “PROCEDURE ANALYSE” and “ANALYZE TABLE” both being valid!

  3. I would like to add a caveat that such optimizations, if made too early into your database, can cripple the application. For instance, if we use the ENUM type and later find that we need to add more options to it, it’s not going to be pretty. Certain datatypes also don’t play too well with Web frameworks, which much address a large number of databases and can’t rely on particular quirks of a database. So, good to know all this, but always keep in mind the consequences.

Leave a comment