What makes a good MySQL index? Part 2: Cardinality

edit: added “Low cardinality isn’t always bad” section after Morgan’s comment

As we’ve seen already column size is important for indexes. Cardinality is really important too, it’s the uniqueness of the values included in the index.

Indexes are used by MySQL (and in any RDBMS) to help find rows quickly. We want to make it as easy as possible for MySQL to find the relevant rows, the more precise or specific we are the less the number of rows MySQL has to fetch.

Example Table

For this post I’ve created a table with some dummy data in it. Here’s the SQL I used to create it (in case you want to) which was generated with this hack script.

The table is fairly simple. Note the three indexes: PRIMARY plus two composite indexes on (sex, age) and (age, sex)

CREATE TABLE `user` (
  `id` mediumint(9) NOT NULL auto_increment,
  `name` varchar(20) default NULL,
  `age` tinyint(3) unsigned default NULL,
  `sex` enum('MALE','FEMALE') default NULL,
  PRIMARY KEY  (`id`),
  KEY `age_sex` (`age`,`sex`),
  KEY `sex_age` (`sex`,`age`)
) ENGINE=MyISAM AUTO_INCREMENT=4917 DEFAULT CHARSET=utf8

If we run ANALYZE TABLE on the table and then SHOW INDEXES we can see the following:

mysql> SHOW INDEXES FROM user;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| user  |          0 | PRIMARY  |            1 | id          | A         |        4916 |     NULL | NULL   |      | BTREE      |         |
| user  |          1 | age_sex  |            1 | age         | A         |         100 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | age_sex  |            2 | sex         | A         |         196 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | sex_age  |            1 | sex         | A         |           2 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | sex_age  |            2 | age         | A         |         196 |     NULL | NULL   | YES  | BTREE      |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.00 sec)

Assuming the male:female ratio in our dataset is 50:50 then a query with WHERE sex="MALE" is only ever going to help us find 50% of the rows. MySQL will then have to sift through that 50% of rows. Let’s demonstrate that:

mysql> EXPLAIN SELECT id FROM user WHERE sex="MALE";
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | sex_age       | sex_age | 2       | const | 3335 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

If age has a fairly even distribution across the age range you could use a query with WHERE age=31 and MySQL should be able to limit the relevant rows down to just ~1% of the dataset. This will mean that any further searching, either using the next part of a composite index or analysis or scanning the records will be much faster. Again EXPLAIN proves this:

mysql> EXPLAIN SELECT id FROM user WHERE age=32;
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | age_sex       | age_sex | 2       | const |    5 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

Cardinality and composite indexes

Keeping with our example dataset if we commonly query the table knowing the age and sex of the people. A typical query might look like:

SELECT id FROM people WHERE age=31 AND sex="MALE";

We’ve got the two composite indexes on the table (age, sex) and (sex, age) but which one will work best for this results. Based around everything learnt so far in this post (age, sex) should do. First let’s prove that the MySQL Optimiser agrees:

mysql> EXPLAIN SELECT id FROM user WHERE age=32 AND sex="MALE";
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys   | key     | key_len | ref         | rows | Extra       |
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
|  1 | SIMPLE      | user  | ref  | age_sex,sex_age | age_sex | 4       | const,const |    2 | Using where |
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
1 row in set (0.00 sec)

How do I know what cardinality my data has?

If you’re looking at indexes that are there already ANALYZE TABLE followed by SHOW INDEXES FROM user as above will do the trick. If the data isn’t there yet you can do a simple SELECT DISTINCT like this:

mysql> SELECT COUNT(DISTINCT age) AS cardinality FROM user;
+-------------+
| cardinality |
+-------------+
|         100 |
+-------------+
1 row in set (0.00 sec)

Why isn’t MySQL using my index?

Just because you have the index there it doesn’t mean that MySQL will decide to use it. Have a look at the following similar queries:

mysql> EXPLAIN SELECT id FROM user WHERE age>80;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | range | age_sex       | age_sex | 2       | NULL |  764 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT id FROM user WHERE age<80;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | ALL  | age_sex       | NULL | NULL    | NULL | 4916 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

The first query uses the index but the second one chooses to do a table scan. That’s no co-incidence, MySQL is smart enough to know about the distribution of the data and decides that it’d be faster to just munch through all of the data rather than the 80% that it could return from the index lookup. When our query is restrictive however with the >80 query it knows it’ll be better to use the index. Most of the time the optimizer gets it right!

There’s something very important to learn from this issue of distribution of data. Unless you’re checking query optimisation with real world queries and a real world dataset then you’re not optimising properly! In some cases it could pay to create some sample data (if you can predict what it’ll be like) with a simple script like the one I spun up for the user table we used as an example.

Low cardinality isn’t always bad

It could be that although the cardinality is low the selectivity is still high because of the distribution of data. Let’s change the characteristics of the sex column in the table:

mysql> UPDATE user SET SEX="MALE";
Query OK, 2503 rows affected (0.58 sec)
Rows matched: 4916  Changed: 2503  Warnings: 0

mysql> UPDATE user SET SEX="FEMALE" WHERE id<10;
Query OK, 9 rows affected (0.00 sec)
Rows matched: 9  Changed: 9  Warnings: 0

mysql> SELECT sex, COUNT(*) FROM user GROUP BY sex;
+--------+----------+
| sex    | COUNT(*) |
+--------+----------+
| MALE   |     4907 | 
| FEMALE |        9 | 
+--------+----------+
2 rows in set (0.00 sec)

We’ve now got low cardinality (just two possible values) but if we’re looking for just "FEMALE" rows then they should be easy to find with this index. Again EXPLAIN agrees:

mysql> EXPLAIN SELECT id FROM user WHERE sex="MALE";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | ALL  | sex_age       | NULL | NULL    | NULL | 3687 | Using where | 
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.11 sec)

mysql> EXPLAIN SELECT id FROM user WHERE sex="FEMALE";
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | sex_age       | sex_age | 2       | const |    9 | Using where | 
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)
Advertisements

About James Cohen
LAMP geek with interests in building scalable web applications

11 Responses to What makes a good MySQL index? Part 2: Cardinality

  1. Shlomi Noach says:

    Another source of information is Percona Server’s INNODB_INDEX_STATS addition to the INFORMATION_SCHEMA.

    It shows you the number of rows InnoDB estimates per column entry per indexed column.
    Use:

    select * from information_schema.INNODB_INDEX_STATS;

  2. There’s really two concepts here: cardinality and selectivity. An index having low cardinality doesn’t necessarily mean it will be useless.

    I wrote about this here:
    http://www.mysqlperformanceblog.com/2009/10/16/how-not-to-find-unused-indexes/

  3. Dieter_be says:

    Great post. Thank you!

  4. It’s great,

    I was search about what the percentage of data that MySQL decide to use or not to use the index and you mention it’s 80%, i want to know Are you get this by try or is it documented?

    Regards,

    • James Cohen says:

      Hi Mohammad,

      80% is not a magic number there. I’m not sure how the weighting in the optimizer works as there are many factors. E.g. for a table of 2 rows it might be cheaper to perform a table scan than even to work out the best index to use!

      There could be some information here: MySQL_Internals_Optimizer

  5. Very interesting! Thank you for sharing!

  6. dhaval says:

    I have a related question about table design. for example, I have a users table, wants to add a flag ‘reviewed’ . there will be only 3% of users reviewed. should I add a flag (and create index) for this information to users table or create a new table ‘non_reviewed_users’ which contain only user_id (references to users(id))?.

    which option would be better ?

    • James Cohen says:

      Having a column to indicate the reviewed flag would make most sense to me. It’s important to understand how you might query this column to understand how best to index it.

  7. Roch says:

    Thanks!

  8. tempestnathan says:

    Another situation where low cardinality isn’t necessarily bad is when you’re considering range queries. For example, say you have .. age and weight, where age is as you have here, and weight is a float with 2 decimal precision. And say your queries look something like “WHERE age > 30 AND age 150 AND weight < 155". For any decent-sized table, your age index is going ot have orders of magnitude lower cardinality than weight, but a composite index on (age,weight) will be far more effective here than on (weight,age), because it will only need to do a single lookup of the second level of the index for each value of age matched at the first level. If, on the other hand, you used (weight, age), it would have to individually check the age value for every single matching weight. The composite index would still provide a slight benefit since it could read this data sequentially from the index rather than from random places in the table, but it will still be much slower than having the lower cardinality value first.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: