Using lsof to get information about open files

lsof allows you to get a list of all open files and the processes that have opened them. It’s a useful tool to have in the sysadmin or linux hack’s toolbox.

Here are some example situations which should show you how useful it can be.

Which process is using this file?

This can be handy for cases like finding out which process/user is logging to a given log file, who is editing a given file, etc

[root@www ~]# lsof /var/log/squid/cache.log
COMMAND   PID  USER   FD   TYPE DEVICE SIZE   NODE NAME
squid   18671 squid    5u   REG  253,0 3073 784030 /var/log/squid/cache.log

or the lazier:

[root@www ~]# lsof | grep /var/log/squid/cache.log
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log

Which process is still using that directory?

Useful for finding that pesky process which has an open file handle a filesystem you’re trying to unmount.

[root@www ~]# lsof +D /var/log/squid
COMMAND   PID  USER   FD   TYPE DEVICE SIZE   NODE NAME
squid   18671 squid    5u   REG  253,0 3073 784030 /var/log/squid/cache.log
squid   18671 squid    8w   REG  253,0 4041 784031 /var/log/squid/store.log

or, again, the lazy option:

[root@www ~]# lsof | grep /var/log/squid
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log
squid     18671     squid    8w      REG              253,0     4041     784031 /var/log/squid/store.log

Where does process X log to?

You can use the command along with some grep messing to do:

[root@www ~]# lsof | grep squid | grep log
squid     18671     squid    5u      REG              253,0     3073     784030 /var/log/squid/cache.log
squid     18671     squid    8w      REG              253,0     4041     784031 /var/log/squid/store.log

Why hasn’t disk space been freed up after that delete?

[root@www ~]# lsof | grep deleted
httpd      1519      root   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9515    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9516    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)
httpd      9517    apache   25w      REG              253,0   200918     688282 /usr/local/apache/logs/mapping-access_log (deleted)

What makes a good MySQL index? Part 2: Cardinality

edit: added “Low cardinality isn’t always bad” section after Morgan’s comment

As we’ve seen already column size is important for indexes. Cardinality is really important too, it’s the uniqueness of the values included in the index.

Indexes are used by MySQL (and in any RDBMS) to help find rows quickly. We want to make it as easy as possible for MySQL to find the relevant rows, the more precise or specific we are the less the number of rows MySQL has to fetch.

Example Table

For this post I’ve created a table with some dummy data in it. Here’s the SQL I used to create it (in case you want to) which was generated with this hack script.

The table is fairly simple. Note the three indexes: PRIMARY plus two composite indexes on (sex, age) and (age, sex)

CREATE TABLE `user` (
  `id` mediumint(9) NOT NULL auto_increment,
  `name` varchar(20) default NULL,
  `age` tinyint(3) unsigned default NULL,
  `sex` enum('MALE','FEMALE') default NULL,
  PRIMARY KEY  (`id`),
  KEY `age_sex` (`age`,`sex`),
  KEY `sex_age` (`sex`,`age`)
) ENGINE=MyISAM AUTO_INCREMENT=4917 DEFAULT CHARSET=utf8

If we run ANALYZE TABLE on the table and then SHOW INDEXES we can see the following:

mysql> SHOW INDEXES FROM user;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| user  |          0 | PRIMARY  |            1 | id          | A         |        4916 |     NULL | NULL   |      | BTREE      |         |
| user  |          1 | age_sex  |            1 | age         | A         |         100 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | age_sex  |            2 | sex         | A         |         196 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | sex_age  |            1 | sex         | A         |           2 |     NULL | NULL   | YES  | BTREE      |         |
| user  |          1 | sex_age  |            2 | age         | A         |         196 |     NULL | NULL   | YES  | BTREE      |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.00 sec)

Assuming the male:female ratio in our dataset is 50:50 then a query with WHERE sex="MALE" is only ever going to help us find 50% of the rows. MySQL will then have to sift through that 50% of rows. Let’s demonstrate that:

mysql> EXPLAIN SELECT id FROM user WHERE sex="MALE";
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | sex_age       | sex_age | 2       | const | 3335 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

If age has a fairly even distribution across the age range you could use a query with WHERE age=31 and MySQL should be able to limit the relevant rows down to just ~1% of the dataset. This will mean that any further searching, either using the next part of a composite index or analysis or scanning the records will be much faster. Again EXPLAIN proves this:

mysql> EXPLAIN SELECT id FROM user WHERE age=32;
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | age_sex       | age_sex | 2       | const |    5 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

Cardinality and composite indexes

Keeping with our example dataset if we commonly query the table knowing the age and sex of the people. A typical query might look like:

SELECT id FROM people WHERE age=31 AND sex="MALE";

We’ve got the two composite indexes on the table (age, sex) and (sex, age) but which one will work best for this results. Based around everything learnt so far in this post (age, sex) should do. First let’s prove that the MySQL Optimiser agrees:

mysql> EXPLAIN SELECT id FROM user WHERE age=32 AND sex="MALE";
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys   | key     | key_len | ref         | rows | Extra       |
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
|  1 | SIMPLE      | user  | ref  | age_sex,sex_age | age_sex | 4       | const,const |    2 | Using where |
+----+-------------+-------+------+-----------------+---------+---------+-------------+------+-------------+
1 row in set (0.00 sec)

How do I know what cardinality my data has?

If you’re looking at indexes that are there already ANALYZE TABLE followed by SHOW INDEXES FROM user as above will do the trick. If the data isn’t there yet you can do a simple SELECT DISTINCT like this:

mysql> SELECT COUNT(DISTINCT age) AS cardinality FROM user;
+-------------+
| cardinality |
+-------------+
|         100 |
+-------------+
1 row in set (0.00 sec)

Why isn’t MySQL using my index?

Just because you have the index there it doesn’t mean that MySQL will decide to use it. Have a look at the following similar queries:

mysql> EXPLAIN SELECT id FROM user WHERE age>80;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | range | age_sex       | age_sex | 2       | NULL |  764 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT id FROM user WHERE age<80;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | ALL  | age_sex       | NULL | NULL    | NULL | 4916 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

The first query uses the index but the second one chooses to do a table scan. That’s no co-incidence, MySQL is smart enough to know about the distribution of the data and decides that it’d be faster to just munch through all of the data rather than the 80% that it could return from the index lookup. When our query is restrictive however with the >80 query it knows it’ll be better to use the index. Most of the time the optimizer gets it right!

There’s something very important to learn from this issue of distribution of data. Unless you’re checking query optimisation with real world queries and a real world dataset then you’re not optimising properly! In some cases it could pay to create some sample data (if you can predict what it’ll be like) with a simple script like the one I spun up for the user table we used as an example.

Low cardinality isn’t always bad

It could be that although the cardinality is low the selectivity is still high because of the distribution of data. Let’s change the characteristics of the sex column in the table:

mysql> UPDATE user SET SEX="MALE";
Query OK, 2503 rows affected (0.58 sec)
Rows matched: 4916  Changed: 2503  Warnings: 0

mysql> UPDATE user SET SEX="FEMALE" WHERE id<10;
Query OK, 9 rows affected (0.00 sec)
Rows matched: 9  Changed: 9  Warnings: 0

mysql> SELECT sex, COUNT(*) FROM user GROUP BY sex;
+--------+----------+
| sex    | COUNT(*) |
+--------+----------+
| MALE   |     4907 | 
| FEMALE |        9 | 
+--------+----------+
2 rows in set (0.00 sec)

We’ve now got low cardinality (just two possible values) but if we’re looking for just "FEMALE" rows then they should be easy to find with this index. Again EXPLAIN agrees:

mysql> EXPLAIN SELECT id FROM user WHERE sex="MALE";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | user  | ALL  | sex_age       | NULL | NULL    | NULL | 3687 | Using where | 
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.11 sec)

mysql> EXPLAIN SELECT id FROM user WHERE sex="FEMALE";
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | user  | ref  | sex_age       | sex_age | 2       | const |    9 | Using where | 
+----+-------------+-------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

What makes a good MySQL index? Part 1: Column Size

We all know that it’s good practice for queries to use indexes but it’s not that clear-cut, there are good indexes and poor indexes. By making indexes more efficient you stand to increase query speeds and the places that MySQL can use the indexes effectively.

Column Size

The size of the column or columns you’re indexing is relevant because the less data the database server has to search through or index the faster it’ll be and the less storage you’ll use on disk. In this first post we’ll just be concentrating on column size optimizations.

There’s a full list of all data types and how much storage they require in the MySQL documentation: Data Type Storage Requirements

Common places where the wrong data type could be specified include:

Numeric types

This is probably the biggest offender and most mis-understood. Use the table on the Numeric Types page in the MySQL documentation to work out which the smallest field type is. For example if you were storing a person’s age TINYINT UNSIGNED with a range of 0-255 would make sense instead of wasting 3 more bytes by using and INT field.

Date and Time Types

If you’re storing a DATE (3 bytes) don’t store it in a DATETIME (8 bytes) column.

Many times the data you’re storing in a DATETIME (8 bytes) field could fit in a TIMESTAMP (4 bytes) field though they don’t behave in exactly the same way. Check the documentation for details.

Enumerated values

You might be storing string values that have a really low entropy. Let’s say you have an office field in your database that could either be “London” or “Paris”. It makes more sense to keep it in an ENUM(“London”, “Paris”) (1 byte) than a VARCHAR()

IP addresses

An IPv4 address is just 4 bytes of data but to make it more human readable we usually talk about it in four integers of 0-255. A good way to store this data would be in an INT UNSIGNED and to use the handy INET_ATON() and INET_NTOA() functions to convert the INT data to/from a human readable IP address.

MD5, SHA1 or other hashes

As with IP addresses the hexadecimal representation of the hash we’re used to is just to make it more human readable. To illustrate this an MD5 value could either be stored in a VARCHAR(32) or a much more efficient BINARY(16), an SHA1 will fit into a BINARY(20). You can use the functions HEX() and UNHEX() to cast to/from the hex/binary format.

PROCEDURE ANALYSE

PROCEDURE ANALYSE is built into MySQL and can help you a lot with choosing the best data type if the table/data exists already.

Let’s use wp_options a table which is shipped as part of WordPress as an example. Here’s the table structure:

CREATE TABLE `wp_options` (
`option_id` bigint(20) unsigned NOT NULL auto_increment,
`blog_id` int(11) NOT NULL default '0',
`option_name` varchar(64) NOT NULL default '',
`option_value` longtext NOT NULL,
`autoload` varchar(20) NOT NULL default 'yes',
PRIMARY KEY  (`option_id`,`blog_id`,`option_name`),
KEY `option_name` (`option_name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

and here’s how we’d analyse selected columns in there:

mysql> SELECT option_id, blog_id, autoload FROM wp_options PROCEDURE ANALYSE() \G
*************************** 1. row ***************************
Field_name: foo.wp_options.option_id
Min_value: 1
Max_value: 3860
Min_length: 1
Max_length: 4
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 1169.5359
Std: 1416.1368
Optimal_fieldtype: SMALLINT(4) UNSIGNED NOT NULL
*************************** 2. row ***************************
Field_name: foo.wp_options.blog_id
Min_value: 0
Max_value: 0
Min_length: 1
Max_length: 1
Empties_or_zeros: 181
Nulls: 0
Avg_value_or_avg_length: 0.0000
Std: 0.0000
Optimal_fieldtype: ENUM('0') NOT NULL
*************************** 3. row ***************************
Field_name: foo.wp_options.autoload
Min_value: no
Max_value: yes
Min_length: 2
Max_length: 3
Empties_or_zeros: 0
Nulls: 0
Avg_value_or_avg_length: 2.7514
Std: NULL
Optimal_fieldtype: ENUM('no','yes') NOT NULL
3 rows in set (0.00 sec)

It looks like there were some good recommendations from it for the option_id and autoload columns but that the blog_id field it got wrong. Presumably there’s just one blog with id=0 in there at the moment so it’s understandable.