How to Avoid Character Encoding Problems in PHP

Character sets can be confusing at the best of times. This post aims to explain the potential problems and suggest solutions.

Although this post focuses on PHP and a typical LAMP stack, you can apply the same principles to any multi-tier stack.

If you’re in a hurry you can skim past this first “The boring history” section.

The boring history

Back in 1963 ASCII was published. It was a simple character set conceived in the US and designed as a standard to allow different systems to interact with one another. It includes letters, numbers and some common symbols, and it’s a 7-bit character set (ASCII Table).

This works OK for English-speaking countries but doesn’t help with other languages that use different characters, such as accented characters like é. Twenty or so years later the ISO-8859 set of standards was established. By then the byte (8 bits) had become the standard-sized chunk of data to send information around in, so these new character sets had space for another 128 characters. That was enough to create different sets for different languages/regions but not enough to put everything into a single character set.

ISO-8859-1 (also known as “latin1” or “Western European”) is probably the most commonly used, and 15 other character sets were defined too, including ISO-8859-2 (Central European), ISO-8859-3 (South European), etc. There’s a full list on Wikipedia.

This created a big problem: you need to know which character set is in use because, although the common ASCII characters are the same in each set, the same byte value can mean £ in one character set and Ł, Ŗ, Ѓ or ฃ in various others!

An easier solution would be to have all possible characters in a single character set, and that’s what UTF-8 does. It shares its first 128 characters with ASCII (it’s backwards compatible) but each character can be anything from one to four bytes in length. That gives it a staggering choice of 1,112,064 different characters and makes life a bunch easier, because you can use UTF-8 with your web application and it’ll work for everyone around the world.

There is another Unicode encoding called UTF-16, but it’s not backwards compatible with ASCII and is less widely used.

Conclusion of the boring history section

If you didn’t bother to read all of the section above there’s just one thing to take away from it: Use UTF-8

Where do the problems occur?

Problems can potentially occur anywhere that one part of your system talks to another. For a PHP/LAMP setup these components are:

* Your editor that you’re creating the PHP/HTML files in
* The web browser people are viewing your site through
* Your PHP web application running on the web server
* The MySQL database
* Anywhere else you’re reading/writing external data (memcached, APIs, RSS feeds, etc.)

To avoid these potential problems we’re going to make sure that every component is configured to use UTF-8 so that no mistranslation goes on anywhere.

Configuring your editor

Ensure that your text editor, IDE or whatever you’re writing the PHP code in saves your files in UTF-8 format. Your FTP, scp or SFTP client doesn’t need any special UTF-8 setting.

Making sure that web browsers know to use UTF-8

To make sure your users’ browsers all know to read/write all data as UTF-8 you can set this in two places.

The content-type <META> tag
Ensure the Content-Type <META> tag specifies UTF-8 as the character set, like this:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

The HTTP response headers
Make sure that the Content-Type HTTP response header also specifies UTF-8 as the character set. In PHP you can do this by setting the default_charset ini value:

ini_set('default_charset', 'utf-8');
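
If you’d rather not rely on the ini setting, sending the header explicitly from your code (before any output) achieves the same thing. A minimal sketch:

header('Content-Type: text/html; charset=utf-8');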

Configuring the MySQL Connection

Now that you know all of the data you’re receiving from users is in UTF-8 format, we need to configure the client connection between PHP and the MySQL database.

There’s a generic way of doing this by simply executing the MySQL query:

SET NAMES utf8;

…and depending on which client/driver you’re using there are helper functions to do this more easily instead:

With the built-in mysql functions

mysql_set_charset('utf8', $link);

With MySQLi

$mysqli->set_charset("utf8");

With PDO_MySQL (as you connect)

$pdo = new PDO( 
    'mysql:host=hostname;dbname=defaultDbName', 
    'username', 
    'password', 
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8") 
);
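
Whichever driver you use, it’s easy to verify what character set the connection has ended up with by inspecting MySQL’s session variables. A minimal sketch using mysqli ($mysqli is assumed to be an existing connection):

$result = $mysqli->query("SHOW VARIABLES LIKE 'character_set_%'");
while ($row = $result->fetch_assoc()) {
    // Expect character_set_client, character_set_connection and
    // character_set_results to all report utf8
    echo $row['Variable_name'] . ' = ' . $row['Value'] . "\n";
}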

The MySQL Database

We’re pretty much there now; you just need to make sure that MySQL knows to store the data in your tables as UTF-8. You can check each table’s encoding by looking at the Collation value in the output of SHOW TABLE STATUS (in phpMyAdmin this is shown in the list of tables).

If your tables are not already in UTF-8 (it’s likely they’re in latin1) then you’ll need to convert them by running the following command for each table:

ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
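
If you have more than a handful of tables, a short script saves some typing. Here’s a minimal sketch using mysqli that converts every table in the schema; the connection details are the same illustrative placeholders as in the PDO example above, and as always you should take a backup and test against a copy of your data first:

$mysqli = new mysqli('hostname', 'username', 'password', 'defaultDbName');
$mysqli->set_charset('utf8');

$tables = $mysqli->query('SHOW TABLES');
while ($row = $tables->fetch_row()) {
    // CONVERT TO changes the existing column data as well as the table default
    $mysqli->query('ALTER TABLE `' . $row[0] . '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci');
}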

One last thing to watch out for

With all of these steps complete, your application should now be free of character set problems.

There is one thing to watch out for: most of the PHP string functions are not Unicode-aware, so for example if you run strlen() against a string containing multi-byte characters it’ll return the number of bytes in the input, not the number of characters. You can work around this by using the Multibyte String PHP extension, though it’s not that common for these byte/character issues to cause problems.
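
As a quick illustration of the difference (assuming the mbstring extension is installed and the file is saved as UTF-8):

$string = 'héllo'; // 5 characters, but the é takes two bytes in UTF-8

echo strlen($string);             // 6 (counts bytes)
echo mb_strlen($string, 'UTF-8'); // 5 (counts characters)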

How to work around Amazon EC2 outages

Today’s Amazon EC2 outages (which at the time of writing are still ongoing) have meant downtime for lots of their customers, including household names like Quora, FourSquare and reddit. The problem is with their Elastic Compute Cloud (EC2) service in one of the availability zones in their US East (N. Virginia) region.

Often problems like this are localised to one availability zone (datacentre), which gives you a number of ways of working around the problem.

Elastic IP Addresses

By using an Elastic IP Address you can bring up a new instance in another availability zone and then bind the Elastic IP to it. There’d likely be some manual intervention from you to do this and you’d need to make sure that you had a decent enough backup on EBS or a snapshot to resume from.

Elastic Load Balancing

Using an Elastic Load Balancer you can spread the load between servers in multiple availability zones. This could allow you to have, say, one web server in each Eastern US zone, so the loss of one zone like today’s should be handled transparently. This would be easy to implement for a simple website, but to create full redundancy of backend data (in an RDBMS, etc.) you’d need to set up appropriate data replication there too. In theory this approach should allow a zone failure to be completely transparent to your users.

Low DNS TTLs

If you’re not willing to pay for Elastic IPs or Elastic Load Balancing then you could manually redirect traffic in the event of an outage to a new AWS instance, or to another ISP for that matter. Read more about DNS TTLs here: Using DNS TTL to control migrations

Disaster Recovery and Backups

You need to decide what level of Disaster Recovery you require. It’s usually a trade-off between the cost of the downtime to your business and the cost of implementing it. You could decide that in the event of a rare outage it’s acceptable to just display a “sorry, we’re having problems” page served from an instance that you only bring up in the event of problems. If your requirement is to bring up a full copy of the site in a new zone, here are some suggestions as to how you could do this.

Amazon Elastic Block Store (EBS) supports snapshots, which are persisted to Amazon S3 and available across all zones in that region. This would be a great way of keeping backups if you can live with resuming from slightly older snapshotted data. All you need to do is bring up the new instance in one of the fully-functioning zones and attach an EBS volume derived from the snapshot.

If using snapshotted data isn’t acceptable then you’d need to look at implementing your own replication of data. Almost all of the commonly used RDBMS/NoSQL applications support replication and setting up replicas is fairly standard operationally.

Big datasets for full-text search benchmarking

A few times recently I’ve looked for large datasets to experiment/benchmark against and I usually manage to come up blank.

I managed to spend longer than usual on this problem yesterday and came up with some which I’ll share with you.

Project Gutenberg

http://www.gutenberg.org/

This project hosts the content of over 33,000 books. You can download the data as one book per file and there are full instructions for downloading/mirroring here. It seems that they’ve blocked Amazon AWS IP ranges from mirroring content from their site, which is a shame.

The Westbury Lab USENET Corpus

http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

The contents of some USENET groups have been anonymised and cleaned up to form a set containing >28M documents and >25B words. Each week’s data is stored as a single text file and each post is simply delimited, which makes parsing a breeze. You can easily load this data into a MySQL database with a command similar to:

LOAD DATA LOCAL INFILE '/path/to/file.txt' INTO TABLE myTable LINES TERMINATED BY '---END.OF.DOCUMENT---';

It’s also available as part of Amazon’s AWS Public Datasets offering as EBS snapshot snap-c1d156aa in the US West AWS region. Using AWS is a really quick way of getting hold of this dataset without the need to wait for any downloading to complete.

I found this a really nice dataset; with each document at around 5kB it seemed a sensible size for benchmarking email body text, blog/publishing posts, etc.

Wikipedia Text

http://dumps.wikimedia.org/
Wikipedia provides huge database dumps. It seems that there’s an AWS snapshot snap-8041f2e9 which contains this data too, but it’s a couple of years old. There’s also a “WEX” extract (snap-1781757e on AWS) created by the Freebase team which is provided as XML markup ready to be easily imported into a db table, one row per article.

In doing this research I came across a couple of very interesting projects that extract/compile metadata from various sources including Wikipedia: Freebase and DBpedia. I hope to play with some of their datasets and write a post on that in the future.

MySQL Server’s built-in profiling support

MySQL’s SHOW PROFILES command and its profiling support are something that I can’t believe I hadn’t spotted before today.

It allows you to enable profiling for a session and then record performance information about the queries executed. It shows details of the different stages in the query execution (as usually displayed in the thread state output of SHOW PROCESSLIST) and how long each of these stages took.

I’ll demonstrate using an example. First, within our session, we need to enable profiling. You should only do this in sessions that you want to profile, as there’s some overhead in performing/recording the profiling information:

mysql> SET profiling=1;
Query OK, 0 rows affected (0.00 sec)

Now let’s run a couple of regular SELECT queries

mysql> SELECT COUNT(*) FROM myTable WHERE extra LIKE '%zkddj%';
+----------+
| COUNT(*) |
+----------+
|        0 | 
+----------+
1 row in set (0.32 sec)

mysql> SELECT COUNT(id) FROM myTable;
+-----------+
| COUNT(id) |
+-----------+
|    513635 | 
+-----------+
1 row in set (0.00 sec)

Followed up with some stuff that we know is going to execute a bit slower:

mysql> CREATE TEMPORARY TABLE foo LIKE myTable;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO foo SELECT * FROM myTable;
Query OK, 513635 rows affected (33.53 sec)
Records: 513635  Duplicates: 0  Warnings: 0

mysql> DROP TEMPORARY TABLE foo;
Query OK, 0 rows affected (0.06 sec)

Now that we’ve run the queries, let’s look at their summary with SHOW PROFILES:

mysql> SHOW PROFILES;
+----------+-------------+-------------------------------------------------------------------+
| Query_ID | Duration    | Query                                                             |
+----------+-------------+-------------------------------------------------------------------+
|        1 |  0.33174700 | SELECT COUNT(*) FROM myTable WHERE extra LIKE '%zkddj%'           | 
|        2 |  0.00036600 | SELECT COUNT(id) FROM myTable                                     | 
|        3 |  0.00087700 | CREATE TEMPORARY TABLE foo LIKE myTable                           | 
|        4 | 33.52952000 | INSERT INTO foo SELECT * FROM myTable                             | 
|        5 |  0.06431200 | DROP TEMPORARY TABLE foo                                          | 
+----------+-------------+-------------------------------------------------------------------+
5 rows in set (0.00 sec)

It’s not as if any of those numbers are a surprise, since we saw them from the client, but it’s a handy record of the execution times and could easily be queried within an application just before the connection to the database is closed, e.g. at the end of a web request.
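
As a rough sketch of that idea (using mysqli, with error_log() standing in for whatever logging you actually use), you could record the summary just before closing the connection:

// Assumes profiling was switched on earlier with: $mysqli->query('SET profiling=1');
$result = $mysqli->query('SHOW PROFILES');
while ($row = $result->fetch_assoc()) {
    // Columns are Query_ID, Duration and Query
    error_log(sprintf('query %d took %.5fs: %s', $row['Query_ID'], $row['Duration'], $row['Query']));
}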

We can also dig deeper into each of the commands. Let’s look at the first query we ran:

mysql> SHOW PROFILE FOR QUERY 1;
+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000033 | 
| checking query cache for query | 0.000073 | 
| Opening tables                 | 0.000013 | 
| System lock                    | 0.000007 | 
| Table lock                     | 0.000035 | 
| init                           | 0.000032 | 
| optimizing                     | 0.000014 | 
| statistics                     | 0.000016 | 
| preparing                      | 0.000014 | 
| executing                      | 0.000009 | 
| Sending data                   | 0.331296 | 
| end                            | 0.000016 | 
| end                            | 0.000003 | 
| query end                      | 0.000005 | 
| storing result in query cache  | 0.000105 | 
| freeing items                  | 0.000012 | 
| closing tables                 | 0.000007 | 
| logging slow query             | 0.000003 | 
| logging slow query             | 0.000048 | 
| cleaning up                    | 0.000006 | 
+--------------------------------+----------+
20 rows in set (0.00 sec)

It looks like almost all of the time was spent executing the query; definitely one worth investigating further with EXPLAIN.

Now let’s look at the slow INSERT sub-select we ran to see what took the time. I’ve enabled CPU profiling here too.

mysql> SHOW PROFILE CPU FOR QUERY 4;
+----------------------+-----------+-----------+------------+
| Status               | Duration  | CPU_user  | CPU_system |
+----------------------+-----------+-----------+------------+
| starting             |  0.000069 |  0.000000 |   0.000000 | 
| checking permissions |  0.000010 |  0.000000 |   0.000000 | 
| Opening tables       |  0.000217 |  0.000000 |   0.000000 | 
| System lock          |  0.000006 |  0.000000 |   0.000000 | 
| Table lock           |  0.000014 |  0.000000 |   0.000000 | 
| init                 |  0.000041 |  0.000000 |   0.000000 | 
| optimizing           |  0.000007 |  0.000000 |   0.000000 | 
| statistics           |  0.000014 |  0.000000 |   0.000000 | 
| preparing            |  0.000013 |  0.000000 |   0.000000 | 
| executing            |  0.000006 |  0.000000 |   0.000000 | 
| Sending data         |  4.326303 |  3.544221 |   0.324020 | 
| Creating index       |  0.000029 |  0.000000 |   0.000000 | 
| Repair by sorting    | 29.202254 | 17.133071 |  11.616726 | 
| Saving state         |  0.000040 |  0.000000 |   0.000000 | 
| Creating index       |  0.000007 |  0.000000 |   0.000000 | 
| Sending data         |  0.000389 |  0.000000 |   0.000000 | 
| end                  |  0.000009 |  0.000000 |   0.000000 | 
| end                  |  0.000012 |  0.000000 |   0.000000 | 
| query end            |  0.000006 |  0.000000 |   0.000000 | 
| freeing items        |  0.000015 |  0.000000 |   0.000000 | 
| closing tables       |  0.000007 |  0.000000 |   0.000000 | 
| logging slow query   |  0.000005 |  0.000000 |   0.000000 | 
| logging slow query   |  0.000040 |  0.000000 |   0.000000 | 
| cleaning up          |  0.000007 |  0.000000 |   0.000000 | 
+----------------------+-----------+-----------+------------+
24 rows in set (0.00 sec)

It seems that building the indexes for the new table was what took the time. The General Thread States page of the MySQL Documentation is a useful reference. Interestingly we can see the “logging slow query” state here too, something that sails by too quickly to ever see when looking at SHOW PROCESSLIST output.

This profiling support doesn’t fulfil the same role as MySQL’s EXPLAIN command and is only useful in some places, but if you were to look at implementing profiling or instrumentation for your app this could be really handy.

You can find the full documentation for MySQL’s profiling support under SHOW PROFILES Syntax. It appears to be supported from at least MySQL 5.0, and it’s worth noting that it’s only available in the MySQL Community (non-Enterprise) builds.