How to Avoid Character Encoding Problems in PHP

Character sets can be confusing at the best of times. This post aims to explain the potential problems and suggest solutions.

Although this post focuses on PHP and a typical LAMP stack, you can apply the same principles to any multi-tier stack.

If you’re in a hurry you can skim past this first “The boring history” section.

The boring history

ASCII was published back in 1963. It was a simple character set conceived in the US and designed as a standard to allow different systems to interact with one another. It includes letters, digits and some common symbols. It’s a 7-bit character set, which gives it room for 128 characters.

This works OK for English-speaking countries but doesn’t help with languages that need other characters, such as accented characters like é. Twenty or so years later the ISO-8859 set of standards was established. By then bytes (8 bits) had become the standard-sized chunk for sending data around. The extra bit gave these new character sets space for another 128 characters. This was enough space to create different sets for different languages/regions but not enough to put everything into a single character set.

ISO-8859-1 (also known as “latin1” or “Western European”) is probably the most commonly used, and more than a dozen other character sets were defined too, including ISO-8859-2 (Central European), ISO-8859-3 (South European), etc. There’s a full list on Wikipedia.

This created a big problem: you need to know which character set you’re using. Although the common ASCII characters are the same everywhere, the same byte value is £ in one character set and Ł, Ŗ, Ѓ or ฃ in various others!
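To make that concrete, here is a small sketch that decodes the single byte 0xA3 under three of the ISO-8859 character sets. It assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: the mbstring extension is available.
$byte = "\xA3"; // one byte with the value 0xA3

$charsets = ['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-5'];
$decoded = [];
foreach ($charsets as $charset) {
    // Interpret the byte in the given character set and re-encode it as UTF-8.
    $decoded[$charset] = mb_convert_encoding($byte, 'UTF-8', $charset);
}

// Print each interpretation together with its UTF-8 bytes in hex.
foreach ($decoded as $charset => $char) {
    echo $charset, ': ', $char, ' (', bin2hex($char), ")\n";
}
```

The byte never changes; only the label attached to it does, which is why every tier has to agree on the character set.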

An easier solution would be to have all possible characters in a single character set, and that’s what Unicode provides; UTF-8 is the encoding of it most widely used on the web. UTF-8 encodes the 128 ASCII characters exactly as ASCII does (it’s backward compatible), and each character takes anything from one byte to four bytes. That gives it a staggering range of 1,112,064 different code points. That makes life a lot easier, because you can use UTF-8 with your web application and it’ll work for everyone around the world.
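As a rough illustration of the one-to-four-byte range, the sketch below prints the UTF-8 bytes of four sample characters. The characters are written as explicit byte escapes so the result does not depend on how the script file itself is saved; the labels are just descriptive names chosen for this example.

```php
<?php
// Each sample is spelled out as its UTF-8 byte sequence.
$samples = [
    'letter A'      => "A",                // U+0041, plain ASCII
    'e acute'       => "\xC3\xA9",         // U+00E9
    'euro sign'     => "\xE2\x82\xAC",     // U+20AC
    'grinning face' => "\xF0\x9F\x98\x80", // U+1F600
];

// strlen() counts bytes, which is exactly what we want to see here.
$lengths = array_map('strlen', $samples);

foreach ($samples as $label => $char) {
    echo $label, ': ', $lengths[$label], ' byte(s), hex ', bin2hex($char), "\n";
}
```

The plain letter comes out as the single byte 41, the same value it has in ASCII, which is the backward compatibility mentioned above.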

There is another Unicode encoding called UTF-16, but it’s not backward compatible with ASCII and is less widely used on the web.

Conclusion of the boring history section

If you didn’t bother to read all of the section above there’s just one thing to take away from it: Use UTF-8

Where do the problems occur?

Problems can occur anywhere that one part of your system talks to another. For a PHP/LAMP setup these components are:

* Your editor that you’re creating the PHP/HTML files in
* The web browser people are viewing your site through
* Your PHP web application running on the web server
* The MySQL database
* Anywhere else external you’re reading/writing data from (memcached, APIs, RSS feeds, etc)

To avoid these potential problems we’re going to make sure that every component is configured to use UTF-8 so that no mis-translation goes on anywhere.

Configuring your editor

Ensure that your text editor, IDE or whatever you’re writing the PHP code in saves your files in UTF-8 format. Your FTP, scp or SFTP client doesn’t need any special UTF-8 setting.
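If you are not sure what your editor is doing, you can check a saved file from PHP. This is a minimal sketch, assuming PHP 7 or later and the mbstring extension; isUtf8File() is a hypothetical helper name, and the two temporary files stand in for your real source files.

```php
<?php
// Hypothetical helper: true when the file's bytes form valid UTF-8.
function isUtf8File(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return mb_check_encoding($contents, 'UTF-8');
}

// Two throwaway files holding the word "café" in different encodings.
$utf8File = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($utf8File, "caf\xC3\xA9"); // UTF-8
$latin1File = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($latin1File, "caf\xE9");   // ISO-8859-1

$utf8Result = isUtf8File($utf8File);
$latin1Result = isUtf8File($latin1File);
var_dump($utf8Result, $latin1Result);

// Clean up the temporary files.
unlink($utf8File);
unlink($latin1File);
```

A file containing only ASCII passes this check whichever way it was saved, so it only tells you something once the file contains non-ASCII characters.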

Making sure that web browsers know to use UTF-8

To make sure your users’ browsers all know to read/write all data as UTF-8 you can set this in two places.

The content-type <META> tag
Ensure the content-type META tag specifies UTF-8 as the character set like this:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

The HTTP response headers
Make sure that the Content-Type response header also specifies UTF-8 as the character set. PHP adds the charset to the Content-Type header it sends when default_charset is set, like this:

ini_set('default_charset', 'utf-8');
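Here is a short sketch of the two ways to get the charset into the response header, plus escaping output with the encoding named explicitly. The $name value is made up for the example; in a real page it would come from your application.

```php
<?php
// Option 1: let PHP append the charset to the Content-Type header it sends by default.
ini_set('default_charset', 'UTF-8');

// Option 2: send the header yourself. This must happen before any output,
// otherwise PHP raises the "headers already sent" warning.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Illustrative value: "Zoë <admin>" written as UTF-8 bytes.
$name = "Zo\xC3\xAB <admin>";

// Name the encoding explicitly so escaping does not depend on the server's defaults.
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
echo "<p>Hello, $safe</p>\n";
```

Either option is enough on its own; the ini_set() route has the advantage that it still works when you occasionally send a different content type.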

Configuring the MySQL Connection

Now that you know all of the data you’re receiving from the users is in UTF-8 format, we need to configure the client connection between PHP and the MySQL database.

There’s a generic way of doing this by simply executing the MySQL query:

SET NAMES utf8;

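…and depending on which client/driver you’re using there are helper functions that do this for you. They are preferable to running SET NAMES yourself because they also tell the client library which character set is in use, which escaping functions such as mysql_real_escape_string() rely on: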

With the built-in mysql functions (deprecated in PHP 5.5 and removed in PHP 7.0)

mysql_set_charset('utf8', $link);

With MySQLi

$mysqli->set_charset("utf8");

With PDO_MySQL (as you connect)

// Run SET NAMES as soon as the connection is opened.
$pdo = new PDO(
    'mysql:host=hostname;dbname=defaultDbName',
    'username',
    'password',
    array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8")
);
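As of PHP 5.3.6 the PDO_MySQL DSN also accepts a charset parameter, which sets the connection character set without a separate query. The sketch below only builds the DSN string; buildMysqlDsn() is a hypothetical helper, and the host, database and credentials are the same placeholders used above.

```php
<?php
// Hypothetical helper that builds a PDO_MySQL DSN including the charset parameter.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = buildMysqlDsn('hostname', 'defaultDbName');
echo $dsn, "\n";

// Connecting needs a running MySQL server, so it is left commented out here:
// $pdo = new PDO($dsn, 'username', 'password', array(
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ));
```

On older PHP versions the charset parameter is ignored, so there you still need the SET NAMES init command.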

The MySQL Database

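We’re pretty much there now; you just need to make sure that MySQL knows to store the data in your tables as UTF-8. You can check their encoding by looking at the Collation value in the output of SHOW TABLE STATUS (in phpMyAdmin this is shown in the list of tables). The collation name starts with the character set it belongs to, so utf8_general_ci means the table default is utf8; SHOW CREATE TABLE shows the character set explicitly as DEFAULT CHARSET.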

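If your tables are not already in UTF-8 (it’s likely they’re in latin1) then you’ll need to convert them by running the following command for each table. CONVERT TO changes the existing text columns as well as the table default, whereas a plain CHARACTER SET clause only changes the default used for columns added later:

ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Note that MySQL’s utf8 character set stores at most three bytes per character; on MySQL 5.5.3 or later, utf8mb4 covers the full four-byte range of UTF-8.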
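If you have many tables, you could generate the statements rather than typing them. This is only a sketch: buildConvertStatements() and the table names are hypothetical, and it uses CONVERT TO CHARACTER SET, which converts existing text columns as well as changing the table default. Back up your data and review the statements before running them.

```php
<?php
// Hypothetical helper: one ALTER TABLE statement per table name.
function buildConvertStatements(
    array $tables,
    string $charset = 'utf8',
    string $collation = 'utf8_general_ci'
): array {
    $statements = [];
    foreach ($tables as $table) {
        // Quote the identifier with backticks, doubling any backtick inside the name.
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        $statements[] = "ALTER TABLE $quoted CONVERT TO CHARACTER SET $charset COLLATE $collation;";
    }
    return $statements;
}

// Illustrative table names.
$statements = buildConvertStatements(['users', 'posts']);
echo implode("\n", $statements), "\n";
```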

One last thing to watch out for

With all of these steps complete now your application should be free of any character set problems.

There is one thing to watch out for: most of the PHP string functions are not Unicode aware. For example, if you run strlen() against a string containing multi-byte characters it’ll return the number of bytes in the input, not the number of characters. You can work round this by using the Multibyte String (mbstring) PHP extension, though it’s not that common for these byte/character issues to cause problems.
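The difference is easy to see with a word that contains one accented letter. This sketch assumes the mbstring extension is loaded; the encoding is passed explicitly to each mb_ function so the result does not depend on mb_internal_encoding().

```php
<?php
// "naïve" as UTF-8 bytes; the ï takes two bytes.
$word = "na\xC3\xAFve";

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

// substr() works on bytes, so this cut lands in the middle of the ï.
$byteCut = substr($word, 0, 3);
// mb_substr() works on characters and keeps the ï whole.
$charCut = mb_substr($word, 0, 3, 'UTF-8');

echo "bytes: $byteLength, characters: $charLength\n";
echo 'byte cut is valid UTF-8: ', var_export(mb_check_encoding($byteCut, 'UTF-8'), true), "\n";
echo 'character cut: ', $charCut, "\n";
```

A string cut mid-character is no longer valid UTF-8, which is the kind of surprise that shows up with substr() on non-English input.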

About James Cohen
LAMP geek with interests in building scalable web applications

19 Responses to How to Avoid Character Encoding Problems in PHP

  1. Ivan Frantar says:

    I would go with the simpler: <meta charset="utf-8">

  2. Martin Stricker says:

    From my experience: The “One last thing to watch out for” should have the heading “… always to watch out for”, at least if you are handling non-English input. Otherwise you’ll have really unpleasant surprises with functions like substr() et al.

  3. Rocco says:

    What’s easier instead of calling

    header('Content-type: text/html; charset=utf-8');
    

    is

    ini_set('default_charset', 'utf-8');
    

    This way you don’t have to worry about the infamous ‘headers already sent’ error and also works if you need to send a different content-type on occasion

  4. Pingback: Brian Swan's Blog

  5. mslade says:

    The article calls ASCII a 7-byte character set, but I believe you mean 7-bit.

  6. Thank you for such a clear explanation… Been fighting with charsets for years!

    still 7-byte instead of 7-bit error… ;-)

    I would like to know how to clean entries from Forms, Post/Get, and also how to have all info ready to insert to database and read from database

    Thanks

    Daniel

  7. I found this:

    You should never use SET NAMES in a query with the mysql extension. If you do, mysql_real_escape_string() won’t be notified of the change, and it’ll keep escaping your data for latin1, which can open security holes. Use mysql_set_charset('utf8'); instead. Same thing, much safer ;)

    In this post at the bottom feedback:
    http://tympanus.net/codrops/2009/08/31/solving-php-mysql-utf-8-issues/

  8. isaac says:

    my .php file was encoded as utf8 and not sending the accented characters to the database. I saved it encoded as ISO-8859-1 (Latin 1) and it solved the problem.

  9. Pingback: PHP vs. The Developer: Encoding Character Sets - Digital Conversations

  10. Aleksandra says:

    Thank you very much, great article..you helped me .. :))

    Best regards,
    Aleksandra

  11. Erik says:

    Thanks! this post helped me out a lot!

  12. Dotan Cohen says:

    The article states:

    > You can check their encoding by looking at the Collation value in the output of SHOW TABLE STATUS

    That is incorrect. The collation value determines in what order MySQL will sort. The value that you should look at to check the encoding is “CHARACTER SET” or “CHARSET”.

  13. Tim says:

    Nice article.

    What about setting mb_internal_encoding as well?

  14. Dan says:

    Great article, the point about string functions not being Unicode aware helped me a lot!
