Article: Building Unicode LAMP applications
FERDY CHRISTANT - NOV 14, 2008 (08:28:04 AM)
This article describes in detail what it takes to build a LAMP application that understands Unicode. That is, an application that allows for the input, processing, storage and display of Unicode characters. Many developers are aware of what Unicode is, and some apply it on a regular basis. For a long time, I was in the category of developers who can get away with not knowing about Unicode at all. When you're working in a controlled environment with a fixed audience and corporate standards for language and character sets, Unicode is less relevant. However, when I started development on Project JungleDragon, a global web application, Unicode instantly became relevant. In a global web application, you surely do not want to alienate a large portion of your audience by allowing them to input Latin characters only.
With a clear requirement for Unicode support in my project, I ignorantly assumed it was a matter of setting the correct HTML headers in my scripts. Needless to say, I would not be writing this article if it were that simple. The good news is that each step is quite simple. There are just a lot more steps than I expected, and you really have to know what you're doing.
What is Unicode?
I will not attempt to fully describe what Unicode is; there are better places for that. Instead, I will only list things you need to know and remember as a developer:
- Unicode is a character set, much like the Latin character set. Unicode, however, is a standardized character set with room for over a million code points, enough to cover most living languages of the planet, along with many additional symbols. A character set is simply a list of character definitions mapped to unique numbers (code points).
- Unicode is a superset of the Latin character set (which is a common default character set in many programming environments). This means that you can always convert from the Latin character set to Unicode. You can also convert from Unicode to the Latin character set, as long as the Unicode characters to convert are within the range of the Latin character set.
- Unicode characters can be stored in different ways, called encodings. The most common Unicode encoding for web applications is UTF-8. UTF-8 uses between 1 and 4 bytes per character, depending on the code point of the character. A basic Latin (ASCII) character takes 1 byte, an accented Latin character such as é takes 2 bytes, and a Chinese character is likely to take 3 bytes in storage.
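The byte counts are easy to verify yourself. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escape); the sample characters are my own picks for illustration. Note that code points beyond the most common ranges, such as emoji, take 4 bytes in UTF-8.

```php
<?php
// Requires PHP 7+: each "\u{...}" escape produces the UTF-8 bytes of that code point.
$samples = [
    "\u{0041}",  // U+0041, Latin capital letter A
    "\u{00E9}",  // U+00E9, Latin small letter e with acute
    "\u{4E2D}",  // U+4E2D, a Chinese character
    "\u{1F600}", // U+1F600, an emoji
];

foreach ($samples as $char) {
    // strlen() counts bytes, bin2hex() shows which bytes they are.
    printf("%d byte(s): %s\n", strlen($char), bin2hex($char));
}
// prints:
// 1 byte(s): 41
// 2 byte(s): c3a9
// 3 byte(s): e4b8ad
// 4 byte(s): f09f9880
```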
How to use Unicode in a LAMP application
What better way to spoil an article than to jump to the conclusions right away? The two most important things to keep in mind when supporting Unicode in your application are:
- To control the complete stack. Every part of your application needs to be Unicode enabled, including the database, web server, scripts, HTML and forms. Some parts of the stack make character set encoding assumptions based on other parts of the stack. Do not rely on this; enforce UTF-8 on all parts. It's the only way to get it to work reliably, and you do not want to be dependent on a web server configuration or browser behavior.
- To support Unicode from the start. Trust me, you do not want to convert your stack to Unicode when you're in production with live data that is in a different encoding.
This article will focus on supporting Unicode throughout the LAMP stack, starting with the database.
MySQL and Unicode
In a typical LAMP application, we store our data in a MySQL database. In order for MySQL to store data in UTF-8 format, we need to explicitly tell it to do so. This happens at multiple levels. There are actually two things to set: the character set and the collation. Character set is a poorly chosen term here, as it is really the encoding that we're setting. It will be UTF-8 (utf8 in MySQL's naming) for what we want. Collation is the set of rules by which that data is compared and sorted. For example, the German language sorts characters differently from the English language. Language-specific collation is only relevant if you intend to localize your application (referred to as L10N), that is, you want to provide a language and culture-specific experience. We will not go this far, however. We want a collation that supports all languages, not one specifically. Therefore, we will go for the most common collation value: utf8_general_ci.
MySQL server settings
At the MySQL server level, you can set the default character set and collation. In the screenshot below you can see how in PHPMyAdmin my server is set to the correct values for Unicode support:
You can set these values at the server level by editing the my.cnf configuration file, using the character-set-server and collation-server options in the [mysqld] section.
This change requires a server restart, after which new databases will use the character set/encoding by default. If you do not have control over this setting, you can enforce it at a lower level, which is a good practice anyway.
MySQL database settings
MySQL allows you to set the encoding and collation per database, upon creation or afterwards. Any good MySQL administration tool allows you to do this. In plain SQL it looks like this:
CREATE DATABASE myDatabase DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
MySQL table settings
In addition to the server and database settings, MySQL allows you to set the character set and collation per table. Use your favorite MySQL administration tool to set the collation to utf8_general_ci.
MySQL column settings
Still not there yet. Even at the column level, you can set the character set and collation. This only applies to text columns, such as CHAR, VARCHAR, TEXT, ENUM and SET. By the way, MySQL intelligently handles the length of these columns. For example, if you allow for a length of 2 for a country code column, you do not have to triple the length to cope with the byte size of UTF-8 characters. MySQL's length means character length, not byte length. On another note, it is up to you to intelligently decide which text columns should be UTF-8. A common scenario is to UTF-enable only user-input text columns, such as the body of a comment, whereas administration data (such as a list of countries) can be stored in Latin just fine (unless you want to localize the country names themselves).
Tip: Do not use CHAR for your UTF-8 text columns. MySQL's utf8 character set uses at most three bytes per character, and a fixed-length CHAR column reserves that maximum for every character, which triples the byte size of the column.
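To make the database, table and column levels concrete, here is a sketch of the statements involved, held in a PHP array so they can be run from a setup script. The database name is the one used above; the "comments" table, its columns and the function name are made up for illustration.

```php
<?php
// Hypothetical schema for illustration: returns the SQL that enforces UTF-8
// at database, table and column level.
function utf8_schema_statements(): array
{
    return [
        // Database level: default for every table created in it.
        'CREATE DATABASE IF NOT EXISTS myDatabase
            DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci',

        // Table level: default for every text column that does not override it.
        // The country code holds administration data, so it stays in latin1.
        'CREATE TABLE comments (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            country_code VARCHAR(2) CHARACTER SET latin1 NOT NULL,
            body TEXT NOT NULL
        ) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci',

        // Column level: switch a single existing column to UTF-8.
        'ALTER TABLE comments
            MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL',
    ];
}
```

Given an open mysqli connection in a variable such as $link, running them is a matter of looping over the array and calling $link->query($sql) for each statement.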
MySQL connection settings
Are we done yet? Sigh. No. There is one more thing to take care of: the MySQL connection. Each time your scripts fire one or more queries, they open a MySQL connection. You need to explicitly tell MySQL that your connection carries UTF-8 data. There are two ways to do this:
- Server-level. You can configure the MySQL server in my.cnf:
init-connect = 'SET NAMES utf8'
- Query-level. After opening a connection, you explicitly send the query "SET NAMES utf8" before your other queries. The setting lasts for the lifetime of the connection, so once per connection is enough. This sounds like a cumbersome thing to do, yet it is a common approach, since you may not have control over the server settings. If you're lucky, you make use of a framework that has a centralized data access class. This way you only have to add this query in one place. I'm using Code Igniter for my project, which takes care of this automatically. However, if you have direct MySQL access across your scripts, you need to edit every place where a connection is opened.
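If you do not use a framework, a small wrapper keeps the connection setup in one place. The following is a sketch, assuming the mysqli extension; the function name and its parameters are hypothetical. It uses mysqli's set_charset() method, which has the same effect as sending "SET NAMES utf8" and also tells the client library which encoding to use when escaping strings.

```php
<?php
// Hypothetical helper: opens a mysqli connection and forces UTF-8 on it.
// Host, credentials and database name are placeholders supplied by the caller.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }

    // Same effect as "SET NAMES utf8", and real_escape_string() now knows
    // the connection encoding as well.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not switch to utf8: ' . $link->error);
    }

    return $link;
}
```

Every script then calls open_utf8_connection() instead of connecting directly, so no query can run on a connection that has not been switched to UTF-8.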
Finally. We're done with the database. Oh wait, there's one more thing. Some things in MySQL may behave strangely when using UTF-8 for text columns, such as sorting and some of the built-in SQL functions. So far I've had no issues, but consider this a warning.
Now would be the time to test if your database setup is correctly configured. Head over to this site, copy some crazy Unicode characters and insert them into your database. Next, check the result. If you're seeing "????" instead of the Unicode characters, something's wrong.
The source code
With the database fully set up to accept and store UTF-8, let us now move on to an easier part. This is a part you might not expect to be relevant, but it is. Your script files, your actual code, the files you work with in your IDE of choice... they too have to be in UTF-8. Really. To start with, your code may contain UTF-8 comments. Even when this is not the case, your script may not be able to parse incoming data correctly if the script itself is not stored as UTF-8. I've yet to discover why this is the case, but there are countless reports of this issue on the web. Luckily, the fix is easy: most modern IDEs allow you to set the encoding of your code files. In Zend, it looks like this:
Check the manual of your IDE on how to do this. If you have existing code files that you want to convert to Unicode, you may want to use a tool. On Linux systems, iconv is a common one. And please, stay away from things like Notepad to edit your PHP, it will likely destroy the encoding of the file.
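PHP exposes the same conversion library through its iconv() function, so a one-off conversion can also be scripted. This is a minimal sketch, assuming the existing files are in ISO-8859-1 (Latin-1); the helper name is hypothetical.

```php
<?php
// Hypothetical helper: re-encodes ISO-8859-1 (Latin-1) bytes as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    $converted = iconv('ISO-8859-1', 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException('iconv() could not convert the input');
    }
    return $converted;
}

// "caf\xE9" is "café" in Latin-1; in UTF-8 the é becomes the two bytes C3 A9.
echo bin2hex(latin1_to_utf8("caf\xE9")), "\n"; // prints 636166c3a9
```

To convert a file, pass the result of file_get_contents() through the helper and write it back with file_put_contents(). Do this only once per file: converting a file that is already UTF-8 encodes it a second time and garbles every non-ASCII character.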
HTML and UTF-8
So, we have our database and code files in UTF-8. It is time to move on to another easy, yet necessary part: our front-end. A typical web application will output HTML or XML markup. The way browsers guess the character set of the markup is not reliable. Instead, we will want to explicitly tell the browser what character set and encoding to use. This is easy to do. In PHP, on every page we serve, we need to explicitly set this in the header before we output any markup:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
This tells the browser that the document requested is encoded as UTF-8. Note that you can also set this at the web server level, in the Apache configuration or in a .htaccess file. I recommend doing this explicitly in code though, just to be sure, and so as not to rely on a web server configuration.
Since we're so obsessed with UTF-8 now, we also want to give browsers another hint that we're serving UTF-8 content, by setting the content type in the header of the actual HTML:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
You need to place that tag as high in the <head> section as possible (even before <title>, because for some browsers it is a trigger to rerender the page).
To finish this part, the last thing to do is make all our forms aware of UTF-8:
<form action="yourposturl" method="post" accept-charset="UTF-8">
Finally, let us test if UTF-8 content is served to the browser. Open your page in Firefox, right-click on it and choose "View Page Info":
Note: If you're serving XML (for example for an RSS feed), you can explicitly specify the encoding in the XML declaration at the top of the document.
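For XML output, the HTTP header and the XML declaration should name the same encoding. Below is a sketch of what that could look like for a feed. Both function names are hypothetical, the feed is a bare skeleton rather than a complete RSS document, and it assumes PHP 5.4 or later for the ENT_XML1 flag.

```php
<?php
// Hypothetical helper: builds a bare-bones RSS skeleton as a UTF-8 string.
// A real feed needs more elements (link, description, items).
function build_feed_xml(string $title): string
{
    // Escape XML special characters, treating the input as UTF-8.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // The encoding in the XML declaration has to match the bytes we send.
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<rss version="2.0"><channel><title>' . $safeTitle
        . '</title></channel></rss>';
}

// Hypothetical helper: sends the feed with a matching HTTP header.
function send_feed(string $title): void
{
    header('Content-Type: application/rss+xml; charset=utf-8');
    echo build_feed_xml($title);
}
```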
PHP and Unicode
If you think you're done now, think again. The hardest part is still to come. Well, it depends, you may be lucky. If all you're doing is taking UTF-8 user input, storing and displaying it without any processing, you're done. It is more likely though that your application will at least have some basic string processing. The bad news is that PHP currently is not aware of Unicode strings, at least not until the planned version 6.
What this means in practice is that some of the PHP functions you often rely on for string processing will not work correctly for Unicode strings. For example, the strlen() function which you use to count the characters of a string does not work as expected. strlen() assumes one character equals one byte, which is not the case for UTF-8 encoded characters, which take 1-4 bytes per character. Another troublesome function is substr(), and there are others.
For many of these unsafe operations, there is an easy fix. The mbstring extension (multi-byte string) provides multi-byte equivalents of the regular string functions. Note that this extension is not enabled by default in PHP 5; PHP has to be built with it (--enable-mbstring). Many systems, such as Debian, have it compiled in by default. To check if mbstring is available on your installation, look for it in the output of phpinfo().
Once you have the mbstring extension available, you simply replace your unsafe string function call with mb_<function>. For example, strlen() becomes mb_strlen(). You can even provide an elegant fail-over:
$length = function_exists('mb_strlen') ? mb_strlen($str) : strlen($str);
It's not as elegant as it looks though. If mbstring is not available, the code above will resolve to the classic strlen() function, which will still fail for Unicode strings. Note that it is also possible to override the classic functions (using the mbstring.func_overload setting, for example in an .htaccess file, or an override function) so that if you call strlen(), it will implicitly call mb_strlen(). You can consider doing this in the scenario of a huge existing code base with a lot of string processing. Personally, I try to avoid such things as it decreases the readability of the code.
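To see the difference between the byte-based and the multi-byte functions, here is a small sketch. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the sample string is my own.

```php
<?php
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.
$str = 'Привет, мир'; // 11 characters, 20 bytes in UTF-8

var_dump(strlen($str));                   // int(20): bytes, not characters
var_dump(mb_strlen($str, 'UTF-8'));       // int(11): characters

// substr() cuts at byte offsets and can slice a character in half...
var_dump(bin2hex(substr($str, 0, 5)));    // two letters plus half of the third
// ...while mb_substr() cuts at character offsets.
var_dump(mb_substr($str, 0, 6, 'UTF-8')); // string(12) "Привет"
```

Passing the encoding explicitly on each call is a choice I made for this sketch: it keeps the result independent of the server's configured internal encoding. Setting it once with mb_internal_encoding('UTF-8') is the usual alternative.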
The mbstring functions cover the most basic problems. There are other possible issues that go too far to explain in this article, yet I will briefly mention them:
- If there is no mbstring equivalent of your function, or if you do not have mbstring available, regular expressions (the PCRE functions with the /u modifier) are the answer.
- On top of providing UTF-8 safe string operations, also be aware that you need to check if user-supplied Unicode strings are correct (well-formed).
- Be aware that Unicode allows multiple separate code points to be combined into a single visible character, so there is no guaranteed one-to-one relationship between a code point and a character. This makes fail-safe string processing even more complex. In some cases, mbstring is still not good enough!
- Sending out UTF-8 encoded emails has its own challenges.
- To make matters worse, there are also OS differences. This article assumes a Linux environment.
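The first three points can be illustrated with PHP's PCRE functions alone. The following is a sketch, not a complete validation layer, and all three helper names are hypothetical. It relies on the /u modifier, which makes the preg functions treat pattern and subject as UTF-8 and fail on malformed input, and on the \X escape, which matches a base character together with its combining marks.

```php
<?php
// Hypothetical helpers built only on PCRE, for installations without mbstring.

// True when $str is well-formed UTF-8: in /u mode preg_match() returns
// false instead of 1 when the subject contains malformed sequences.
function is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

// Counts code points: "." matches one whole character in /u mode,
// and the s modifier lets it match newlines too.
function utf8_length(string $str): int
{
    $count = preg_match_all('/./us', $str);
    if ($count === false) {
        throw new InvalidArgumentException('Malformed UTF-8');
    }
    return $count;
}

// Counts user-perceived characters: \X matches a base character plus any
// combining marks that follow it.
function grapheme_length(string $str): int
{
    $count = preg_match_all('/\X/u', $str);
    if ($count === false) {
        throw new InvalidArgumentException('Malformed UTF-8');
    }
    return $count;
}
```

For example, the string "e\u{0301}" (an e followed by a combining acute accent, PHP 7+ syntax) displays as a single character, yet utf8_length() reports 2 code points while grapheme_length() reports 1.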
Putting it to the test
We've now covered all aspects to support UTF-8 in a LAMP application. It is time to put it to the test. I'll demonstrate this using a project I'm working on.
First, I head over to this Unicode character map site and produce some wacky Unicode string. Be sure to pick some non-latin characters for a good test:
This string contains Cyrillic and Arabic characters combined. On a funny side note, I had trouble selecting it in order to copy it. The text direction of Arabic is right-to-left, so you also have to select text that way :)
Let's inject this string into a form. For this example I'm using a comment form:
It's displaying fine. Let's hit "Save comment". Time to check the raw data using a MySQL administration tool, PHPMyAdmin in my case:
A bit hard to see with the naked eye, but the highlighted line shows our comment correctly stored. Finally, let us display the data on a web page:
The second column of the row shows the Unicode displaying just fine, which means we're done!
Enabling Unicode support for your application can be rewarding, yet it is a task that is easily underestimated. Luckily, basic support is not too hard to comprehend and achieve once you know what to do. I hope this article helps you in understanding what to do in your situation. Keep in mind the following advice:
- Enable Unicode support from the start, it makes your life so much easier.
- Explicitly enforce UTF-8 encoding in all parts of the system. Do not rely on defaults, server-specific configurations or an educated guess by the browser.
- Beware of the dragons. Even with all of the steps described above, you may still run into situation-specific edge cases that make your life miserable. For these, I invite you to do your own research; there is quite a lot of help out there.