Andre's Blog
Perfection is when there is nothing left to take away
UNICODE vs. Developer

Unicode is a character encoding standard that represents characters from most human languages, as well as special symbols, such as mathematical or musical notation. In the last few years, more and more companies have turned to Unicode to adapt their software to languages other than the original language of the application. However, despite the seemingly simple concept, Unicode has proved a tough nut to crack for many companies.

Somehow, developers cannot grasp the concept of multiple character encodings being used in a single application and continuously mix them up, producing gibberish in the output. Unfortunately, some of Microsoft's approaches to dealing with Unicode in C++ lead to even more confusion.

How big is it, anyway?

So, how many bytes are in a Unicode character? The simple answer is one to four, depending on the value of the character and the Unicode format. That is, the same ASCII symbol may be represented as one byte in UTF-8, as two bytes in UTF-16 or as four bytes in UTF-32. For example, the character 'A' is represented as 41 in UTF-8, 0041 in UTF-16 and 00000041 in UTF-32. At the same time, the character '£' is represented as the sequence C2 A3 in UTF-8, as 00A3 in UTF-16 and as 000000A3 in UTF-32.
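For illustration, here is a minimal sketch (assuming a C++11 compiler with the u8/u/U string literal prefixes) showing the bytes behind each form:

#include <cstdio>

int main()
{
    const char     utf8[]  = u8"\u00A3";  // '£' as UTF-8:  C2 A3 (two bytes)
    const char16_t utf16[] = u"\u00A3";   // '£' as UTF-16: 00A3 (one 16-bit unit)
    const char32_t utf32[] = U"\u00A3";   // '£' as UTF-32: 000000A3 (one 32-bit unit)

    // Print the UTF-8 bytes: C2 A3
    for (const char *p = utf8; *p; ++p)
        printf("%02X ", (unsigned char)*p);

    return 0;
}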

But why!!?

At first glance, this seems like a lot of bytes and ways to represent a single character. Does it even make any sense? Yes, it does. Unicode formats make a lot of sense when it comes to dealing with strings of characters in code. Let's consider two programs: one that parses the text "£123.56" and another that sends the same text over HTTP.

In the first case, it is important that all characters are of the same width, so that the code can work with this text by counting characters, not bytes. Given the character values 00A3 0031 0032 0033 002E 0035 0036 in UTF-16, the parser output would indicate that the currency symbol is at offset zero, the number begins at offset one and the decimal point is at offset five. Characters and sequences of characters can be evaluated and accessed at these offsets in a fixed amount of time, as opposed to having to determine whether the next byte starts a new character or is a character on its own.
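A minimal sketch of the fixed-width case, with the UTF-16 values held in wide characters:

#include <wchar.h>

const wchar_t *text = L"\u00A3123.56";      // £123.56

wchar_t currency = text[0];                 // '£' at offset zero
const wchar_t *number = text + 1;           // the number begins at offset one
const wchar_t *point = wcschr(text, L'.');  // the decimal point...
int offset = (int)(point - text);           // ...found at offset five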

When the same text is sent over HTTP, on the other hand, it is important to minimize the amount of data transferred over the wire. Since the code dealing with the transfer doesn't need to evaluate individual characters, but only needs to know the byte size of the text being transferred, the byte sequence C2 A3 31 32 33 2E 35 36 in UTF-8 suits the purpose much better than any other format.

Conversion chains

It is quite common for an application to work with more than one form of character encoding throughout its life cycle. For example, an HTML browser can read HTML content in a variety of encodings served by web servers (e.g. UTF-8, ISO-8859-1, Shift-JIS, etc.). Once in memory, this content is converted to a form more appropriate for HTML parsing, such as UTF-16, and then rendered on the screen. If the HTML page contains a form that is submitted back to the server, the browser takes the form fields, encodes them according to the original encoding of the page (e.g. UTF-8) and sends them to the server.

Many developers mistake the character set of a source file stored on disk for the character set used within the application. For example, a JavaScript source file may be stored in UTF-8, but when the browser retrieves the file from a web server, it will convert the UTF-8 characters to UCS-2 and only then execute the script. Similarly, if a UTF-8 page is rendered in the browser and one selects some text and copies it to the clipboard, it does not mean that the clipboard contains UTF-8 characters.

Consequently, it is critically important to know which character set you are working with before doing any text-related work.

Narrow and Wide

Each programming language defines one or more native character types. These types are used to build more complex objects for working with text.

There are two character types defined by C++: char and wchar_t. The former is a one-byte character, also known as a narrow character, and the latter is a wide character, whose size is implementation-defined (two bytes on Windows). C++ also requires sequences of characters to be terminated with a special null symbol indicating the end of the sequence, or string. Beyond requiring that each of these types support most of the ASCII symbols, called the basic source character set, the C++ standard doesn't place any additional restrictions on the content of char and wchar_t variables.

In practice, wchar_t is mostly used to represent characters from the Universal Character Set (UCS-2) defined by the ISO 10646 standard. Characters of this set map exactly to UTF-16 code units, with the exception of so-called surrogate pairs, which are not used in most applications.

Characters of type char can store any byte-sized character, or part of a character, which means that a narrow character string must either be assumed to be in a certain character set or be accompanied by character set information. For example, the two-byte sequence C2 A3 may represent the single character '£' if interpreted as a UTF-8 sequence, or the two characters 'Â£' if interpreted as characters from the Western alphabet (ISO-8859-1).
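To make the ambiguity concrete, a short sketch (the narrow bytes here are the same UTF-8 bytes shown earlier):

const char narrow[] = "\xC2\xA3";  // two bytes: '£' in UTF-8, 'Â£' in ISO-8859-1
const wchar_t wide[] = L"\u00A3";  // one wide character, unambiguously '£'

// Nothing in the char type records which interpretation is correct;
// the character set has to be tracked alongside the string itself.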

#ifdef UNICODE

In the early days of national character support, Microsoft came up with an approach that was supposed to make building Unicode applications almost transparent to the developer.

The idea behind this approach was to define the UNICODE symbol for a project and keep the source blissfully unaware of whether it is working with narrow or wide characters. Microsoft went to great lengths to implement this silly idea: a generic string function was created for each pair of narrow and wide string functions, and many SDK functions were modified to take TCHAR instead of char, with a different name for each character type. For example, instead of calling strcpy, the code was supposed to call _tcscpy, which would map to strcpy when UNICODE was not defined and to wcscpy if it was.
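A minimal sketch of what this looks like in practice (note that the C runtime mappings in tchar.h actually key off the related _UNICODE symbol, while the SDK macros key off UNICODE):

#include <windows.h>
#include <tchar.h>

int main()
{
    TCHAR buffer[32];                            // char or wchar_t
    _tcscpy(buffer, _T("hello"));                // strcpy or wcscpy
    MessageBox(NULL, buffer, _T("demo"), MB_OK); // MessageBoxA or MessageBoxW
    return 0;
}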

However, the UNICODE symbol trick works only for projects with no third-party components, a single character set (e.g. the Western alphabet) and when all projects in a solution are compiled the same way. If some projects in a solution are built with UNICODE defined and some without, the result is usually linker errors, because a header file containing TCHARs that is included in projects compiled both ways will produce mismatched function signatures.

Another problem with this approach is that while wide characters can be assumed to be UCS-2 characters, narrow characters may come from more than one character set and have to be accompanied by a character set identifier, which is often omitted or simply overlooked, making this approach very error-prone.

A better way to structure a project is to use explicit character types and functions: char, std::string and the ANSI versions of Windows SDK functions (e.g. LoadLibraryA) when you want to work with narrow characters, and wchar_t, std::wstring and the wide versions of SDK functions (e.g. LoadLibraryW) when you want to work with wide characters. A set of conversion functions based on MultiByteToWideChar and WideCharToMultiByte, or similar functions, should be used to convert characters from one character set to another in a uniform manner.
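Such a conversion helper might look like this (a sketch along the lines described above; the helper name is illustrative):

#include <windows.h>
#include <string>

// Convert a narrow string in the given code page to a wide (UCS-2) string
std::wstring to_wide(const std::string &in, UINT codepage)
{
    if (in.empty())
        return std::wstring();

    // First call computes the required buffer size in wide characters
    int n = MultiByteToWideChar(codepage, 0, in.c_str(), (int)in.size(), NULL, 0);

    std::wstring out(n, L'\0');
    MultiByteToWideChar(codepage, 0, in.c_str(), (int)in.size(), &out[0], n);
    return out;
}

// Usage: to_wide(text, CP_UTF8) for UTF-8, to_wide(text, 28591) for ISO-8859-1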

Implicit character conversions

The character type of COM interfaces is BSTR, a wide-character string allocated using functions such as SysAllocString. _bstr_t is a convenience class that manages these strings in an exception-safe way. However, this convenience sometimes comes at a price. Many developers do not realize that _bstr_t will implicitly convert wide characters to the default character set of the machine running the code, which often results in lost data:

_bstr_t ws(L"abc\u0394def");   // abcΔdef
const char *cp = ws;           // abc?def

The character 'Δ' wouldn't be lost if the proper target character set (ISO-8859-7) had been used for the conversion.
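For example, converting explicitly with WideCharToMultiByte and the ISO-8859-7 code page (28597) preserves the character (a sketch; error handling omitted):

char buffer[16] = {0};
WideCharToMultiByte(28597, 0, L"abc\u0394def", -1,
                    buffer, sizeof(buffer), NULL, NULL);  // "abcΔdef"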

Server-side scripting

Server-side scripts, such as ASP, are the first interface in the text-processing chain of many web applications and need to be made aware of the character set they expect as input and the character set of the output they generate. For example, if an ASP application is supposed to accept UTF-8 input, every ASP page should have these lines at the very top (with the syntax adjusted to the actual language used):

<%@ CodePage=65001 language="JScript" %> 
<%  
Response.CodePage = 65001;
Response.CharSet = "utf-8"; 
%>

This will not only instruct ASP to process input as UTF-8, but will also tag all outgoing HTML as UTF-8.

Note, however, that even though the ASP code page is set to UTF-8, the code on the page, be it JScript or VBScript, will still be using wide characters; ASP will convert the UTF-8 input to UCS-2 before executing the first script statement.

Database

The database is another piece of the Unicode puzzle that often gets misplaced. When storing Unicode data in a database, make sure to pick a Unicode-compatible storage type. For example, when creating a table in a SQL Server database, use nvarchar or ntext instead of varchar or text:

create table t1 (
      ip_address varchar(32),                  -- default (Latin-1)
      comment nvarchar(255)                    -- Unicode (UCS-2)
);

When creating a MySQL table, specify the character set for those columns that may contain Unicode characters:

create table t1 (
      ip_address varchar(32),                   -- default (Latin-1)
      comment varchar(255) character set utf8   -- Unicode (UTF-8)
);

When communicating with the database, it is important to use the correct data types for Unicode columns. For example, if you insert a Unicode string bound as an adVarChar parameter in ASP, the Unicode characters in this parameter will be lost. Instead, use Unicode-compatible data types, such as adVarWChar or adLongVarWChar.

When passing Unicode as a part of SQL, make sure to prepend the string with an N to ensure that it's treated as a Unicode string:

select * from t1 where comment like N'%Ǻ%'

When retrieving data, make sure to bind record fields to a Unicode-compatible data type.

XML

Many developers see XML files as plain text with a bunch of angle brackets, and when they need to create one, they use something along these lines:

out << "<root>£</root>";

However, this tiny XML document is broken and will fail to parse. The problem is that in the absence of an explicit encoding declaration, XML files are treated as if they were encoded in UTF-8, but the code above will output '£' using the default machine character set, which is usually Latin-1 or Windows-1252 for Western versions of Windows. The '£' character in Latin-1 is represented as a single byte with the value A3, which is not a valid UTF-8 sequence. Adding the character set to the generated XML fixes the problem:

out << "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
       "<root>£</root>";

Alternatively, the original string may be converted to UTF-8 before writing it into the output stream.
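For example (a sketch; the C2 A3 bytes are the UTF-8 form of '£' shown earlier):

out << "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
       "<root>\xC2\xA3</root>";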

URLs

URLs may contain only certain ASCII symbols, which means that Unicode characters have to be escaped using multiple %HH sequences, where HH is the hexadecimal representation of one of the bytes a Unicode character consists of. The problem is that URL syntax [RFC 2396] does not specify the character set of the encoded characters it contains, and if a URL is considered outside of the context in which it was generated, there is no reliable way to recreate the original characters.

For example, the sequence %C3%B5 may represent the two characters 'Ãµ' if interpreted as Latin-1, or the single character 'õ' if interpreted as a UTF-8 sequence.
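A short sketch of the escaping side (the function name is illustrative; note that the routine escapes whatever bytes it is given and records nothing about their character set):

#include <cstdio>
#include <cstring>
#include <string>

std::string url_escape(const std::string &in)
{
    static const char unreserved[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";

    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        unsigned char c = (unsigned char)in[i];
        if (c != '\0' && strchr(unreserved, (char)c)) {
            out += (char)c;             // safe ASCII passes through as-is
        }
        else {
            char hex[4];
            sprintf(hex, "%%%02X", c);  // everything else becomes %HH
            out += hex;
        }
    }
    return out;
}

// url_escape("\xC3\xB5") yields "%C3%B5" whether the caller meant
// UTF-8 'õ' or Latin-1 'Ãµ' -- the ambiguity survives the escaping.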

Bottom Line

Always know your character encoding at every level of your application, and always tag your output if it supports tagging. For example, when generating HTML, add the charset parameter to the document's Content-Type, which may be done either through HTTP headers or as an http-equiv meta tag in the generated HTML.

Avoid implicit conversions, which often default to the character set of the machine running the application, which in a web context often has little to do with the character set the application is expected to handle.

Use explicit data types and functions, such as wchar_t and wcslen, when working with wide characters. Remember, though, that memory consumption for text will double when you switch to wide-character strings.

Useful Links

A few links you may find useful when dealing with Unicode issues.

Unicode JavaScript Charts

UTF Converter

UTF-8/16/32 FAQ

Unicode FAQ
