Unicode, UTF-8 and Devanagari

What is Unicode and why do we need it?

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters used in the English language (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns a unique number, called a code point, to each character. Its code space runs from U+0000 to U+10FFFF (just over 1.1 million code points, which fit in 21 bits), enough to give a unique number to each character in all languages known to man.
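
To make the idea of "a number for each character" concrete, here is a minimal Python sketch. The sample characters are my own choice; `ord()` and `chr()` are Python built-ins that convert between a character and its code point.

```python
# Every character maps to a number (its code point).
# ord() gives the number for a character, chr() goes the other way.
samples = ["A", "€", "\u0905"]  # "\u0905" is DEVANAGARI LETTER A

code_points = {ch: ord(ch) for ch in samples}
for ch, cp in code_points.items():
    print(f"{ch!r} -> U+{cp:04X} (decimal {cp})")

# Only code points below 128 fit into 7-bit ASCII.
fits_in_ascii = {ch: cp < 128 for ch, cp in code_points.items()}
print(fits_in_ascii)
```

Only "A" fits into ASCII; the other two characters need a larger code space, which is what Unicode provides.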

Actually, there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets were not exactly what the world needed. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be stored or transmitted in a byte, word or double word oriented format (i.e. in units of 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms as defined by the Unicode Consortium are:

UTF-8
"UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites."
UTF-16
"UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units."
UTF-32
"UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32."

By the way, UTF stands for UCS Transformation Format.
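
The three encoding forms can be compared side by side with a small Python sketch. The sample character is my own choice, and I use the big-endian codec names (`utf-16-be`, `utf-32-be`) only so that Python does not prepend a byte order mark to the output.

```python
# Encode the same character in all three encoding forms and compare sizes.
text = "\u0905"  # DEVANAGARI LETTER A

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # explicit byte order, so no BOM is added
utf32 = text.encode("utf-32-be")

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(f"{name:6} {len(data)} bytes: {data.hex(' ')}")

# All three forms decode back to the same character without loss.
round_trips = (
    utf8.decode("utf-8") == text
    and utf16.decode("utf-16-be") == text
    and utf32.decode("utf-32-be") == text
)
print(round_trips)
```

For this character UTF-8 needs three bytes, UTF-16 two and UTF-32 four, yet each decodes back to the same code point.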

What is UTF-8?

UTF-8 has the benefit that the ASCII characters are still represented as a single byte "providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values". Any document created using the ASCII encoding is a valid UTF-8 document.
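
The ASCII compatibility is easy to check for yourself. A small sketch in Python, with a sample string of my own choosing:

```python
# A pure ASCII string produces exactly the same bytes in ASCII and in UTF-8.
message = "Hello, world"

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Bytes written by an ASCII-only program can be read back as UTF-8 unchanged.
decoded = ascii_bytes.decode("utf-8")

print(same_bytes, decoded == message)
```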

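Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, but the current definition (RFC 3629) limits UTF-8 to the Unicode range U+0000 to U+10FFFF, so a character takes at most 4 bytes; the most commonly used characters need no more than three. Non-ASCII characters are encoded as follows: the leading byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0 bit (110xxxxx, 1110xxxx or 11110xxx), and every following byte has the form 10xxxxxx. The x bits carry the code point, most significant bits first.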
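
The bit layout of UTF-8 can be sketched by hand in a few lines of Python. This is only an illustration, limited to the modern 1 to 4 byte forms (code points up to U+10FFFF); `utf8_encode` is a name I made up, and in real programs you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-rolled encoder with Python's built-in codec.
for cp in (0x41, 0xE9, 0x0905, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "),
          utf8_encode(cp) == chr(cp).encode("utf-8"))
```

The Devanagari letter U+0905 falls in the three-byte case, which is why Hindi text in UTF-8 takes three bytes per character.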

A little trivia: the encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history).

Unicode and the Web

Unicode is used to design multilingual web pages. The preferred encoding form for Unicode characters on the web is UTF-8. To view web pages created using Unicode, the user requires a browser that supports Unicode (most new browsers do) and fonts for rendering the characters.

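The browser has to be informed about the encoding form (UTF-8, UTF-16, UTF-32) used for the page. This allows the browser to display the page correctly without any intervention from the user. There are two ways of doing this: the web server can send the charset parameter in the HTTP Content-Type header, or the page itself can declare its encoding in a meta element inside its head section.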
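
One widely used way of declaring the encoding is the charset parameter of the HTTP Content-Type header. As a rough sketch of how a client could read it, the snippet below uses Python's standard `email.message.Message` class to parse a header value; the helper name `charset_from_content_type` and the sample header values are my own.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset, the client has to fall back on a declaration inside the page or on guessing.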

Unicode and Devanagari

In Unicode, Devanagari occupies the range from U+0900 to U+097F. To see how the various characters are encoded, see this PDF document. Details about Devanagari and other South Asian scripts can be found here. Before you can view Devanagari documents, you will need to install one of the many freely available Unicode fonts. See the next section for a detailed description of how to install fonts in Linux. You can check if your browser is correctly displaying Devanagari by visiting the BBC's Hindi website or Alan Wood's Devanagari Test Page.
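
The range 0900 to 097F can be checked programmatically. A minimal Python sketch follows; `is_devanagari` is a helper name I made up, while `unicodedata.name()` comes from the standard library. The sample word is written with escapes so that it survives any editor.

```python
import unicodedata

DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block, 0900 to 097F."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


# The word "Hindi" written in Devanagari, one escape per code point.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_devanagari(ch))
```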

Installing Devanagari fonts in Linux

Fonts come in a variety of formats. Older versions of Linux supported PCF (".pcf"), BDF (".bdf") and SNF (".snf") font files. Newer versions also support TrueType fonts (".ttf"). Given below is the procedure for obtaining and installing the Sibal Devanagari BDF font, produced by the Computing Research Labs, New Mexico State University. I am assuming that you don't have privileges to modify the XF86Config file or copy the font file to the system-wide font directories.

The procedure for installing TrueType fonts is slightly different. You will find the required information by visiting the links listed at the end of this page. If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). For details, see the next section.

The font is now installed and ready to be used by applications. Start Mozilla or any other browser that supports Unicode and visit any Hindi web site. Try the Hindi version of BBC or Alan Wood's Devanagari Test Page. I haven't been able to get the fonts working perfectly; there are some problems with rendering. It is not a problem with the font. I am working on them :-). If you manage to get them working, do let me know.

You will need to rerun the two xset commands every time you restart your X session. To avoid doing this, you can put the two commands into the .Xclients file (or possibly your .xinitrc or .xsession file, depending on how you start X) in your home directory. This will cause them to be run automatically every time you start X. Another way to have the commands run automatically is to edit XF86Config, but as I said earlier, you may not have the privileges to tamper with it.

NOTE: You must take care to ensure that the fonts you use are encoded for Unicode. There are many fonts available that do not use the Unicode encoding. For example, many Devanagari fonts (including fonts from CDAC) use the ISFOC encoding. These fonts cannot be used for viewing multilingual web pages that use Unicode. If the fonts.dir file contains "iso8859" then the fonts are not Unicode. If the fonts.dir file contains "iso10646" then the fonts are Unicode. You can also view the ".bdf" files in any ASCII viewer to verify that the ranges used for characters conform to the Unicode standard. Lines starting with STARTCHAR denote the code point for the character defined next. If you see lines such as "STARTCHAR U+0901" then you know that the font is a Unicode font. On the other hand, lines such as "STARTCHAR 0021" mean that the font is not Unicode.
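
The fonts.dir rule of thumb above can be automated with a few lines of Python. This is only a sketch of that heuristic: `fonts_dir_is_unicode` is a name I made up, and the two sample fonts.dir contents (file names and font descriptions) are invented for illustration, not taken from a real font package.

```python
def fonts_dir_is_unicode(fonts_dir_text: str) -> bool:
    """Heuristic from the note above: look for "iso10646" in fonts.dir."""
    # The first line of fonts.dir is only the number of entries, so skip it.
    entries = fonts_dir_text.splitlines()[1:]
    return any("iso10646" in entry.lower() for entry in entries)


# Hypothetical fonts.dir contents, made up for this example.
unicode_sample = (
    "1\n"
    "example16.bdf -misc-example-medium-r-normal--16-160-72-72-c-80-iso10646-1\n"
)
latin_sample = (
    "1\n"
    "example12.bdf -misc-example-medium-r-normal--12-120-75-75-c-60-iso8859-1\n"
)

print(fonts_dir_is_unicode(unicode_sample))
print(fonts_dir_is_unicode(latin_sample))
```

To use it on a real directory, read the file first, for example with `open("fonts.dir").read()`, and pass the text to the function.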

Converting TrueType fonts to BDF

If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). The ttf2bdf utility is freely available and can be used for this purpose. Read the man page for details on how to use the utility.

Installing and configuring "Yudit", the Unicode Text Editor

To write multilingual documents, we require a Unicode Text Editor. A very versatile Unicode Text Editor is "Yudit". Given below are the steps required to install Yudit and configure it for Devanagari:

Have Fun :-)

Further References

Hopefully, you have learnt enough, but just in case you are thirsty for more, visit the links given below:

