Unicode, UTF-8 and Devanagari

What is Unicode and why do we need it?
Various Encoding Forms
What is UTF-8?
Unicode and the Web
Unicode and Devanagari
Installing Devanagari fonts in Linux
Converting TrueType fonts to BDF
Installing and configuring "Yudit", the Unicode Text Editor
Further References

What is Unicode and why do we need it?

Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, which is a 7-bit encoding of the characters in the English language (it can store at most 128 characters). With the passage of time, the need was felt for a single encoding that could contain enough characters to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a standard accepted universally. That standard is Unicode. Unicode gives a unique number, called a code point, to each character. Code points run from U+0000 to U+10FFFF, which leaves room for over a million characters, enough for all languages known to man.

Actually, there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.

Various Encoding Forms

Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be stored or transmitted in a byte, word or double word oriented format (i.e. in units of 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms as defined by the Unicode Consortium are:

UTF-8: "UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites."
UTF-16: "UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units."
UTF-32: "UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32."

By the way, UTF stands for UCS Transformation Format.
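
To make the difference between the three forms concrete, here is a minimal Python sketch that measures how many bytes one ASCII letter and one Devanagari letter occupy in each form. The helper name encoded_sizes is made up for this illustration; the codec names are the ones Python's standard library provides.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    # The "-be" (big-endian) codec names avoid the byte order mark that the
    # plain "utf-16" and "utf-32" codecs would prepend to the output.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

# U+0041 LATIN CAPITAL LETTER A and U+0905 DEVANAGARI LETTER A
for ch in ["A", "\u0905"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The ASCII letter needs one byte in UTF-8, while the Devanagari letter needs three; in UTF-16 both fit in a single 16-bit code unit, and in UTF-32 every character takes four bytes.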

What is UTF-8?

UTF-8 has the benefit that the ASCII characters are still represented as a single byte "providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values". Any document created using the ASCII encoding is a valid UTF-8 document.
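
The compatibility claim can be checked with a few lines of Python. This is only a sketch; the helper is_valid_utf8 and the sample text are invented for the example.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_document = "Hello, world!\n".encode("ascii")

# Every ASCII byte is below 0x80, and UTF-8 keeps those byte values unchanged,
# so an ASCII file reads the same under either encoding.
print(is_valid_utf8(ascii_document))
print(ascii_document.decode("utf-8") == ascii_document.decode("ascii"))

# The reverse does not hold: a lone byte with the high bit set is neither
# ASCII nor a complete UTF-8 sequence.
print(is_valid_utf8(b"\xe0"))
```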

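Non-ASCII characters are encoded using a variable length scheme and take from 2 to 4 bytes each. (The original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to the Unicode range U+0000 to U+10FFFF, which never needs more than 4.) The most commonly used characters, including all of Devanagari, need at most three bytes. Non-ASCII characters are encoded as follows: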

Non-ASCII characters are encoded as a sequence of several bytes, each of which has the most significant bit set. This means that all bytes representing non-ASCII characters are invalid under ASCII encoding (since all ASCII characters stored in bytes have their most significant bit not set). This allows the application to differentiate between ASCII and non-ASCII characters. Bytes representing non-ASCII characters will never be mistaken for ASCII characters.
The first byte of a multibyte sequence, which represents a non-ASCII character, indicates how many bytes the sequence contains. The remaining bytes are continuation bytes; together with the spare bits of the first byte, they carry the bits of the actual code point.
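
The two rules above can be sketched in Python as a hand-written encoder. This is an illustration rather than production code: the function names utf8_encode and sequence_length are made up here, the encoder follows the current definition of UTF-8 (at most four bytes per character), and it does not reject the surrogate range the way a real codec would.

```python
def utf8_encode(code_point):
    """Encode a single Unicode code point as UTF-8 bytes, by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x80:
        # ASCII: one byte, most significant bit clear.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(first_byte):
    """Read the total length of a sequence from its first byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    # Bytes of the form 10xxxxxx are continuation bytes, never first bytes.
    raise ValueError("not a valid first byte")


encoded = utf8_encode(0x0905)          # DEVANAGARI LETTER A
print(encoded.hex(" "))                # e0 a4 85
print(sequence_length(encoded[0]))     # 3
```

Note that every byte of the encoded Devanagari letter has its most significant bit set, so none of them can be mistaken for an ASCII character.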

A little trivia: the encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history).

Unicode and the Web

Unicode is used to design multilingual web pages. The preferred encoding form for Unicode characters on the web is UTF-8. To view web pages created using Unicode, the user requires a browser that supports Unicode (most new browsers do) and fonts for rendering the characters.

The browser has to be informed about the encoding form (UTF-8, UTF-16, UTF-32) used for the page. This allows the browser to display the page correctly without any intervention from the user. There are two ways of doing this:

Making sure that the HTTP header of a document contains the line
Content-Type: text/html; charset=utf-8
for HTML files, or
Content-Type: text/plain; charset=utf-8
for TEXT files.
The header that is sent along with the document is determined by the web server. Sometimes it may not be possible to influence the HTTP headers that the web server adds automatically. In that case, the second method may be used.
In an HTML document, add the following line inside the HEAD element
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
This obviously works only for HTML files, not for plain text. It also announces the encoding of the file to the parser only after the parser has already started to read the file, so it is clearly the less elegant approach.
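
As a rough illustration of what a client does with the first method, the Python sketch below pulls the charset out of a Content-Type header value. It leans on the header parser from the standard email package, which understands this header syntax; the helper name declared_charset is invented for the example, and real browsers apply further fallback rules that are not shown here.

```python
from email.message import Message

def declared_charset(content_type_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()

print(declared_charset("text/html; charset=utf-8"))
print(declared_charset("text/plain; charset=UTF-8"))
print(declared_charset("text/html"))
```

When no charset is declared, as in the last call, the client has to guess, which is exactly the situation the two methods above are meant to avoid.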

Unicode and Devanagari

In Unicode, Devanagari occupies the range from U+0900 to U+097F. To see how the various characters are encoded, see this pdf document. Details about Devanagari and other South Asian scripts can be found here. Before you can view Devanagari documents, you will need to install one of the many freely available Unicode fonts. See the next section for a detailed description of how to install fonts in Linux. You can check if your browser is correctly displaying Devanagari by visiting the BBC's Hindi website or Alan Wood's Devanagari Test Page.
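
For a quick look at the block without opening the chart, a short Python sketch can list a few code points and test whether a character falls inside the range given above. The helper name is_devanagari is made up for the example; the character names come from Python's bundled Unicode database.

```python
import unicodedata

DEVANAGARI_FIRST, DEVANAGARI_LAST = 0x0900, 0x097F

def is_devanagari(ch):
    """Return True if ch lies in the Devanagari block U+0900 to U+097F."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST

# Print the first few independent vowels with their official names.
for cp in range(0x0905, 0x090B):
    ch = chr(cp)
    print(f"U+{cp:04X}", ch, unicodedata.name(ch, "<unassigned>"))

print(all(is_devanagari(ch) for ch in "हिन्दी"))
```

If your terminal shows boxes or question marks instead of the letters, the terminal font lacks Devanagari glyphs; the code points themselves are still correct.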

Installing Devanagari fonts in Linux

Fonts come in a variety of formats. Older versions of Linux supported PCF (".pcf"), BDF (".bdf") and SNF (".snf") font files. Newer versions also have support for TrueType fonts (".ttf"). Given below is the procedure for obtaining and installing the Sibal Devanagari BDF font, produced by the Computing Research Labs, New Mexico State University. I am assuming that you don't have privileges to modify the XF86Config file or copy the font file to the system-wide font directories.

The procedure for installing TrueType fonts is slightly different. You will find the required information by visiting the links listed at the end of this page. If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). For details, see the section on converting TrueType fonts to BDF.

Download the Sibal Devanagari font from this website.
Make a directory, say sibal, in your home directory and copy the font file into it. The font file should have the extension ".bdf".
```
   $ mkdir ~/sibal
```
Run the mkfontdir command to create an index of the font file.
```
   $ mkfontdir ~/sibal
```
This will create a fonts.dir file in the font directory. This file contains an entry for each font in the directory. You can view this text file to see that the entries have been added. If the file lists no fonts, the mkfontdir command did not work.
Add the directory to the current fontpath.
```
   $ xset fp+ ~/sibal
```
If xset returns an error, check that you have given appropriate permissions so that the font directory, ~/sibal, can be accessed by other users. You will need to give read and execute permissions on all directories lying on the path (including your home directory). If this doesn't solve your problem, check that the fonts.dir file was created correctly.
Force the server to re-scan for available fonts.
```
   $ xset fp rehash
```
To verify that the font has been correctly installed, you can run the following command:
```
   $ xlsfonts | grep sibal
```
You should see output similar to the following:
-sibal-devanagari-medium-r-normal--0-0-75-75-p-0-iso10646-1
-sibal-devanagari-medium-r-normal--18-180-75-75-p-100-iso10646-1

The font is now installed and ready to be used by applications. Start Mozilla or any other browser that supports Unicode and visit any Hindi web site. Try the Hindi version of the BBC or Alan Wood's Devanagari Test Page. I haven't been able to get the fonts working perfectly; there are some problems with rendering, although the font itself is not at fault. I am working on them :-). If you manage to get them working, do let me know.

You will need to rerun the two xset commands every time you restart your X session. To avoid doing this, you can put the two commands into the .Xclients file (or possibly your .xinitrc or .xsession file, depending on how you start X) in your home directory. This will cause them to be run automatically every time you start X. Another way to have the commands run automatically is to edit XF86Config, but as I said earlier, you may not have the privileges to tamper with it.

NOTE: You must take care to ensure that the fonts you use are encoded for Unicode. There are many fonts available that do not use the Unicode encoding. For example, many Devanagari fonts (including fonts from CDAC) use the ISFOC encoding. These fonts cannot be used for viewing multilingual web pages that use Unicode. If the fonts.dir file contains "iso8859" then the fonts are not Unicode. If the fonts.dir file contains "iso10646" then the fonts are Unicode. You can also view the ".bdf" files in any ASCII viewer to verify that the ranges used for characters conform to the Unicode standard. Lines starting with STARTCHAR denote the code point for the character defined next. If you see lines such as "STARTCHAR U+0901" then you know that the font is a Unicode font. On the other hand, lines such as "STARTCHAR 0021" mean that the font is not Unicode.
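
The fonts.dir check described in the note can be automated with a small Python sketch. It assumes the usual layout of a fonts.dir file (a count on the first line, then one "file name, font name" pair per line) and applies the same rule of thumb as the note. The function names and the two sample entries, including the file names sibal.bdf and legacy.pcf, are made up for illustration.

```python
def font_registries(fonts_dir_text):
    """Map each font file listed in fonts.dir to its charset registry."""
    registries = {}
    for line in fonts_dir_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skips the leading count line and any blank lines
        filename, font_name = parts
        fields = font_name.strip().split("-")
        # The last two fields of the font name are the registry and encoding.
        registries[filename] = "-".join(fields[-2:]).lower()
    return registries


def looks_unicode(registry):
    """Apply the note's rule of thumb: iso10646 means a Unicode font."""
    return registry.startswith("iso10646")


# Hypothetical fonts.dir contents, for illustration only.
SAMPLE = """\
2
sibal.bdf -sibal-devanagari-medium-r-normal--18-180-75-75-p-100-iso10646-1
legacy.pcf -misc-fixed-medium-r-normal--13-120-75-75-c-70-iso8859-1
"""

for name, registry in font_registries(SAMPLE).items():
    verdict = "Unicode" if looks_unicode(registry) else "not Unicode"
    print(name, registry, verdict)
```

To run it against a real directory, read the text with open("fonts.dir").read() from inside the font directory and pass it to font_registries.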

Converting TrueType fonts to BDF

If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). The ttf2bdf utility is freely available and can be used for this purpose. Read the man page for details on how to use the utility.

Installing and configuring "Yudit", the Unicode Text Editor

To write multilingual documents, we require a Unicode Text Editor. A very versatile Unicode Text Editor is "Yudit". Given below are the steps required to install Yudit and configure it for Devanagari:

Obtain the latest sources for Yudit from its website. The version available at the time of writing this document was yudit-2.7.6.
Extract the sources from the tarball.
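Read the README.TXT file, in the directory containing the extracted sources, for instructions on compiling and installing Yudit. The commands to be executed are:
```
   ./configure --prefix=$HOME/yudit
   make
   make install
```
$HOME/yudit in the first command can be replaced with the name of any directory where you want Yudit to be installed. ($HOME is written out instead of ~ because bash and most other shells do not expand a tilde that follows the = sign in an option, which would leave you with a directory literally named "~".)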
After the above steps have been completed, Yudit can be started by executing the binary (yudit) in ~/yudit/bin/. Invoke and exit Yudit so that appropriate configuration files are created in your home directory.
To learn how to enable support for Devanagari, read the HOWTO-devanagari.txt file within the doc directory in the original untarred sources. The steps required are:
- Download the raghu.ttf font from here
- Copy the font to ~/.yudit/fonts/
- Edit the ~/.yudit/yudit.properties file. Add raghu.ttf to the yudit.font.TrueType line. Add *-iso10646-dev to the yudit.font.Misc line
You can now use Yudit with Devanagari support. Start Yudit and select Devanagari for input in the GUI.

Have Fun :-)

Further References

Hopefully, you have learnt enough, but just in case you are thirsty for more, visit the links given below:
