Unicode, UTF-8 and Devanagari
Computers store characters by assigning a number to each one. This process is known as encoding. Most of us are familiar with ASCII, a 7-bit encoding of the characters used in the English language (it can represent at most 128 characters). With the passage of time, the need was felt for a single encoding with enough room to accommodate all the languages in the world. To enable sharing of information, this encoding would need to be a universally accepted standard. That standard is Unicode. Unicode assigns each character a unique number, called a code point, in the range 0 to 10FFFF (hexadecimal). This range fits comfortably in 32 bits and can potentially give a unique number to each character in all languages known to man.
Actually, there is another international standard, ISO 10646 of the International Organization for Standardization (ISO), which defines the Universal Character Set (UCS). Fortunately, the participants of both projects (ISO and Unicode) realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single encoding. Both projects still exist and publish their respective standards independently, but they have agreed to keep the encodings of the Unicode and ISO 10646 standards compatible.
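To make the idea of "a number for each character" concrete, here is a minimal Python sketch. The helper names code_point and fits_in_ascii are made up for this illustration; ord() is the built-in that returns a character's number.

```python
def code_point(ch: str) -> str:
    """Return the Unicode code point of one character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def fits_in_ascii(ch: str) -> bool:
    """ASCII is a 7-bit code, so it only covers the numbers 0 to 127."""
    return ord(ch) < 128


# Latin capital A, e with acute accent, Devanagari letter A
for ch in ("A", "\u00e9", "\u0905"):
    print(code_point(ch), fits_in_ascii(ch))
```

Only the first of the three sample characters has a number small enough for ASCII; the other two need a larger code space.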
Encoding standards define the numerical value, or code point, of a particular character, but that is not all. They must also define how this value will be represented in bits when stored in a computer file or transmitted over the Internet. The Unicode Standard defines three encoding forms for this purpose. They allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in units of 8, 16 or 32 bits). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The three encoding forms as defined by the Unicode Consortium are UTF-8, UTF-16 and UTF-32.
By the way, UTF stands for UCS Transformation Format.
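The claim that the encoding forms carry the same data and convert into one another without loss can be checked with Python's built-in codecs. This is only a sketch: encoded_sizes is a made-up helper, and the "-be" codec names are a choice made here to fix big-endian byte order and leave out the byte order mark.

```python
ENCODING_FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(text: str) -> dict:
    """Encode text in each form, check the round trip, return byte counts."""
    sizes = {}
    for name in ENCODING_FORMS:
        data = text.encode(name)
        # Decoding must give back exactly the text we started with.
        assert data.decode(name) == text
        sizes[name] = len(data)
    return sizes


# "namaste" written in Devanagari: six code points from the 0900 block
sample = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(encoded_sizes("A"))
print(encoded_sizes(sample))
```

For the Devanagari sample, UTF-16 is the most compact form (2 bytes per character), UTF-8 needs 3 bytes per character and UTF-32 always needs 4.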
UTF-8 has the benefit that the ASCII characters are still represented as a single byte "providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values". Any document created using the ASCII encoding is a valid UTF-8 document.
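A quick way to convince yourself of this compatibility, sketched in Python (same_bytes_in_ascii_and_utf8 is a made-up helper name):

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """True if text encodes to identical bytes in ASCII and in UTF-8.

    Raises UnicodeEncodeError if text contains non-ASCII characters,
    because such text cannot be encoded in ASCII at all.
    """
    return text.encode("ascii") == text.encode("utf-8")


# Bytes written by an ASCII-only program decode unchanged as UTF-8.
legacy_bytes = b"Content-Type: text/plain"
print(same_bytes_in_ascii_and_utf8("Hello, world"))
print(legacy_bytes.decode("utf-8"))
```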
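Non-ASCII characters are encoded using a variable-length scheme. The original design allowed sequences of 2 to 6 bytes, but since RFC 3629 restricted UTF-8 to the Unicode range (up to 10FFFF) a character takes at most 4 bytes, and the most commonly used characters need no more than three. Non-ASCII characters are encoded as follows: the first byte of a sequence begins with as many 1 bits as there are bytes in the sequence, followed by a 0 bit; every following byte begins with the bits 10; the remaining bits of all the bytes, read in order, hold the code point.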
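The variable-length scheme can be written out by hand in a few lines. The sketch below follows the current 4-byte limit of RFC 3629 rather than the older 6-byte design; utf8_encode is a made-up name, and in real programs you would simply call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, built up bit by bit."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 0xxxxxxx: ASCII is stored unchanged in a single byte.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx (Devanagari falls in this group)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Devanagari letter A, U+0905
print(utf8_encode(0x0905).hex())
```

Comparing the result with chr(cp).encode("utf-8") for a few code points is an easy way to check the bit layout.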
A little trivia: the encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history).
Unicode is used to design multilingual web pages. The preferred encoding form for Unicode characters on the web is UTF-8. To view web pages created using Unicode, the user requires a browser that supports Unicode (most new browsers do) and fonts for rendering the characters.
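The browser has to be informed about the encoding form (UTF-8, UTF-16, UTF-32) used for the page. This allows the browser to display the page correctly without any intervention from the user. There are two ways of doing this: through the HTTP Content-Type header sent along with the document, or through a META tag inside the document itself. The HTTP header looks like one of the following: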
Content-Type: text/html; charset=utf-8
Content-Type: text/plain; charset=utf-8
The header that is sent along with the document is determined by the web server. Sometimes it may not be possible to influence the HTTP headers that the web server prefixes automatically. In that case, the second method may be used: a META tag placed in the head of the HTML document.
<META http-equiv="Content-Type"
content="text/html; charset=UTF-8">
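Both ways of declaring the encoding can be read back with Python's standard library, which is a handy check that a page really announces what you think it does. This is a rough sketch and not how a browser does it: charset_from_header and charset_from_meta are made-up helper names, and the META lookup only handles the http-equiv form described here.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_header(value: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased name, or None


class _MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first META http-equiv=Content-Type tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content")
        if http_equiv == "content-type" and content:
            self.charset = charset_from_header(content)


def charset_from_meta(html_text: str):
    """Return the charset declared in a META tag, or None if absent."""
    finder = _MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


print(charset_from_header("text/html; charset=utf-8"))
print(charset_from_meta(
    '<META http-equiv="Content-Type" content="text/html; charset=UTF-8">'))
```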
In Unicode, Devanagari occupies the range from 0900 to 097F. To see how the various characters are encoded, see this pdf document. Details about Devanagari and other South Asian scripts can be found here. Before you can view Devanagari documents, you will need to install one of the many freely available Unicode fonts. See the next section for a detailed description of how to install fonts in Linux. You can check if your browser is correctly displaying Devanagari by visiting the BBC's Hindi website or Alan Wood's Devanagari Test Page.
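The Devanagari range is easy to explore from Python. A small sketch, with is_devanagari as a made-up helper; the character names come from the unicodedata module that ships with Python, so the exact set of assigned characters depends on your Python version.

```python
import unicodedata

DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block 0900 to 097F."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


# List the first few code points of the block with their official names.
for cp in range(DEVANAGARI_FIRST, DEVANAGARI_FIRST + 8):
    name = unicodedata.name(chr(cp), "<unassigned in this Python>")
    print(f"U+{cp:04X}", name)
```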
Fonts come in a variety of formats. Older versions of Linux supported PCF (".pcf"), BDF (".bdf") and SNF (".snf") font files. Newer versions also have support for TrueType fonts (".ttf"). Given below is the procedure for obtaining and installing the Sibal Devanagari BDF font, produced by the Computing Research Labs, New Mexico State University. I am assuming that you don't have privileges to modify the XF86Config file or copy the font file to the system-wide font directories.
The procedure for installing TrueType fonts is slightly different. You will find the required information by visiting the links listed at the end of this page. If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). For details, see the next section.
# mkdir ~/sibal
# mkfontdir ~/sibal
This will create a fonts.dir file in the font directory. This file contains an entry for each font in the directory. You can view this text file to see that the entries have been added. If the file is empty, that means that the mkfontdir command did not work.
# xset fp+ ~/sibal
If xset returns an error, check that you have given appropriate permissions so that the font directory, ~/sibal, can be accessed by other users. You will need to give read and execute permissions to all directories lying on the path (including your home directory). If this doesn't solve your problem, then check to see that the fonts.dir file was created correctly.
# xset fp rehash
# xlsfonts | grep sibal
You should see output similar to the following:
-sibal-devanagari-medium-r-normal--0-0-75-75-p-0-iso10646-1
-sibal-devanagari-medium-r-normal--18-180-75-75-p-100-iso10646-1
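The names printed by xlsfonts follow the X Logical Font Description (XLFD) layout: fourteen fields separated by hyphens, the last two being the character set registry and encoding. A small Python sketch for picking such a name apart (parse_xlfd and the lower-case field labels are made up for this illustration):

```python
XLFD_FIELDS = (
    "foundry", "family_name", "weight_name", "slant", "setwidth_name",
    "add_style_name", "pixel_size", "point_size", "resolution_x",
    "resolution_y", "spacing", "average_width",
    "charset_registry", "charset_encoding",
)


def parse_xlfd(name: str) -> dict:
    """Split an XLFD font name into its fourteen named fields."""
    parts = name.split("-")
    # A well-formed name starts with "-", so the first piece is empty.
    if len(parts) != len(XLFD_FIELDS) + 1 or parts[0] != "":
        raise ValueError("not a 14-field XLFD name: " + name)
    return dict(zip(XLFD_FIELDS, parts[1:]))


fields = parse_xlfd(
    "-sibal-devanagari-medium-r-normal--18-180-75-75-p-100-iso10646-1")
print(fields["family_name"], fields["pixel_size"],
      fields["charset_registry"], fields["charset_encoding"])
```

The registry field is the part to look at when checking whether a font is encoded for Unicode.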
The font is now installed and ready to be used by applications. Start Mozilla or any other browser that supports Unicode and visit any Hindi web site. Try the Hindi version of BBC or Alan Wood's Devanagari Test Page. I haven't been able to get the fonts working perfectly; there are some problems with rendering. It is not a problem with the font. I am working on them :-). If you manage to get them working, do let me know.
You will need to rerun the two xset commands every time you restart your X session. To avoid doing this, you can put the two commands into the .Xclients file (or possibly your .xinitrc or .xsession file, depending on how you start X) in your home directory. This will cause them to be run automatically every time you start X. Another way to have the commands run automatically is to edit XF86Config, but as I said earlier, you may not have the privileges to tamper with it.
NOTE: You must take care to ensure that the fonts you use are encoded for Unicode. There are many fonts available that do not use the Unicode encoding. For example, many Devanagari fonts (including fonts from CDAC) use the ISFOC encoding. These fonts cannot be used for viewing multilingual web pages that use Unicode. If the fonts.dir file contains "iso8859" then the fonts are not Unicode. If the fonts.dir file contains "iso10646" then the fonts are Unicode. You can also open the ".bdf" files in any text viewer to verify that the ranges used for the characters conform to the Unicode standard. Lines starting with STARTCHAR denote the code point for the character defined next. If you see lines such as "STARTCHAR U+0901" then you know that the font is a Unicode font. On the other hand, lines such as "STARTCHAR 0021" mean that the font is not Unicode.
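The two checks described in the note can be automated. The Python sketch below simply encodes the rules of thumb given above; it is not a real BDF parser, and the function names are made up for this illustration.

```python
def classify_fonts_dir(fonts_dir_text: str) -> str:
    """Apply the fonts.dir rule of thumb: iso10646 means Unicode."""
    lowered = fonts_dir_text.lower()
    if "iso10646" in lowered:
        return "unicode"
    if "iso8859" in lowered:
        return "not unicode"
    return "unknown"


def bdf_glyph_names(bdf_text: str) -> list:
    """Collect the names that follow STARTCHAR in a BDF file's text."""
    names = []
    for line in bdf_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "STARTCHAR":
            names.append(parts[1].strip())
    return names


def bdf_looks_unicode(bdf_text: str) -> bool:
    """True if every STARTCHAR name has the form U+XXXX."""
    names = bdf_glyph_names(bdf_text)
    return bool(names) and all(n.upper().startswith("U+") for n in names)


print(bdf_looks_unicode("STARTCHAR U+0901\nENCODING 2305\nENDCHAR\n"))
print(bdf_looks_unicode("STARTCHAR 0021\nENCODING 33\nENDCHAR\n"))
```

To run it on real files, read fonts.dir or the ".bdf" file into a string first and pass that string to the functions.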
If your system does not support TrueType, you can convert the TrueType font files ("*.ttf") into BDF font files ("*.bdf"). The ttf2bdf utility is freely available and can be used for this purpose. Read the man page for details on how to use the utility.
To write multilingual documents, we require a Unicode Text Editor. A very versatile Unicode Text Editor is "Yudit". Given below are the steps required to install Yudit and configure it for Devanagari:
# ./configure --prefix=~/yudit
# make
# make install
~/yudit in the first command can be replaced with the name of any directory where you want Yudit to be installed.
Have Fun :-)
Hopefully, you have learnt enough, but just in case you are thirsty for more, visit the links given below: