Motivation for Technical Choices
Storing documents written in English (and most other European languages) so that they can be served to users without specialized software is straightforward. This is not so for Bengali (and other Indian languages), because of the complex nature of its script. A few words of explanation about the technical choices we made are thus in order.
There are two important questions to be settled:
- How the documents would be stored
- How the documents would be viewed
Format of the documents
Our documents will be HTML files or plain text files, with Bengali characters represented by their Unicode code points and encoded as UTF-8. This part is well supported on almost all modern platforms.
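As a rough illustration of this storage format, the sketch below builds a small HTML document containing a Bengali word and encodes it as UTF-8. The helper name `build_html` and the sample word are assumptions made for this example, not part of our actual tooling.

```python
def build_html(body_text):
    """Wrap text in a minimal HTML page that declares UTF-8 (hypothetical helper)."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>Sample</title>\n"
        "</head>\n<body>\n"
        f"<p>{body_text}</p>\n"
        "</body>\n</html>\n"
    )

# The word "Bangla", written with escapes so this source stays plain ASCII:
# BA, VOWEL SIGN AA, ANUSVARA, LA, VOWEL SIGN AA.
word = "\u09ac\u09be\u0982\u09b2\u09be"

html = build_html(word)
data = html.encode("utf-8")           # the bytes that would be stored on disk
word_bytes = word.encode("utf-8")     # just the Bengali part

# Code points in the Bengali block (U+0980..U+09FF) take three bytes each in UTF-8.
print(len(word), "code points ->", len(word_bytes), "bytes")
```

Decoding `data` with UTF-8 gives back the same string, so nothing about the text is lost in storage. The difficulty lies in display, not in the file format.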
Viewing the documents
Unicode represents Bengali text as a sequence of Bengali characters. Unlike with most European scripts, simply rendering these characters one after another is not enough for Bengali (and other Indic scripts); it is necessary to form new glyphs by combining several characters. In the picture below, the characters on the left are an example of what might be in the UTF-8 encoded HTML file, while on the right is what we would expect to actually see on the screen.
The rules for converting from the first form to the second are not that difficult; however, until recently, there was no accepted standard that described them. This has been addressed in the extension to the TrueType font format known as OpenType (along with some rules in Unicode for reordering the characters before combining them). A description of the parts of the specification relevant to Indic scripts is available through Microsoft's typography site (you will have to look around a bit, as the exact links seem to move from time to time). ![]()
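To make the stored form concrete, here is a small sketch that lists the Unicode characters behind two short Bengali sequences. The sequences and the helper name `describe` are our own choices for illustration and are not necessarily the ones shown in the picture. In the first, the vowel sign is stored after the consonant even though it is drawn to the consonant's left; in the second, three stored characters are expected to combine into a single conjunct glyph.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each stored character (hypothetical helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# "ki": the consonant KA followed by VOWEL SIGN I.
# Stored order is consonant first; on screen the vowel sign appears to the left.
ki = "\u0995\u09bf"

# "kssa": KA + VIRAMA (hasant) + SSA, normally rendered as one conjunct glyph.
kssa = "\u0995\u09cd\u09b7"

for label, word in (("ki", ki), ("kssa", kssa)):
    print(label)
    for code, name in describe(word):
        print("   ", code, name)
```

A renderer without support for these rules would simply draw each listed character in stored order, which is the left-hand form described above.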
Support in various operating systems
Although Unicode is well supported, OpenType is an evolving technology and may not be supported on all platforms. This page should tell you more about the current state of affairs on the platform of your choice.
It should be noted that OpenType is not the only technology that could potentially deal with complex scripts like Bengali (i.e., with the problem of correctly rendering text supplied as Unicode). However, it is currently the only technology that we are familiar with, so all discussion here will be restricted to it.