ColdFusion, SQL Server, Flash and Globalization
Contents:
Part 1: Introduction
Part 2: Code pages
Part 3: Unicode encodings
Part 4: Unicode and ColdFusion & HTML
Part 5: Resource files
Part 6: SQL Server / Flash MX and Unicode
Part 7: Time & date and globalization
Part 8: Miscellaneous notes
1. Introduction
At some point in the development of websites, a developer will be asked to create a site that caters to an audience that
speaks more than one language. I created this small guide/tutorial for myself, to serve later as a reference
on how to create sites that are truly international. By a website that supports globalization I mean one that supports more
than one language and more than one style of display (of dates and currency, for example) in an organized way. For two or
three languages a website may get away with multiple copies of the same front end and some hacking around, but
for more sophisticated websites I introduce the notion of resource files. I also push for the use of Unicode throughout the
whole website - even if the website is in English only, it doesn't hurt to set its encoding to UTF-8.
These short notes are far from complete. They may also contain some inaccuracies, as I am not an expert in the subject. So
please send me your comments and corrections - they will help me understand the topic better and make this paper of greater
value to the web community.
Guide version 0.1 last updated on 26/01/2007.
I provide this guide as is, without any guarantees, explicit or implied, as
to its contents. You may use the information contained herein in your computer
career, however I take no responsibility for any damages you may incur as a
result of following this guide. You may use this document freely and share it
with anybody as long as you provide the whole document in one piece and do not
charge any money for it. If you find any mistakes, please feel free to inform
me (Tom Kitta) about them. Legal stuff aside, let us start.
2. Code pages
- Legacy Windows code pages / character encodings are a thing of the past - use Unicode.
- Definition: code page is the traditional IBM term for a specific character encoding table; see Wikipedia for more.
- Before Unicode, a code page was the way to encode a certain set of characters for one (or more) languages.
- For most Western languages the encoding was ISO-8859-1, which is part 1 of the ISO 8859 specification. Many web pages are encoded in this format.
- Using a specialized (limited) code page is fine until one realizes how many of these code pages would be needed to cover all languages.
Some code pages are not the best fit for the languages they target (missing characters). Throw into the mix the fact that there are
multiple code pages for the same language.
- Unicode solves this problem - one character set that covers all languages.
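To make the code page problem concrete, here is a minimal Java sketch (not part of the original notes; the class and method names are made up for illustration). It shows that a single byte means different characters under different code pages, and what happens when text is read with the wrong table. It assumes the JDK in use ships the ISO-8859-5 (Cyrillic) code page, which is common but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class CodePageDemo {
    // Decode one byte value using the given code page.
    static String decode(int byteValue, Charset codePage) {
        return new String(new byte[] {(byte) byteValue}, codePage);
    }

    // Write text with one encoding, then read the bytes back with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // Assumption: this JDK ships the Cyrillic code page ISO-8859-5.
        Charset cyrillic = Charset.forName("ISO-8859-5");

        // The same byte, 0xE9, is a different letter in each code page.
        System.out.println(decode(0xE9, StandardCharsets.ISO_8859_1)); // U+00E9
        System.out.println(decode(0xE9, cyrillic));                    // U+0449

        // UTF-8 bytes misread as ISO-8859-1 turn one letter into two.
        System.out.println(reinterpret("\u00e9", StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1));
    }
}
```

The bytes themselves carry no hint of which table produced them, which is why the encoding has to be declared somewhere outside the text.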
3. Unicode encodings
- The first 256 Unicode code points correspond to those of ISO 8859-1, the most popular 8-bit character encoding in the Western world.
- There is more than one way to encode Unicode. Unicode is a "Universal Character Set" with room for 65,536 code points in its
basic plane (the most common characters) and over a million more in the supplementary planes. Read more on Wikipedia.
Information about how characters are mapped in the latest version, Unicode 5, can be found in the code charts published by the Unicode Consortium.
- UTF-8 is by far the most popular encoding for web pages. It uses one byte for ASCII characters (the most common Western
characters), two bytes for most accented Latin letters and up to 4 bytes for other characters. For more info read Wikipedia.
- UTF-16 is used by SQL Server 2000 (strictly speaking, its 2-byte subset UCS-2). This is the only encoding SQL Server 2000
supports for Unicode. It takes 2 bytes per character for the most common 64k characters and 4 bytes for all other characters.
Read more about it on Wikipedia.
- UTF-32 - not of interest to web developers; it uses 4 bytes to represent every character in the Unicode specification.
- Unicode is not just a mapping of characters to numbers, it's a standard - for example, it defines how characters of
bi-directional languages (like Arabic) should be displayed.
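The size differences between the three encodings are easy to check. The following is a small Java sketch (the class and method names are hypothetical) that counts the bytes needed for one ASCII letter, one accented Latin letter, one Chinese character from the basic plane and one character from a supplementary plane. The UTF-32 figure is computed as 4 bytes per code point instead of calling an encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EncodingSizeDemo {
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte order mark is added to the count.
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 always uses 4 bytes per code point, so compute it directly.
    static int utf32Bytes(String text) {
        return 4 * text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // ASCII letter
            "\u00e9",                               // accented Latin letter
            "\u4e2d",                               // Chinese, basic plane
            new String(Character.toChars(0x20000))  // Chinese, supplementary plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d UTF-16=%d UTF-32=%d%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```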
4. Unicode and ColdFusion & html
- All you need to do is place "<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />" in the
header section of every HTML template. Now the browser will treat the text on the page as Unicode in UTF-8 encoding. You don't
need to do much to your existing HTML text, as UTF-8 is backward compatible with ASCII. Note that CF can override this tag: the
charset in the Content-Type HTTP header sent by the server takes precedence over the meta tag.
- Use the language meta tag to help text-to-speech readers figure out the language of the document - for example,
<meta http-equiv="content-language" content="en-us,en-ca">.
- You can also use the <cfprocessingdirective> tag with the "pageencoding" attribute set to "utf-8", so that ColdFusion reads the template file itself as UTF-8.
- To test your new ability to display any character in any language on a web page, go to a chart of Unicode plane 0 and
copy some characters from, say, Chinese or Arabic. Paste them into your web page (i.e. into the IDE with which you are editing your web page).
If you see garbage, then you need to change the font in which the IDE renders the text you are editing. By default most
applications use a font that can only display glyphs (the formal name for the visual shapes of characters) for the ISO-8859-1
character set. Choose a font that has Unicode in its name. If the application still has issues, take a look at Windows Notepad - in XP it can save and read
Unicode encoded files.
- Now save your page and load it in a browser. You should see the characters you want to display shown on the page. Note that
some older browsers may have issues with Unicode. IE7/6 and FF1/2 seem to have no trouble.
- When you select a font that has no glyph for the character you want to show, the browser will search the installed fonts for
one that has the glyph - so parts of the text may display in a different font.
- Sometimes there is no font with the glyph for a given character on the user's PC - the user must install a font or font pack to
see the characters. This should not be an issue in practice, since a Chinese user should already have all the fonts installed on their PC to
view Chinese characters.
- Note that your PC most likely doesn't have glyphs for all characters even in the basic plane (i.e. the most common 65k characters).
When such a character is encountered you may see its number or some graphic if you have a fallback font installed. Otherwise
you may see a question mark, a white box with black borders or something like that.
- There are very few fonts with characters from planes other than the basic multilingual plane - i.e. from any of the
supplementary planes
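Since fonts for the supplementary planes are rare, it can help to check which plane a character lives in before relying on it. Below is a minimal Java sketch of that check (the class and method names are hypothetical, not from any library). A plane holds 65,536 code points, so the plane number is simply the code point divided by 65,536.

```java
// Hypothetical demo class; the names are illustrative only.
class PlaneDemo {
    // Each plane holds 0x10000 (65,536) code points; plane 0 is the basic plane.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    // True if any character in the text lies outside the basic multilingual plane.
    static boolean hasSupplementaryCharacters(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // musical symbol
        System.out.println(planeOf(0x4E2D));                         // basic plane
        System.out.println(planeOf(0x1D11E));                        // a supplementary plane
        System.out.println(hasSupplementaryCharacters("title"));
        System.out.println(hasSupplementaryCharacters("key " + clef));
    }
}
```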
5. Resource files
- When the interface of one application has to be displayed in multiple languages, it is a good idea
to de-couple the presentation layer from the language layer (i.e. don't hard-code the actual text in the display page)
- Resource files are made up of name-value pairs where the name is a tag used in the presentation layer, for example:
"title=English Version title".
- Load the resource files into the application scope, with the locale as the key in a structure
- You can write Unicode characters either in escape notation, such as \u0E44, or as literal characters, such as é
- You can create resource files by hand in Notepad, to be parsed with standard ColdFusion code and maintained in
Notepad - or you can create standard Java resource files, which are created with open source tools and require
Java classes to read
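The bullets above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class, the locale codes and the French text are made up, a plain map stands in for the ColdFusion application scope, and the file contents are passed in as strings so that the example is self-contained. It uses java.util.Properties, which reads name=value pairs and decodes the backslash-u escape notation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical demo class; the names are illustrative only.
class ResourceDemo {
    static final String DEFAULT_LOCALE = "en_US";

    // Stand-in for the application scope: locale code -> name/value pairs.
    static final Map<String, Properties> RESOURCES = new HashMap<>();

    // Parse the contents of one resource file and store it under its locale.
    static void load(String locale, String fileContents) {
        Properties pairs = new Properties();
        try {
            pairs.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        RESOURCES.put(locale, pairs);
    }

    // Look up a tag for a locale, falling back to the default locale.
    static String text(String locale, String name) {
        Properties pairs = RESOURCES.getOrDefault(locale, RESOURCES.get(DEFAULT_LOCALE));
        if (pairs == null) {
            return "[" + name + "]";
        }
        return pairs.getProperty(name, "[" + name + "]");
    }

    public static void main(String[] args) {
        load("en_US", "title=English Version title\n");
        load("fr_CA", "title=Titre de la version fran\\u00e7aise\n");
        System.out.println(text("en_US", "title"));
        System.out.println(text("fr_CA", "title"));
        System.out.println(text("de_DE", "title")); // falls back to en_US
    }
}
```

The presentation layer then only ever refers to tags such as "title", never to the text itself.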
6. SQL Server / Flash MX and Unicode
- Sorting is not easy - it is hard to sort correctly for many locales, as people in different European countries sort
differently. Some languages in the CJK set sort based on the strokes in characters or on accents.
- First rule of sorting - let the DB do as much of it as possible
- ColdFusion sorts by default in Unicode order of characters, which is not OK for special characters, as they will appear out of
order - use the java.text.Collator class or the ICU4J library as a helper
- You should expect that people in different countries will have different formats for addresses - make
your DB tables very flexible
- In SQL Query Analyzer you will probably need to change your font to a Unicode one - as with other applications - to see some
non-Western characters. Inserting and selecting Unicode data in SQL is very simple. All you need to remember is that
you cannot store Unicode in regular varchar and char fields; use nvarchar, nchar and ntext. The 'n' indicates
Unicode. Since SQL Server uses 2 bytes per character for Unicode and a field is limited to 8000 bytes, your nvarchar field
can store at most 4000 characters vs. the 8000 allowed for regular varchar. Each time you use a Unicode string literal you
need to precede it with 'N' to indicate Unicode, as in
"SELECT * FROM test WHERE name = N'Tom'"; the same goes for other operations. Instead of "Tom" you could type in any Unicode characters, for example
Ƒƛǂ. You need to change the font of the analyzer's output screen to see results from the DB that are not in
Western characters.
- Flash handles Unicode well - you cannot type characters that are not in your IDE's locale into the fields, but
you can load Unicode text. The user's computer will have to have a font with the glyphs that you want to show. If you are
uncertain of that, you must embed the characters, which increases the file size a lot.
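Returning to the sorting bullets at the start of this section, here is a minimal Java sketch of the difference between sorting in plain Unicode order and sorting with java.text.Collator for a given locale. The class name, method names and the word list are made up for illustration; ICU4J offers a similar Collator class but is not used here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
class SortDemo {
    // Plain String ordering: compares the numeric values of the characters.
    static List<String> sortByCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Locale-aware ordering: the Collator applies the locale's sorting rules.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00e9clair", "apple");
        // The accented word lands after "zebra" in plain Unicode order...
        System.out.println(sortByCodePoint(words));
        // ...but between "apple" and "zebra" when collated for French.
        System.out.println(sortForLocale(words, Locale.FRENCH));
    }
}
```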
7. Time & date and globalization
- There are many calendars in use around the world; the Western world uses the Gregorian calendar
- When recording date and time use UTC - it's more precisely defined than GMT and is a standard
- Use the Java ICU4J library to convert between different calendars
- Use third-party Java classes (encapsulated in CFCs) to do time zone conversions and handle DST
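As an illustration of "record in UTC, convert for display", here is a minimal Java sketch. Note the assumption: instead of the third-party classes mentioned above, it uses the java.time package that is built into Java 8 and later, and the class and method names are made up. The stored value is a UTC instant; the time zone rules, including DST, are applied only when it is shown to a user.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical demo class; the names are illustrative only.
class TimeDemo {
    // Convert a stored UTC instant to the wall-clock time of a time zone.
    static ZonedDateTime toLocal(Instant utcInstant, String zoneId) {
        return utcInstant.atZone(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        // Two values as they would be stored in the DB: always UTC.
        Instant winter = Instant.parse("2007-01-26T12:00:00Z");
        Instant summer = Instant.parse("2007-07-26T12:00:00Z");

        // London is on GMT in winter and one hour ahead in summer (DST).
        System.out.println(toLocal(winter, "Europe/London"));
        System.out.println(toLocal(summer, "Europe/London"));

        // New York is 5 hours behind UTC in winter and 4 hours in summer.
        System.out.println(toLocal(winter, "America/New_York"));
        System.out.println(toLocal(summer, "America/New_York"));
    }
}
```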
8. Miscellaneous notes
- It is a good idea to think about globalization before coding starts (as it's hard to bolt clean multi-locale support onto an application later on)
- Take a look at the Java ICU4J library and some of its CFC based encapsulations - it has better support for locales than the core Java available through ColdFusion
- Consider not using little flags for different languages - some Britons don't want to click on a US flag. How about using geolocation to determine the user's locale?
- It's not only about language and time; there are other formatting issues as well, such as website color skinning
(some cultures react differently to different colors). Consider also sorting order and measurement units / printing format
- G11N stands for GlobalizatioN - first and last letter, there are 11 in between
- I18N stands for InternationalizatioN - first and last letter, there are 18 in between
- Don't forget that in right-to-left languages the whole page should be right aligned vs. left aligned
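For the last point, one possible way to decide the page direction is to look at the text itself. The Java sketch below (class and method names are hypothetical) returns "rtl" or "ltr" based on the first character that has a strong direction, using the directionality data that Java exposes through Character.getDirectionality. In a real application the direction would more likely be stored with the locale's resource file; this is only an illustration.

```java
// Hypothetical demo class; the names are illustrative only.
class DirectionDemo {
    // Returns "rtl" if the first strongly directional character is
    // right-to-left (e.g. Arabic or Hebrew), otherwise "ltr".
    static String pageDirection(String text) {
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            byte direction = Character.getDirectionality(codePoint);
            if (direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return "rtl";
            }
            if (direction == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return "ltr";
            }
            // Digits, spaces and punctuation have no strong direction: skip them.
            i += Character.charCount(codePoint);
        }
        return "ltr";
    }

    public static void main(String[] args) {
        System.out.println(pageDirection("Hello"));
        System.out.println(pageDirection("\u0645\u0631\u062d\u0628\u0627")); // Arabic
        System.out.println(pageDirection("\u05e9\u05dc\u05d5\u05dd"));       // Hebrew
    }
}
```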