Additional notes on
Unicode-based documents
[I]. Set up
1) Outlook-98 in Win-NT
Tools -- Options -- Mail Format
* Message Format: HTML
* Stationery & Fonts: Character Set - Universal Alphabet (UTF-8) -
Set as Default
2) For IE 6.x
View -- Font / Encoding --
Universal Alphabet (UTF-8)
or Right-click the mouse, then: Language -- Universal Alphabet
(UTF-8)
3) For Netscape:
* View -- Encoding (or Character
Set) -- Unicode (UTF-8)
* Edit -- Preferences -- Appearance-Fonts -- Use document-specified
fonts
[II]. Printers:
1) HP Laser printers: may need
adjustment, as follows:
1.a) Models HP-III, HP-4M, HP-5Si
File -- Print -- Properties --
Advanced -- Documents Options --
Print Text as Graphics: ON
1.b) Model HP-5M
File -- Print -- Properties --
Advanced -- Options --
Graphic Mode: HP-GL/2
Laser III compatible: ENABLED
1.c) Model HP-8000, HP-4MP
File -- Print -- Properties --
Finishing -- Details --
Font Settings: Send True Type as Bitmaps.
1.d) Other models: follow one of the
above procedures.
2) HP Inkjet printers:
- HP Inkjet 2500C: cannot print
Unicode pages, either from a browser or from Word-97.
- HP Inkjet 721C: can print Unicode in Word-97
- HP 970 Deskjet: can print in Word-97 and in Netscape 4.x (but not IE
5.x)
3) Other printers:
- CANON Bubblejet BJC: can print
Unicode in Win-98/Word-2000
- PANASONIC Laser printer KX series: can print Unicode with both
browsers
- RICOH Aficio 270: can print Unicode only in Word
- EPSON Color Stylus series can print Unicode documents either from
browsers or from Word.
[III]. Resources:
1) Alan Wood's Unicode Resources: http://www.alanwood.net/unicode/
2) Unicode for Vietnamese: http://www.vovisoft.com/vovisoft/UnicodeChoVN.htm
3) Unicode consortium: http://www.unicode.org/
4) See also links and information on Viet
Unicode: http://vietunicode.sourceforge.net/
[IV]. Fonts:
1) Basic fonts come with Office-2000,
Windows-98 SE, Windows-Me, Windows-2000, Windows XP. For older versions,
check these fonts:
- Core fonts: Arial, Courier New,
Times New Roman, version 2.76 or later. If your versions are older,
download and install newer ones.
- Not all WGL-4 fonts supplied by Microsoft contain VN characters.
2) A larger set: Arial-Unicode MS by
Microsoft and CN-Times by Chan-Nguyen, which include Chinese-Japanese-Korean characters (15 Mb, zipped),
for Viet-Han texts.
3) VU-Times by Ho Phuoc Hung for
Viet-Pali texts.
[V]. Software and Hardware
The following is a list of the common software
and hardware I use for our web site.
Keyboard programs:
1) VPS-Keys 4.3 (freeware): http://www.hcgvn.net/software/
2) WinVNKey, 4.0 (freeware): http://sourceforge.net/projects/winvnkey
3) UniKey, 3.55 (freeware): http://sourceforge.net/projects/unikey
Document and graphics preparation:
1) MS Word-2000, -XP
2) MS Image Composer 1.5
3) Corel Draw and Corel PhotoPaint, versions 9 & 11
Document conversion programs:
1) Convert2anything (freeware), by
Cafe68T http://cafe68t.multimania.com/content/unicode/download.html
2) VoviSoft (freeware), http://www.vovisoft.com/vovisoft/UnicodeChoVN.htm
3) VPSKeys 4.3 (freeware), http://www.hcgvn.net/software/
4) UniKey 3.55 (freeware), http://sourceforge.net/projects/unikey
5) WinVNKey, 4.0 (freeware): http://sourceforge.net/projects/winvnkey
Web page set up:
1) MS Frontpage-2000, -XP (commercial)
2) Arachnophilia 4.0 (freeware): http://www.arachnoid.com/arachnophilia/
Operating systems:
1) Windows 2000
2) Windows XP
Browser: IE 6.x
System hardware:
1) PC Pentium-IV 1.6 GHz, 512 Mb RAM
with Win-XP
2) PC Pentium Celeron 1.6 GHz, 256 Mb RAM, with Win-XP
3) PC Pentium Xeon 2.8 GHz, 2Gb RAM, with Win 2000
Printers:
1) Epson Stylus series (color inkjet)
2) HP Laser 5L
3) Many networked HP Laser printers (4x, 5x) and Inkjet printers.
[VI]. Mac machines:
I have no experience with Mac machines
and Mac-OS. You might like to consult Alan Wood's website at: http://www.alanwood.net/unicode
[VII] UTF-8
UTF-8 (UTF: Unicode Transformation Format) has
the characteristic of preserving the full US-ASCII range, providing
compatibility with file systems, parsers and other software that rely on
US-ASCII values but are transparent to other values.
This section is only an illustration of how you can
encode a Unicode character in UTF-8.
1) Take the Unicode value of the character to find out how many bytes
you need. Unicode values are given in hexadecimal & decimal numbers:
Hex Range          | Dec Range                       | Bytes
0000-007F          | 0 - 127                         | 1 byte
0080-07FF          | 128 - 2,047                     | 2 bytes
0800-FFFF          | 2,048 - 65,535                  | 3 bytes
10000-1FFFFF       | 65,536 - 2,097,151              | 4 bytes
200000-3FFFFFF     | 2,097,152 - 67,108,863          | 5 bytes
4000000-7FFFFFFF   | 67,108,864 - 2,147,483,647 (*)  | 6 bytes
(*) At most 2,147,483,648 (2**31) characters could be encoded this way.
The current UTF-8 standard (RFC 3629) uses only the 1- to 4-byte forms,
for values up to 10FFFF.
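As a rough illustration, step 1 can be sketched in Python. The function name utf8_length is made up for this example, and the sketch simply follows the ranges in the table above, including the 5- and 6-byte rows.

```python
def utf8_length(code_value: int) -> int:
    """Return the number of bytes the table above assigns to a code value."""
    if code_value < 0:
        raise ValueError("code value must be non-negative")
    # Inclusive upper limit of each row of the table, from 1 byte to 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_value <= limit:
            return n_bytes
    raise ValueError("code value does not fit in 31 bits")


if __name__ == "__main__":
    # 0x8336 falls in the 0800-FFFF row of the table.
    print(utf8_length(0x8336))
```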
2) Convert the hex code to binary form and fill in the empty bits:
1 byte  | 0xxxxxxx
2 bytes | 110xxxxx 10xxxxxx
3 bytes | 1110xxxx 10xxxxxx 10xxxxxx
4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Example:
The Unicode value of 'tea' (Han) is 8336 (dec:
33,590), so you need 3 bytes. The binary form of hexadecimal
8336 is:
- 10000011 00110110
Fill the empty slots of the three-byte
template with the binary value of 'tea' and you will get:
- 11101000 10001100 10110110
Thus you have converted 0x8336 to 3 bytes: 0xE8 0x8C
0xB6.
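The two steps can also be tried out in code. Below is a minimal Python sketch, assuming the byte templates listed above; the function name utf8_encode and the layout of the templates list are invented for this illustration. It keeps the 5- and 6-byte forms only because the table lists them; Python's own str.encode("utf-8") stops at 4 bytes.

```python
def utf8_encode(code_value: int) -> bytes:
    """Encode one code value by filling the bit templates shown above."""
    if code_value < 0 or code_value > 0x7FFFFFFF:
        raise ValueError("code value must be in the range 0..0x7FFFFFFF")
    if code_value <= 0x7F:
        return bytes([code_value])  # 1 byte: 0xxxxxxx

    # (inclusive upper limit, number of bytes, fixed bits of the first byte)
    templates = [
        (0x7FF, 2, 0b11000000),
        (0xFFFF, 3, 0b11100000),
        (0x1FFFFF, 4, 0b11110000),
        (0x3FFFFFF, 5, 0b11111000),
        (0x7FFFFFFF, 6, 0b11111100),
    ]
    for limit, n_bytes, lead in templates:
        if code_value <= limit:
            break

    # Fill the trailing 10xxxxxx bytes from the right, 6 bits at a time.
    out = []
    value = code_value
    for _ in range(n_bytes - 1):
        out.append(0b10000000 | (value & 0b00111111))
        value >>= 6
    # Whatever bits remain go into the free slots of the first byte.
    out.append(lead | value)
    return bytes(reversed(out))


if __name__ == "__main__":
    # The 'tea' example from the text: 0x8336 becomes E8 8C B6.
    print(utf8_encode(0x8336).hex(" "))
```

For values up to 0x10FFFF (outside the range D800-DFFF) the result can be compared against chr(value).encode("utf-8").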
[VIII] UTF-16 Conversion
UTF-16 definition
Each character is assigned a number, which Unicode calls the Unicode
scalar value. In the UTF-16 encoding, characters are represented using
either one or two unsigned 16-bit integers. The rules for how characters
are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a
single 16-bit integer with a value equal to that of the character
number.
- Characters with values between 0x10000 and 0x10FFFF are represented
by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the
so-called high-half zone or high surrogate area) followed by a 16-bit
integer with a value between 0xDC00 and 0xDFFF (within the so-called
low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically
reserved for use with UTF-16, and don't have any characters assigned to
them.
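The rules and the note above can be sketched as two small Python helpers. The names utf16_unit_count and classify_unit are made up for this illustration, and raising ValueError for the reserved range D800-DFFF is an assumption of the sketch, not something the definition prescribes.

```python
def utf16_unit_count(code_value: int) -> int:
    """Return how many 16-bit integers UTF-16 needs for a character value."""
    if code_value < 0 or code_value > 0x10FFFF:
        raise ValueError("value cannot be encoded in UTF-16")
    if 0xD800 <= code_value <= 0xDFFF:
        # Reserved for surrogates; no characters are assigned here.
        raise ValueError("value is reserved for use with UTF-16")
    return 1 if code_value < 0x10000 else 2


def classify_unit(unit: int) -> str:
    """Say which kind of 16-bit integer a UTF-16 code unit is."""
    if unit < 0 or unit > 0xFFFF:
        raise ValueError("not a 16-bit value")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "single character"


if __name__ == "__main__":
    print(utf16_unit_count(0x8336), classify_unit(0x8336))
    print(utf16_unit_count(0x1D11E), classify_unit(0xD834))
```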
Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character number, no greater than
0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and
terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
U' must be less than or equal to 0xFFFFF. That is, U' can be represented
in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
0xDC00, respectively. These integers each have 10 bits free to encode
the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of
W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx (binary, 20 bits)
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
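Steps 1 through 4 translate almost line by line into code. Here is a minimal Python sketch; the function name utf16_encode is invented for this illustration, and it returns the 16-bit integers as a plain list without choosing a byte order. As an extra assumption, it does not reject values in the reserved range D800-DFFF.

```python
def utf16_encode(u: int) -> list:
    """Encode one character value U as one or two 16-bit integers."""
    if u < 0 or u > 0x10FFFF:
        raise ValueError("value cannot be encoded in UTF-16")
    # Step 1: values below 0x10000 are stored as a single 16-bit integer.
    if u < 0x10000:
        return [u]
    # Step 2: U' fits in 20 bits.
    u_prime = u - 0x10000
    # Steps 3 and 4: the 10 high-order bits go into W1,
    # the 10 low-order bits go into W2.
    w1 = 0xD800 | (u_prime >> 10)
    w2 = 0xDC00 | (u_prime & 0x3FF)
    return [w1, w2]


if __name__ == "__main__":
    # U+1D11E lies above 0xFFFF, so it needs a surrogate pair.
    print([hex(w) for w in utf16_encode(0x1D11E)])
```

For example, U = 0x1D11E gives U' = 0x0D11E, whose high 10 bits are 0x034 and low 10 bits are 0x11E, so W1 = 0xD834 and W2 = 0xDD1E.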
-ooOoo-