software.intel.com · web viewfor example, us-ascii is a fixed-length encoding scheme, which always...


Upload: doannhu

Post on 18-May-2018


Characters, Encodings, and Globalization

“I don’t know, but it works on my machine!” Ever said that? Ever heard that? Ever wondered why it happens? Characters and encodings are basic concepts that every developer must understand. Now that products are sold and used across the globe by people from different countries, languages, and cultures, globalization has become a key requirement for any product. A character or a text string does not get translated from one language to another automatically, and globalization is about more than using resource bundles.

In this article, you will learn the basic concepts of character, glyph, encoding, font, and rendering, which are essential for every developer, whatever the platform, program, or language. We then clear up the myths around ASCII and move on to Unicode: what Unicode is, what it is not, its architecture, and its three popular encoding schemes, UTF-8, UTF-16, and UTF-32. Next, we look at the Unicode support in XML and the Java language. Finally, we discuss the globalization concepts (internationalization, localization, and translation) in general, and how they apply to Java.

Basic Concepts

Character and Glyph

A character is a platonic, abstract entity that exists in theoretical space. Because a character is abstract, it is usually referred to by its name, for example the “English Letter A” (‘A’). What we actually see is the visual representation of the character, called its glyph.

A single character can have multiple glyphs. For example, the abstract character ‘A’ can have, among others, the following glyphs, all of which mean the same thing, the character ‘A’:


(Figure: An abstract character in the Abstract Character Space maps to many glyphs in the Character Glyph Space)

The Abstract Character Space is the set of all characters in this world. The set of all possible glyphs for all abstract characters is called the Character Glyph Space, and every abstract character maps to multiple glyphs in it. Let’s look at some characteristics of characters and glyphs.

Upper and lower case characters

The English language (and some other languages, for example Latin) has the concept of upper case and lower case characters, but not all writing systems do (for example, Devanagari and many East Asian scripts have no such concept). Even though the abstract upper case character ‘A’ semantically means the same thing as the abstract lower case character ‘a’, for the sake of simplicity the two are treated as different characters in the machine world.

Ligatures

We saw that one character can have multiple glyphs, but it is also true that some glyphs represent multiple characters. For example, the glyph ‘æ’ is a combination of the two glyphs ‘a’ and ‘e’, corresponding to the characters ‘a’ and ‘e’ respectively. Such glyphs, formed when characters combine and change shape depending on their adjacent characters, are called ligatures. Other examples of ligatures are ‘fi’, ‘fl’, ‘ff’, etc.

(Figure: A ligature broken into individual characters)


Decomposable characters

Decomposable characters (aka composite or precomposed characters) are characters that can be decomposed into multiple smaller characters. For example, the French letter ‘e with the acute accent’ (é) can be broken down into the character ‘e’ and the acute accent character.

(Figure: A decomposable character broken into individual characters)

The acute accent and other such marks, like the circumflex (^), grave accent (`), cedilla (¸), macron (¯), diaeresis (¨), etc., are called diacritical marks. When more than one diacritical mark is used on a single base character, they usually stack either above or below the base character, in the order in which they appear relative to the base character.

Technically, a ligature can also be called a composite character, because it too can be decomposed into individual characters, but there is a subtle difference between the two. The individual characters in a ligature are complete, independent characters that can exist on their own; this is not so for decomposable characters. For example, the acute accent (like other diacritical marks) that is part of a decomposable character does not mean anything on its own, but only in the context of a complete character (aka base character).
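The difference between a decomposable character and a ligature can be observed with Unicode normalization. The following is a minimal sketch, assuming Python 3 and its standard `unicodedata` module (neither is part of the original article); the variable names are illustrative only.

```python
import unicodedata

# 'é' stored as a single precomposed character (U+00E9).
e_acute = "\u00e9"

# Canonical decomposition (NFD) splits it into the base character 'e'
# followed by the combining acute accent.
decomposed = unicodedata.normalize("NFD", e_acute)
code_points = [f"U+{ord(c):04X}" for c in decomposed]

# The 'fi' ligature (U+FB01) only has a *compatibility* decomposition,
# so NFKD is needed to expand it into the independent characters 'f' and 'i'.
fi_ligature = "\ufb01"
expanded = unicodedata.normalize("NFKD", fi_ligature)

# A non-zero combining class marks a character that cannot stand alone.
accent_is_combining = unicodedata.combining("\u0301") != 0

print(code_points, expanded, accent_is_combining)
```

The accent is reported as a combining mark, while ‘f’ and ‘i’ are ordinary characters, which mirrors the distinction drawn above.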

Character Set and Coded Character Set

A set of abstract characters is called a character set. A set by definition has no order, so we must not assume any. A character set is just a concept, often used in discussions to refer to a set or family of characters: for example, the Latin character set, the Devanagari character set, the Japanese character set, or the universal character set.

Most of these characters have names, but a name is not sufficient to identify a character uniquely. Moreover, in the machine world everything is represented as a number; thus, every character in the character set is assigned a number that identifies it uniquely. A character set in which every character is assigned a unique integer is called a coded character set (aka Code Page, Character Repertoire, or even simply Character Set), and the unique number assigned to a character is called its character code (aka Code Point or Code Value). A Coded Character Set is independent of any platform, operating system, or program. Some popular Coded Character Sets are Unicode, SHIFT_JIS, and ASCII.

Character Encoding and Decoding

A named algorithm that converts a character code to a sequence of code units is called a character encoding (aka Character Encoding Scheme), where a code unit is a block of bits whose size is a multiple of an octet (8 bits, casually referred to as a byte). In other words, a character encoding is an algorithm that converts a character code to octets. For example, UTF-16 is a character-encoding algorithm that uses a code unit of 2 octets.

Note that a character, when encoded, may result in one or more octets, depending on its character code and the encoding algorithm used. An encoding algorithm that generates a varying number of octets for different character codes is called a variable-length encoding scheme. For example, UTF-16 is a variable-length encoding scheme, which encodes a character code in either 16 bits or 32 bits. Encoding algorithms that always generate the same number of octets, whatever the character code, are called fixed-length encoding schemes. For example, US-ASCII is a fixed-length encoding scheme, which always encodes a character code in 7 bits.
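To make the distinction concrete, here is a small sketch, assuming Python 3 and its built-in codecs (an assumption; the article itself is language-neutral). It counts the octets produced for a few sample characters. Note that Python stores each 7-bit US-ASCII code in one full octet.

```python
# Sample characters: 'A' (U+0041), 'é' (U+00E9), and Ugaritic Letter Ho (U+10385).
samples = ["A", "\u00e9", "\U00010385"]

# UTF-16 is variable-length: 2 octets for BMP characters, 4 octets otherwise.
utf16_sizes = [len(ch.encode("utf-16-be")) for ch in samples]

# US-ASCII is fixed-length: every encodable character takes exactly one octet.
ascii_sizes = [len(ch.encode("ascii")) for ch in "Hello"]

# Decoding reverses the process and recovers the original character codes.
round_trip = "\U00010385".encode("utf-16-be").decode("utf-16-be")

print(utf16_sizes, ascii_sizes, hex(ord(round_trip)))
```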

A Character Code cannot exist on its own, but only as part of a Coded Character Set. An encoding algorithm must know the range of valid Character Codes, and which ones are illegal, so that it can encode a character code correctly. Therefore, an encoding algorithm is always associated with a Coded Character Set.

Usually, one Coded Character Set (or Code Page) is associated with one encoding algorithm. For example, the US-ASCII code page is associated with the US-ASCII encoding algorithm. But this is not a hard and fast rule. For example, the Unicode code page is associated with many encoding algorithms, such as UTF-8, UTF-16, and UTF-32.

The RFCs for the Multipurpose Internet Mail Extensions (MIME), as well as many other specifications, officially refer to the Character Encoding as Charset, which has caused some confusion with the Character Set. Note that a “Charset” is not a “Character Set”: confusing as it is, a charset means a character encoding.

The mechanism of converting a sequence of octets back to a valid character code is called character decoding.

Character Rendering and Fonts

The process of displaying the glyph of a character on the screen is called character rendering, and the software that does this is called rendering software. For example, browsers and editors are all rendering software at their core.


There are three things required to render a character – a) the rendering software b) the decoder, which is usually part of the rendering software, and c) the required fonts.

Apart from the font, the rendering software may use other parameters like – size, colour, style & effects (bold, italic, underline, strikethrough, emboss, etc.), orientation, etc depending on how sophisticated the rendering software is. But again, the primary thing required for rendering a character is the font.

Fonts

Things were different in earlier days. When there was no concept of fonts, rendering software turned bits on and off on the display screen, within the area allocated to a character, to render it. Nowadays we use fonts. A font stores the shapes of characters, either as bitmaps or, more commonly today, as scalable outlines that the rendering software converts to pixels. The rendering software draws these shapes on the display screen to render characters, which is why we now have fancier and more sophisticated representations of characters than we used to.

There are various types of fonts, for example serif, sans serif, and script. But that is a huge topic in itself, and is out of the scope of this article. For now, it is enough to understand that a font holds the drawable representation of the glyph of a character. Typically, a font file contains the glyphs of the characters of a given character set, mapped to their respective character codes.

From a broader perspective, let’s take a small example to understand the complete end-to-end story of how a text file is displayed:

1. User: Launches the rendering software (say, an editor or a browser).
2. User: Specifies the file to render, by directory location and filename. (Most often this is the File>Open task.)
3. User: Specifies the encoding that was used to create the file. (Usually in the Open dialog box itself.)
   Note: This is a very important input for the rendering software, but most of us ignore it, and the rendering software has to fall back on the system default encoding, which might not always work. Therefore, make sure you always specify the right encoding.
4. Rendering s/w: Reads the file from the user-specified location using the user-specified encoding.
5. Decoder: While reading, the decoder converts the octets in the file to character codes as per the encoding algorithm and returns them to the rendering software, where they are accumulated as a sequence of character codes.
6. Rendering s/w: Then loads the user-specified or the default font map.
7. Rendering s/w: Iterates over the accumulated character codes, looks up every character code in the loaded font map, and uses the corresponding information there to render the character.
8. Rendering s/w: While rendering, the rendering software applies any user-specified or default style & effects, depending on how sophisticated the rendering software is.
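The decode-then-look-up flow of these steps can be sketched roughly as follows. This is only an illustration in Python 3 (an assumption): `FONT_MAP`, `MISSING_GLYPH`, `decode_octets`, and `render` are hypothetical names, and a real font map holds drawable shapes rather than placeholder strings.

```python
# Hypothetical font map: character code -> glyph (placeholder strings here).
FONT_MAP = {0x48: "<glyph H>", 0x69: "<glyph i>"}
MISSING_GLYPH = "<missing glyph>"


def decode_octets(raw, encoding):
    """Steps 4-5: turn the octets of a file into a sequence of character codes."""
    return [ord(ch) for ch in raw.decode(encoding)]


def render(code_points, font_map):
    """Steps 6-7: look up each character code in the font map."""
    return [font_map.get(cp, MISSING_GLYPH) for cp in code_points]


# Octets as they might be stored in a UTF-8 file containing "Hié".
raw = "Hi\u00e9".encode("utf-8")
code_points = decode_octets(raw, "utf-8")
rendered = render(code_points, FONT_MAP)
print(code_points, rendered)
```

Because the hypothetical font map has no entry for é, the last character falls back to the missing-glyph placeholder, which is the situation a user sees when a required font is not installed.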


So we see that what gets rendered depends entirely on the font file, i.e. on which glyph the code point is associated with. For example, below is how the string “Hello World” gets rendered in two different fonts:

This raises an important point related to data exchange. When the receiving system gets a text file from some other machine, it might not be able to display the file properly:

1. If it doesn’t know what font to use for rendering, or
2. If it doesn’t have the correct fonts installed, or
3. If somebody has modified the font file just for kicks.

It would also be a problem, if the rendering software uses an encoding algorithm to read the file that is different from the one used to create that file. Since encoding algorithms are always associated with code pages, it is possible that the same character code in the two different code pages, maps to two different characters, thereby leading to garbage output. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
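A short sketch of this failure mode, assuming Python 3 and its built-in codecs (not something the article prescribes): the same octets are decoded once with the encoding that produced them and once with a different one.

```python
original = "caf\u00e9"                 # "café"

# The file is created with UTF-8: é becomes the two octets C3 A9.
stored = original.encode("utf-8")

# Reading it back with a different encoding maps each octet to the
# character that ISO-8859-1 assigns to it, producing garbage.
wrong = stored.decode("iso-8859-1")

# Reading it back with the right encoding recovers the text.
right = stored.decode("utf-8")

print(stored.hex(), repr(wrong), repr(right))
```

No error is raised in the wrong case; the text is silently corrupted, which is why the encoding must be known rather than guessed.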

Therefore, make sure that:
1. You always use or specify the right encoding when reading a file.
2. You have the necessary fonts installed on your machine.

Directionality

But that’s not all. Rendering software also has to take care of the direction of display. Most scripts have characters that run from left to right, but that is not true of all scripts. For example, the Arabic script runs from right to left, and Japanese can be written from top to bottom. The algorithm used by rendering software for display is orthogonal to the way the octets are stored and decoded to character codes. The octets are always stored in the same sequential order, regardless of script; it is therefore the rendering software that is responsible for rendering the characters with the right orientation and direction.


(Figure: Different directionality of different languages)

ASCII – 7 bits or 8 bits?

ASCII is one of the most popular code pages and encodings, so it’s worth discussing here. In the period (roughly) 1963-1967, the American National Standards Institute (ANSI) released the American Standard Code for Information Interchange (ASCII) code page, with the intention of standardizing information interchange. ASCII defined only 128 characters: 33 non-printable control characters, 52 English alphabet characters (26 upper case and 26 lower case), 10 numeric characters (0-9), and 33 symbol and punctuation characters.

Now, 128 characters can be accommodated within 7 bits. Most computers, then as now, store data in 8-bit bytes, which means 1 bit was left unused; put to use, it could accommodate an additional 128 characters. This spare bit did not go unnoticed, and computer manufacturers in various countries started using it to accommodate characters from their native languages. Thus there was a wave of national variants of ASCII, each defining its own characters using the unused bit, which defeated the entire purpose of information interchange.

Officially, ASCII was, and still is, 7 bits. The 8-bit variants that emerged by extending the ASCII character set are unofficially called Extended ASCII, and are still incorrectly referred to as ASCII. The standardized 8-bit extensions of ASCII form the ISO-8859 family (numbered from ISO-8859-1 up to ISO-8859-16). The confusion around 7 bits vs. 8 bits may have arisen partly because many ASCII control characters are now obsolete, and partly because the original ASCII specification is not freely available for anyone to check the truths and rumors about ASCII. Those interested can still purchase the original ASCII specification from ANSI for $18.
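The 7-bit limit is easy to check in practice. The sketch below assumes Python 3 and its built-in `ascii` and `iso-8859-1` codecs; `is_ascii_octet` is a hypothetical helper introduced only for this illustration.

```python
def is_ascii_octet(octet):
    """True if the octet fits in 7 bits, i.e. its eighth (high) bit is clear."""
    return octet < 0x80


text = "na\u00efve"   # "naïve": ï (U+00EF) lies outside the 128 ASCII characters

# Strict 7-bit ASCII cannot encode the text at all.
try:
    text.encode("ascii")
    ascii_encodable = True
except UnicodeEncodeError:
    ascii_encodable = False

# An 8-bit extension such as ISO-8859-1 uses the eighth bit for the extra character.
latin1 = text.encode("iso-8859-1")
high_bit_octets = [octet for octet in latin1 if not is_ascii_octet(octet)]

print(ascii_encodable, [hex(o) for o in high_bit_octets])
```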


(Figure: Complete ASCII code page with 128 chars)

Because of the huge popularity of ASCII, people started referring to an ASCII-encoded text file simply as a plain text file. This was not correct even then, and it is very wrong now. A text file cannot exist without an encoding, and in this globalized world ASCII is more or less obsolete. BTW, the preferred MIME name for ASCII is US-ASCII.

Unicode

Now that we understand the basic concepts of characters, character codes, code pages, and encodings, let’s talk about the popular Unicode standard.

What is Unicode?

Unicode is a standard, maintained by a non-profit consortium of the same name, started in the year 1988, whose objective is similar to that of ISO/IEC 10646: to have a single, standard, universal character set that addresses all the characters in this world. From Unicode’s official site: “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language”.

It is not very clear why two standards bodies pursue the same objectives, though there may be valid reasons; what is clear is that the two bodies work in collaboration with each other rather than in competition.

The primary job of Unicode is to collect all the characters from all the languages in this world, and assign a unique number to every character. In short, Unicode is a Coded Character Set. But it is not as simple as it sounds. The Unicode consortium does a great deal of work. To give you an idea, the following are some of the points that Unicode addresses:

1. Collecting all the letters, punctuation marks, etc., from all the languages in this world.
2. Assigning each character in the character set a unique code point.
3. Deciding what qualifies as an independent character and what does not. For example, should “e with an acute accent” be treated as an independent character or as a composite character sequence of e and the acute accent?
4. Deciding the shape of a character in the context of other characters. For example, when ‘a’ appears next to ‘e’, the shape of the character becomes ‘æ’.
5. Deciding the order of characters w.r.t. sorting when characters from different languages come together.
6. And many other things. For details see: http://www.unicode.org

As mentioned earlier, there is always an encoding associated with a Code Page. The Unicode Code Page is associated with not one but many encoding algorithms, or character encoding schemes. Some popular character encoding schemes associated with the Unicode code page are UTF-8, UTF-16 (LE/BE), and UTF-32 (LE/BE). Other, less popular encoding algorithms associated with Unicode are UTF-7, UTF-EBCDIC, and an upcoming CESU.

Let us also see what Unicode is not:
1. Unicode is not a fixed-length 16-bit encoding scheme.
2. Unicode is a code page and not an encoding scheme. There are encoding schemes associated with Unicode.
   Note: Look at the “Save As” dialog box of the MS Windows Notepad application. The “Encoding” drop-down there misguides you by showing the option “Unicode”. It should have been UTF-16.
3. Unicode is not a font or a repository of glyphs.
4. Unicode is not a rendering or any other kind of software.
5. Unicode does not specify the size, shape, or style of on-screen characters.
6. Unicode is not magic.

Officially, Unicode uses the U+NNNN[N]* notation to refer to the code points in the Unicode character set, where each N is a hexadecimal digit. For example, the “English Letter A”, whose code point is 65 (hexadecimal 41), is written U+0041; and the “Ugaritic Letter Ho”, whose code point is 66437 (hexadecimal 10385, greater than FFFF), is written U+10385.
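The notation is simply the code point written in hexadecimal, padded to at least four digits. A minimal sketch in Python 3 (an assumption; `u_notation` and `parse_u_notation` are hypothetical helper names):

```python
def u_notation(code_point):
    """Format an integer code point as U+NNNN, using at least four hex digits."""
    return f"U+{code_point:04X}"


def parse_u_notation(label):
    """Convert a label such as 'U+10385' back to its integer code point."""
    return int(label[2:], 16)


print(u_notation(ord("A")))        # code point 65
print(u_notation(66437))           # Ugaritic Letter Ho
print(parse_u_notation("U+10385"))
```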

You can see the character for a given Unicode character code at: http://www.unicode.org/charts/

The latest version of Unicode while writing this article is v4.0, which defines a range of characters from U+0000 to U+10FFFF, which means one needs 21-bits to represent a Unicode code point in memory, as of now.

Combining Characters

As mentioned earlier, when a combining mark (e.g. a diacritical mark) follows an independent character, it combines with that independent character. For example, when the acute accent comes right after the independent character ‘e’, the two combine to become é.


For the complete list of Unicode combining diacritical characters, see: http://www.unicode.org/charts/PDF/U0300.pdf
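As an illustration of combining, the sketch below (Python 3 with the standard `unicodedata` module, both assumptions) builds ‘e’ followed by the combining acute accent U+0301 and lets canonical composition (NFC) merge the two into the single precomposed character.

```python
import unicodedata

# Two code points: the base character 'e' and the combining acute accent.
sequence = "e\u0301"

# NFC composes a base character and its combining marks where a
# precomposed character exists in Unicode.
composed = unicodedata.normalize("NFC", sequence)

# A non-zero canonical combining class identifies a combining mark.
is_combining_mark = unicodedata.combining("\u0301") != 0

print(len(sequence), len(composed), f"U+{ord(composed):04X}", is_combining_mark)
```

Both forms should look identical when rendered, yet they compare unequal as raw strings, which is why text is usually normalized before comparison.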

What if such a combining mark comes in between two independent characters? In Unicode there is no ambiguity: a combining mark always applies to the character that precedes it. For example, in the sequence:

C1 CM C2

where C1 and C2 are two independent characters and CM is a combining mark, CM gets joined with C1. To attach the mark to C2 instead, it must be placed after C2:

C1 C2 CM

A related special character, the Zero Width Non Joiner (ZWNJ, U+200C), expresses which characters should not be joined. It applies to independent characters that the rendering software would otherwise connect or merge into a ligature:

C1 ZWNJ C2

In this case, C1 and C2 are rendered as separate, unjoined characters.

Now, if we want to combine two independent characters, provided they are combinable, we can use another special character called the Zero Width Joiner (ZWJ, U+200D). For example, when the character ‘a’ comes next to the character ‘e’, both remain independent characters and do not get joined automatically to form the ligature ‘æ’. To request that such independent characters be joined, where the font and the rendering software support it, we can do the following:

C1 ZWJ C2

On the media, everything is just bits and bytes. Therefore, it is the job of the Unicode conformant rendering software, to ensure that the appropriate characters are joined, and displayed properly.

Besides the ZWJ special character, many scripts have their own special character to facilitate the joining of two characters. For example, in the Devanagari script, the special character Halant (U+094D) is used to combine such independent characters. For example:

(Figure: Combining two independent characters using Halant)


Unicode Architecture

Let’s see the architecture of Unicode in more detail. There are 17 planes in Unicode. The primary plane is called the Basic Multilingual Plane (BMP), and the remaining 16 are called supplementary planes. A plane is simply a category that groups a range of code points. For example, Plane 2 (aka the Supplementary Ideographic Plane) has code points in the range U+20000 to U+2FFFF, which are used to capture rare East Asian characters.

Plane #               Range of code points    Plane Name
Plane 0               U+0000 – U+FFFF         Basic Multilingual Plane
Plane 1               U+10000 – U+1FFFF       Supplementary Multilingual Plane
Plane 2               U+20000 – U+2FFFF       Supplementary Ideographic Plane
Plane 3 to Plane 13   U+30000 – U+DFFFF       Reserved Planes
Plane 14              U+E0000 – U+EFFFF       Supplementary Special-Purpose Plane
Plane 15              U+F0000 – U+FFFFF       Supplementary Private Use Area-A
Plane 16              U+100000 – U+10FFFF     Supplementary Private Use Area-B
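Since every plane spans exactly 0x10000 code points, the plane number of a code point is its value shifted right by 16 bits. The following sketch assumes Python 3; `PLANE_NAMES`, `plane_of`, and `plane_name` are hypothetical names, and the table contents are taken from the ranges listed above.

```python
# Plane names as listed in the table; planes 3 to 13 are reserved.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}


def plane_of(code_point):
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


def plane_name(code_point):
    """Return the plane name, falling back to 'Reserved Plane'."""
    return PLANE_NAMES.get(plane_of(code_point), "Reserved Plane")


print(plane_of(0x0041), plane_name(0x0041))
print(plane_of(0x10385), plane_name(0x10385))
print((0x10FFFF).bit_length())   # bits needed for the largest code point
```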

The idea behind having such division in so-called planes is that each plane has a special meaning and contains special characters.

Plane Name                             Purpose
Basic Multilingual Plane               For most used characters. This plane covers characters from almost all the modern languages like English, Hindi, CJK, etc.
Supplementary Multilingual Plane       Used for historic scripts
Supplementary Ideographic Plane        Used for rare East Asian characters
Reserved Planes                        Unassigned and reserved by Unicode for future use
Supplementary Special-Purpose Plane    Language tag characters and some variation selection characters
Supplementary Private Use Area-A       Reserved for applications that want to use characters which are not specified by Unicode
Supplementary Private Use Area-B       Reserved for applications that want to use characters which are not specified by Unicode

One can define one’s own characters and assign them to the code points in the private use areas. To display those characters, however, one needs to create a new font file, or update an existing one, to assign the visual representation of those characters to the appropriate character codes. These private use areas are primarily used by applications to capture newly defined characters, in the CJK languages and elsewhere. For example, it is very common in Japan for people to have names that cannot always be written using the existing characters, and that need new characters to write them. The private use areas provided by Unicode are very helpful for such applications.


Let’s move on and see the three popular Unicode encoding schemes in detail.

Unicode Character Encoding Schemes

UTF-8, UTF-16, and UTF-32 are the three popular encodings defined by Unicode. Each encodes character codes from the Unicode code page to octets, and decodes the encoded octets back to valid character codes in the Unicode code page. Other, less popular encoding schemes from Unicode are CESU, UTF-EBCDIC, and UTF-7.

UTF-8

UTF-8 is a variable-length encoding algorithm with an 8-bit code unit and the following properties:

1. All possible characters in the Unicode code page can be encoded in UTF-8.
2. UTF-8 is completely backward compatible with ASCII: the first 128 characters in Unicode are exactly the same as those defined in the ASCII code page, and are encoded simply as the bytes 0x00 to 0x7F, just as in ASCII.
3. All characters beyond U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
4. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF (see the table below). This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long, and since Unicode code points end at U+10FFFF, no Unicode character needs more than four bytes.
6. The optional initial Byte Order Mark (BOM) for UTF-8 is: EF BB BF
7. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The algorithm (simplified)

Code Points                 Encoded Octets
U+00000000 – U+0000007F     0xxxxxxx
U+00000080 – U+000007FF     110xxxxx 10xxxxxx
U+00000800 – U+0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
U+00010000 – U+001FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000 – U+03FFFFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000 – U+7FFFFFFF     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

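For example, U+00E9 (é) falls in the range U+0080 – U+07FF, so it is encoded in two octets. Written as 11 bits, its code point is 00011 101001; these bits fill the x positions of the pattern 110xxxxx 10xxxxxx to give 11000011 10101001, i.e. the octets C3 A9.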
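The table can be turned into code almost line by line. The following is a minimal sketch in Python 3 (an assumption), limited to the first four rows because Unicode code points end at U+10FFFF; `utf8_encode` is a hypothetical name, and for brevity it does not reject the surrogate range U+D800 to U+DFFF as a production encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point into UTF-8 octets, following the simplified table."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point beyond U+10FFFF")


# Compare the sketch against Python's built-in UTF-8 codec.
for cp in (0x41, 0xE9, 0x20AC, 0x10385):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

print(utf8_encode(0xE9).hex())   # octets for é
```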


UTF-16

UTF-16 is a variable-length encoding algorithm with a 16-bit code unit and the following properties:

1. It uses a 16-bit fixed-width encoding to encode characters from the BMP (U+0000 to U+FFFF).
2. It uses surrogate pairs to encode characters from the Supplementary planes (i.e. characters beyond U+FFFF), where each surrogate is 16 bits, thereby consuming 32 bits to encode a character from the supplementary planes.

This is so, because originally Unicode was designed as a 16-bit, fixed-width encoding scheme, which could only encode characters up to U+FFFF. But, as Unicode character set grew, this had to be modified using the surrogate pair mechanism to accommodate the characters from Supplementary planes.

Surrogate Pairs

A surrogate pair (commonly referred to as surrogates) is a pair of Unicode code points from the Basic Multilingual Plane that together represent a character from a Supplementary Plane. In a pair, the first value is a high surrogate and the second is a low surrogate. A high surrogate is a value in the range U+D800 through U+DBFF, and a low surrogate is a value in the range U+DC00 through U+DFFF. These two ranges of code points in the BMP are reserved for surrogate pairs only, and do not represent any individual character.

For example, the character Ugaritic Letter Ho, whose code point is U+10385 (greater than U+FFFF), is represented as a surrogate pair “U+D800 U+DF85” in UTF-16, whereas, the character English Letter A, whose code point is U+0041 (less than U+FFFF), is represented as U+0041 in UTF-16.
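The pair is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two halves of 10 bits each. A minimal sketch in Python 3 (an assumption; `to_surrogates` and `from_surrogates` are hypothetical names):

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10385)       # Ugaritic Letter Ho
print(f"U+{high:04X} U+{low:04X}")
```

The result agrees with the pair quoted in the text, and with the octets that Python's built-in `utf-16-be` codec produces for the same character.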

Endian – Big or Little

Values that are one octet long have a Most Significant Bit (MSb) and a Least Significant Bit (LSb), whereas values longer than one octet (or one byte) have a Most Significant Byte (MSB) and a Least Significant Byte (LSB). Depending on the computer architecture, at a given memory address either the MSB of the value is stored first (called big-endian) or the LSB is stored first (called little-endian). Neither has a significant advantage over the other; it is all up to the computer architecture. For example, SPARC machines use the big-endian mechanism, whereas Intel machines use the little-endian mechanism to store values in memory.

In the context of UTF-16, which has a 2-byte code unit, endianness matters; it does not matter for UTF-8, because UTF-8 has a 1-byte code unit. Therefore, the UTF-16 character-encoding scheme is available in two flavours: UTF-16 BE (Big-Endian) and UTF-16 LE (Little-Endian). UTF-16 without any endianness specified (and without a BOM) is assumed to be BE.

Let us take the code point for the English Letter A – U+0041, which has the MSB as 00 and the LSB as 41. This value would be stored as “00 41” in UTF-16 BE (MSB first); and as “41 00” in UTF-16 LE (LSB first). Note that the MSbits and the LSbits are never reversed.
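The same byte orders can be observed from Java. The sketch below assumes a Java 7 or later runtime (for java.nio.charset.StandardCharsets); the class name EndianDemo and the hex helper are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {

    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // MSB first
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // prints "00 41"
        // LSB first
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // prints "41 00"
    }
}
```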

UTF-32

Any Unicode character can be represented as a single 32-bit code unit using UTF-32. This code unit corresponds directly to the Unicode scalar value, which is the code point of the abstract character in the Unicode character set. Encoding and decoding characters in UTF-32 is simpler than in UTF-16 or UTF-8, because every character occupies exactly one code unit; the downside is that UTF-32 forces you to use 32 bits for each character, when a maximum of 21 bits is ever needed. Moreover, the most common characters, those from the BMP, can be encoded in only 16 bits. Therefore, an application must choose the encoding algorithm wisely.

BOM

Unicode defines an optional signature at the beginning of a data stream (or file), so that when no encoding is specified for reading the stream, the decoder can detect from the stream itself which encoding was used to generate it, and can use that encoding to read the rest of the stream correctly. This signature is called the initial Byte Order Mark (BOM). Unicode uses the code point U+FEFF as the BOM, which is transformed as follows when encoded using the above three encoding schemes:

Bytes          Encoding Form
00 00 FE FF    UTF-32 BE
FF FE 00 00    UTF-32 LE
FE FF          UTF-16 BE
FF FE          UTF-16 LE
EF BB BF       UTF-8

(Table: initial BOM in various encoding forms)
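A decoder could use the table above roughly as follows. This is a minimal sketch, not how any particular library does it; the class name BomSniffer and its methods are hypothetical. Note that the UTF-32 LE signature must be tested before UTF-16 LE, because it starts with the same two bytes.

```java
public class BomSniffer {

    // Returns the encoding implied by a leading BOM, or null if no BOM is found.
    static String detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        // Checked before UTF-16LE: FF FE 00 00 also begins with FF FE.
        // (It could in theory be a UTF-16LE BOM followed by U+0000;
        // this sketch ignores that ambiguity.)
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF))       return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF))             return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE))             return "UTF-16LE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41 };
        System.out.println(detect(utf8WithBom));        // prints "UTF-8"
        System.out.println(detect(new byte[] { 0x41 })); // prints "null"
    }
}
```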


Unicode and XML

XML supports Unicode inherently. The first edition of the W3C XML 1.0 specification, published in early 1998, was based on Unicode 2.0, which was then the latest specification from the Unicode Consortium. The Unicode Consortium releases a new version of the Unicode specification every other year (or whenever it has accumulated and studied enough new characters); therefore, specifications that depend on Unicode also need to be updated. In late 2000, the second edition of the W3C XML 1.0 specification was released, based on Unicode 3.0. The third edition of W3C XML 1.0 was released in early 2004, and is based on Unicode 3.2. Surprisingly, W3C also released XML 1.1 the same day. Why release two different versions of XML, and on the same day at that?

Updating the XML specification every time a new version of Unicode is released is a tedious and cumbersome task. Therefore, XML 1.1 was released primarily to be backward as well as forward compatible with the Unicode characters. This compatibility with Unicode is achieved at the cost of XML 1.1 being backward incompatible with XML 1.0. Therefore, a third edition of XML 1.0 was released to accommodate the new characters from Unicode 4.0 as well as remain backward compatible with XML 1.0 second edition.

Okay, here is what’s new in XML 1.1:

1. Fully backward and forward compatible with Unicode.

2. Two end-of-line characters, NEL (0x85) and the Unicode line separator character (0x2028), have been added to the list of characters that mark the end of a line. NEL is the end-of-line character found on mainframes, but XML 1.0 does not recognize this character.

3. Control characters from 0x1 to 0x1F, most of which were not allowed in XML 1.0 (only tab, line feed, and carriage return were), are now allowed as character references. For example, a character reference to a control character such as 0x1 is valid in XML 1.1, but invalid in XML 1.0.

4. Character normalization. Those characters, which can be represented in more than one way, must be normalized so that string related operations (like comparison, etc.) work correctly. For example, a decomposable character, which also has an independent character status, must be normalized, when used in XML 1.1.

When we say XML supports Unicode, this means, the name of the elements, the name of the attributes, the values of the attributes and the character data, all can contain Unicode characters as long as they don’t break the production rules defined in the XML specification. For example:
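As a minimal sketch of this, the Java snippet below parses a small document whose element name, attribute value, and character data all contain non-ASCII characters. It assumes the JAXP parser bundled with the JDK; the document content, the class name UnicodeNamesDemo, and the helper rootOf are made up for illustration. The non-ASCII characters are written as Unicode escapes so that the source file itself stays plain ASCII.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class UnicodeNamesDemo {

    // Parses the given XML text (encoded here as UTF-8) and returns its root element.
    static Element rootOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Element name, attribute value and text all use the letter e with an acute accent
        // or the letter c with a cedilla.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<pr\u00E9nom langue=\"fran\u00E7ais\">Jos\u00E9</pr\u00E9nom>";
        Element root = rootOf(xml);
        System.out.println(root.getTagName().equals("pr\u00E9nom"));   // prints "true"
        System.out.println(root.getTextContent().equals("Jos\u00E9")); // prints "true"
    }
}
```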

Encoding of XML document

An XML document is a text file; therefore, it too is associated with an encoding. This encoding must be used whenever we read or write the XML document. The encoding of an XML document is specified using the xml declaration prolog, which should be the first line in any xml document.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

The encoding attribute, however, is optional. If it is not specified (and the document carries no BOM or external encoding information), the encoding is assumed to be UTF-8.

The encoding specified in the xml declaration prolog must be used to parse the xml document. But to fetch that encoding, we need to parse the xml document with the appropriate encoding (i.e. the encoding specified in the xml declaration prolog). No, I am not kidding, and though it may seem like one, this is not a catch-22 situation. Since the position and content of the xml declaration prolog are restricted, and every parser supports a finite number of encodings, auto-detecting the encoding of an xml document is a deterministic task: by analyzing the first few bytes of the document, the parser can determine the encoding family, read the declaration, and then use the declared encoding to read the rest of the document. For more details on auto-detecting the encoding of an XML document, see: http://www.w3.org/TR/REC-xml/#sec-guessing
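To give a feel for why this works, here is a rough sketch of the first step for a document without a BOM: every XML declaration starts with the characters "<?xm", so the first four bytes reveal the encoding family. The byte patterns follow the appendix of the XML specification linked above; the class name XmlEncodingSniffer, the method names, and the returned labels are invented for this sketch, and a real parser handles more cases.

```java
public class XmlEncodingSniffer {

    // Guesses the encoding family from the first four bytes of an XML
    // document that starts with an XML declaration and has no BOM.
    static String guessFamily(byte[] data) {
        if (matches(data, 0x00, 0x00, 0x00, 0x3C)) return "UCS-4 big-endian";
        if (matches(data, 0x3C, 0x00, 0x00, 0x00)) return "UCS-4 little-endian";
        if (matches(data, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16 big-endian";
        if (matches(data, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16 little-endian";
        // "<?xm" in UTF-8, US-ASCII, ISO-8859-x, Shift_JIS, etc. The declaration
        // can now be read as ASCII to find the exact encoding name.
        if (matches(data, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible";
        if (matches(data, 0x4C, 0x6F, 0xA7, 0x94)) return "EBCDIC";
        return "unknown";
    }

    // True if data begins with the given unsigned byte values.
    private static boolean matches(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String declaration = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
        System.out.println(guessFamily(declaration.getBytes("US-ASCII")));
        // prints "ASCII-compatible"
        System.out.println(guessFamily(declaration.getBytes("UTF-16BE")));
        // prints "UTF-16 big-endian"
    }
}
```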

Note:

1. It is an error if the encoding the xml document is stored in is different from the encoding specified in the xml declaration prolog.

2. Not all xml parsers may be able to auto-detect all the possible encodings that may be specified for an XML document, correctly.

3. This auto-detecting mechanism cannot be used to read non-xml documents because they do not have anything like the XML declaration prolog.

Most often people use an editor to create xml documents, but while saving it, they don’t care what encoding the document was actually saved in by their favourite editor. This could result in xml parsing errors, if, the encoding that the document was actually saved in, turns out to be different than the encoding specified in the xml declaration prolog of the xml document (or different than UTF-8, which is the default encoding of XML document, when no encoding is explicitly specified in the xml declaration prolog).

For example, on a windows machine when using the character e with the acute accent in an xml document, and saving it using an editor, the editor may (in most of the cases) store that document in ISO-8859-1 encoding (the Latin encoding family). Now, if no encoding were specified in the xml declaration prolog, the parser would try to parse the XML document in UTF-8, and would throw an error for the invalid character found, because the way “the letter e with an acute accent” is stored in UTF-8 is different from the way it is stored in ISO-8859-1.
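The difference between the two byte sequences is easy to see from Java. The sketch below assumes Java 7 or later; the class name AcuteEDemo and the helper isWellFormedUtf8 are hypothetical names for this illustration. It uses a strict decoder that reports malformed input instead of silently replacing it, which is roughly what an XML parser does when it throws its error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AcuteEDemo {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // the letter e with an acute accent

        byte[] latin1 = eAcute.getBytes(StandardCharsets.ISO_8859_1); // single byte E9
        byte[] utf8   = eAcute.getBytes(StandardCharsets.UTF_8);      // two bytes C3 A9

        System.out.println(latin1.length + " vs " + utf8.length); // prints "1 vs 2"
        System.out.println(isWellFormedUtf8(latin1));             // prints "false"
        System.out.println(isWellFormedUtf8(utf8));               // prints "true"
    }
}
```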

Now, instead of just complaining that “XML doesn’t recognize my character” or “XML doesn’t support my character”, you can go ahead and fix the problem by making sure that your XML document is saved in the encoding specified in the XML declaration prolog. Always specify the encoding of the document you are saving (whether it is an xml document or a non-xml document) in the “Save” dialog box (or whatever it is), and if your favourite editor doesn’t allow you to do that, just dump it and go for another one that does.

If you cannot key in the character you want to use in an XML document directly from the keyboard, you can use a character reference for that character. A character reference is written as &#D; or &#xH;, where D is a decimal number and H is a hexadecimal number. For example, to use the character A, whose Unicode code point is 65, one can write the character reference &#65; or &#x41;, both of which mean the same thing.

Best Practices

1. When creating XML documents, it is always a good practice to explicitly specify the encoding of the xml document in the xml declaration prolog.

2. Make sure the document is actually stored in the encoding specified in the xml declaration prolog, or in UTF-8 (which is the default encoding of xml, when not specified explicitly using the xml declaration prolog).

Globalization

The art of making software that can run on different platforms and be used by people from different geographical locations, different cultural backgrounds, and different languages is called Globalization. In a nutshell, it’s the art of making software for the global industry.

(Figure: Globalization = Internationalization + Localization + Translation)

Internationalization

The art of making software independent of the underlying system defaults is called Internationalization (aka I18N). Every system has defaults like encoding and locale. Any piece of software that unknowingly relies on these defaults may not work correctly when ported from one system to another, because system defaults vary from one system to another; such software is not internationalized.

A Scenario

Hattori Hanso creates a text file on a Japanese operating system using an editor and saves it to the disk. Most of the time people are not aware of, or just don’t care about, the encoding that the editor used to save their file. That’s a bad habit. Just like the name of the file, one must always know the encoding of a text file. Anyway, let’s say the file was saved in the “Shift_JIS” encoding, which also happens to be the default encoding of that Japanese operating system.

He then goes ahead and writes a program to read that text file and display its contents. When writing the program he doesn’t specify any encoding to read that text file. This is extremely dangerous, and such a program is highly inflammable. Never, ever, do this, and we’ll see why in a while. He then executes that program on the same machine and everything works perfectly fine -- just as expected.

Now, he FTPs that text file and the same program over to his friend Antonio Tourino’s machine, which has an English operating system. He executes the same program again on the English operating system. The program runs, but boom – this time he sees garbage. Where did those Japanese characters go? What do you say? – “I don’t know, but it works perfectly fine on my machine!”.

The Problem

The problem is that when writing the program, the programmer relied on the source operating system’s default encoding (Shift_JIS), which turned out to be different from the target operating system’s default encoding (ISO-8859-1). Note that the default encodings of the source and target systems differ more often than not. When the same program was executed on the English OS, the default encoding it used was ISO-8859-1, and using ISO-8859-1 to read a Shift_JIS encoded file resulted in absolute garbage.
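The effect can be reproduced on a single machine by decoding the same bytes with two different encodings. The sketch below assumes that the Java runtime ships the Shift_JIS charset (most full JDK/JRE distributions do, but it is not one of the guaranteed charsets); the sample text, the class name MojibakeDemo, and the helper decode are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Interprets the bytes using the given encoding.
    static String decode(byte[] fileBytes, Charset encoding) {
        return new String(fileBytes, encoding);
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this runtime.
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Three Japanese characters, written as escapes to keep the source ASCII.
        String original = "\u65E5\u672C\u8A9E";
        byte[] fileBytes = original.getBytes(shiftJis); // what the editor saved

        String onJapaneseOs = decode(fileBytes, shiftJis);
        String onEnglishOs  = decode(fileBytes, StandardCharsets.ISO_8859_1);

        System.out.println(onJapaneseOs.equals(original)); // prints "true"
        System.out.println(onEnglishOs.equals(original));  // prints "false"
        // Each 2-byte Shift_JIS character was misread as two Latin-1 characters.
        System.out.println(onEnglishOs.length());          // prints "6"
    }
}
```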

The Solution

The solution is that, just as the program requires the name of the file to read as an input, it must also always ask for the encoding the file was created in as an input. After this fix, the program becomes internationalized, but you may still not see those Japanese characters when executing it on the English OS, because that OS might not have the required fonts for displaying the Japanese characters. But that’s a different story, which we will look into in a short while.

Localization

The art of making software adapt to the underlying locale is called Localization (aka L10N). Software must rely on the locale defaults of the system to display text messages, error messages, etc. in the locale it is running on, so that it makes sense to the person using the software.

What’s a locale?

A locale represents a language. Therefore, adapting to a locale means the software should be able to display output or take input in a language-specific manner. For example, the French language is a locale. But French as spoken in France differs from French as spoken in Canada. So, we augment our definition of locale to say that a locale represents a specific language of a specific country. Again, a language spoken in a specific country can have variations; for example, ancient traditional Chinese spoken in China vs. simplified Chinese spoken in China. So, we augment our definition of locale again, to say that a locale represents a specific language of a specific country with variants.

When we write software that adapts to the underlying locale, we are actually addressing the community that is going to use this software. Therefore, locale-specific things like text messages, exception messages, the date and time, currency, etc. should be displayed in such a way that the user of the software understands what gets displayed (i.e. the output of the software). There is a community of users that localization tries to address, because we don’t want to rewrite the entire software for every language, and every flavour of that language.

This involves writing the program, once, in such a way that every time it is executed, it uses the default locale of the system it is executed on, to display the locale-sensitive information.

Translation

Translation means translating all the text messages, error messages, etc. that are displayed to the user from the language they are written in to the various languages the software wants to support.

For example, when a software is written in English, all such text messages (initially written in English) that will be displayed to the user are first identified and separated out from the code in another file. Each of these messages is assigned a unique key, which the software would use to display the message identified by that key. Then, all these messages are manually translated by language experts in various other languages, the software wants to support. Separating out such text messages from the code makes it easy to manage and translate messages. It also helps the language experts to translate these files to the various other languages independent of the software code.
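The idea of looking messages up by key can be sketched with Java's ResourceBundle classes. To keep the example self-contained, the two message catalogs below are built from in-memory strings; in a real application each catalog would normally live in its own properties file and be loaded with ResourceBundle.getBundle. The key name, the message texts, and the class name MessageKeysDemo are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class MessageKeysDemo {

    // Hypothetical message catalogs, one per language code.
    private static final Map<String, ResourceBundle> CATALOGS = new HashMap<>();

    static {
        try {
            CATALOGS.put("en", new PropertyResourceBundle(
                    new StringReader("file.not.found=File not found\n")));
            CATALOGS.put("fr", new PropertyResourceBundle(
                    new StringReader("file.not.found=Fichier introuvable\n")));
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Looks up the message for a key, falling back to English
    // when no catalog exists for the requested language.
    static String message(String language, String key) {
        ResourceBundle catalog = CATALOGS.getOrDefault(language, CATALOGS.get("en"));
        return catalog.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(message("en", "file.not.found")); // prints "File not found"
        System.out.println(message("fr", "file.not.found")); // prints "Fichier introuvable"
    }
}
```

The code only ever refers to the key, so translators can work on the catalogs without touching the program.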

Note that sentences cannot be translated from one language (or locale to be specific) to another, automatically. This is because sentences are associated with grammar. Even if we try to automate that, the translated sentences might not be grammatically correct, and at times might not make any sense at all.

Therefore, somebody has to sit down and translate all these messages, etc. to different locales. In the IT industry this field is called Native Language Support (NLS), and it primarily comprises language experts.

Unicode and Java

The Java programming language has supported Unicode inherently since the beginning. This means that all the characters and Strings used in a Java program are Unicode. The primitive data type char, used to represent a character in Java, is an unsigned 16-bit integer that can represent any Unicode code point in the range U+0000 to U+FFFF. When Unicode v1.0 was released it did not have that many characters, but Java had that support from the beginning. Even now, after so many years, J2SE 1.4 (code named: Merlin), which supports Unicode 3.0, can easily accommodate all the characters defined by Unicode 3.0 in the 16 bits of the Java char datatype.

In Java, a character or should I say a Unicode character can be represented as:

char ch = 'A';

OR by directly using the Unicode code point as:

char ch = 65;

…and both mean the same thing. A third way of representing characters in Java, is by using the “\u” escape sequence. Well, first of all, why have a third way at all? This was primarily for the ease of use to directly support Unicode characters in Java Strings, rather than all the time create a String from char[].

The “\u” escape sequence follows the form \uNNNN, where each N is a hexadecimal digit. Therefore, the character ‘A’, whose Unicode code point is 65, can be represented in Java using the “\u” escape sequence as:

char ch = '\u0041';

This escape sequence is very useful when creating Strings. For example:

String str = "h\u00E9llo world";

which is equivalent to:

String str = "héllo world";

Since Unicode keeps growing and theoretically puts no constraint on the number of characters it defines, Java would be in trouble when the number of characters grows beyond U+FFFF. This is because the char datatype in Java is only 16 bits wide and cannot accommodate values beyond U+FFFF. Here is the interesting part: Unicode 4.0 is now available, and it defines characters in the range U+0000 to U+10FFFF. This means Java must do something if it is to support those additional characters (beyond U+FFFF) from Unicode 4.0, which are called supplementary characters. Therefore, JSR-204 was filed for supporting these supplementary characters in J2SE, and J2SE 1.5 (code named: Tiger) supports Unicode 4.0 using the mechanism defined in JSR-204. In a nutshell, supplementary characters are supported in Java 1.5 using the surrogate pair mechanism of UTF-16. Read the article “Supplementary Characters in the Java Platform” for more details.

To represent characters from supplementary planes, or code points beyond U+FFFF, a pair of escape sequences of the form “\uXXXX\uYYYY”, called a surrogate pair, is used. Refer to the UTF-16 section earlier in this article for details on surrogate pairs.

For example, to represent the character Ugaritic Letter Ho, whose code point is U+10385 (greater than U+FFFF), in Java, one should use the surrogate pair “\uD800\uDF85”. Note that using the escape sequence “\u10385” is incorrect in Java.

Java 1.5 provides APIs to get these surrogate pairs, for a given code point, without having the developers worry about how are they generated:

int unicodeCodePoint = Integer.parseInt(args[0]);
char[] surrogates = Character.toChars(unicodeCodePoint);
if (surrogates.length > 1) {
    System.out.println("High=U+" + Integer.toHexString(surrogates[0]));
    System.out.println("Low =U+" + Integer.toHexString(surrogates[1]));
} else {
    System.out.println("Value=" + surrogates[0]);
}
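Going the other way, a String that contains a surrogate pair holds more char values than characters. The short sketch below uses the code point methods added to String in Java 1.5 to show the difference; the class name CodePointDemo and the helper countCodePoints are invented for this illustration.

```java
public class CodePointDemo {

    // Counts Unicode code points (characters), not 16-bit char values.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by the surrogate pair for Ugaritic Letter Ho (U+10385)
        String s = "A\uD800\uDF85";

        System.out.println(s.length());                           // prints "3"
        System.out.println(countCodePoints(s));                   // prints "2"
        System.out.println(Integer.toHexString(s.codePointAt(1))); // prints "10385"
    }
}
```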

I completely agree, that this is not a great way to handle supplementary characters in Java. It would have been great for developers, had the supplementary characters been supported in a way, something like: “\u{20000}”. This would have made lives much easier, but nothing much can be done about that now. Anyways, this is not too bad either, or is it?

Internationalization in Java

Bytes and Strings

Whenever we convert a String to bytes, or create a String from bytes, there is an encoding involved. Without an encoding this conversion is not possible at all. Yes, Java provides APIs that can do this conversion without a user-specified encoding:

public String(byte[] bytes)
public byte[] getBytes()

Java shouldn’t have done this, because it gives the impression that no encoding is required in these conversions. In fact, the system default or the JVM default encoding is used to do the conversion. Most of the time, many of us use these APIs to convert Strings to bytes and vice-versa without specifying an encoding, and without realizing the impact of this. The impact is huge: the code is no longer portable, which means you might get different and incorrect results when such a program is executed on different machines, because the default encoding varies from one machine to another.

Never do this. Instead, use the other overloaded APIs where one can specify the encoding explicitly when doing such conversions:

public String(byte[] bytes, String charsetName)
public byte[] getBytes(String charsetName)

Remember that once the conversion is done, there is no encoding information associated with the bytes or the characters. Bytes are bytes, and characters are characters. Period. If you want to do conversion, you must specify a third piece of information – encoding.
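A minimal sketch of doing both conversions with an explicit encoding is shown below. It uses the Charset-based overloads and the StandardCharsets constants, which are an assumption about the runtime (they exist in Java 7 and later) and avoid the checked exception thrown by the String-name overloads listed above. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Characters to bytes: the encoding is stated explicitly.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes to characters: the same encoding must be supplied again,
    // because the byte array itself carries no encoding information.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "h\u00E9llo world";
        byte[] bytes = encode(text);

        System.out.println(bytes.length);               // prints "12"
        System.out.println(decode(bytes).equals(text)); // prints "true"
    }
}
```

Because nothing here depends on the default encoding, the program gives the same result on every machine.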

Input and Output (I/O)

Java allows you to read/write streaming data from/to external (as well as internal) data sources using the APIs in the java.io package. Broadly, the java.io package provides APIs to access the data as a stream of bytes or as a stream of characters. The java.io.InputStream and java.io.OutputStream classes (including all their subclasses) allow you to read/write data as bytes, whereas the java.io.Reader and java.io.Writer classes (including all their subclasses) allow you to read/write data as characters. Based on the type of data source and the processing required, one can choose an appropriate class to do the I/O.

Byte Streams

There is no encoding involved when using byte streams to do IO. A file, for example, is stored as a sequence of bits on the file system. When we use a byte stream to read a file (e.g. java.io.FileInputStream), 8 bits (or 1 byte, or 1 octet) are read at a time and returned to the application as they are. The IO classes supporting byte streams do not look into the bytes read, or do any sort of processing with them. It is the responsibility of the application to process the raw bytes in whatever way it wants.

Char Streams

Character streams are byte streams plus encoding. A character stream cannot function without an encoding. IO classes supporting character streams first read the appropriate number of bytes from the underlying byte stream (depending on the encoding), then process the content of the bytes read (again depending on the encoding) and convert them to a code point. This code point, which represents a character in Unicode, is then returned to the application.

From the characters, we cannot always say what was the exact sequence of bytes in the input stream. Yes, given an encoding, we can convert the characters back to bytes, but this sequence of bytes may not be always exactly the same as the original sequence of bytes.
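One way to see this is to decode a byte sequence that is not valid in the chosen encoding and then encode the resulting characters again. In the sketch below (Java 7 or later assumed; class and method names are hypothetical), the decoder substitutes a replacement character for the byte it cannot interpret, so the original byte is lost for good.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {

    // Decodes the bytes as UTF-8 and immediately encodes the characters back to UTF-8.
    static byte[] roundTrip(byte[] original) {
        String characters = new String(original, StandardCharsets.UTF_8);
        return characters.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A' followed by a lone E9 byte, which is not valid UTF-8 on its own
        byte[] original = { 0x41, (byte) 0xE9 };
        byte[] result = roundTrip(original);

        System.out.println(Arrays.equals(original, result)); // prints "false"
    }
}
```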

Even though Java allows, we must never do this:

public InputStreamReader(InputStream in)


The above uses the system default or the JVM default encoding to read the file. This means, if you move the program to some other machine, which has a different encoding, then this program might not work at all. Instead, one should always use:

public InputStreamReader(InputStream in, String charsetName)

Below are two flavours of a small program to read a file using the appropriate encoding (i.e. the encoding which was used to create the file), and then to write it back using any user specified encoding.

Listing 1:

import java.io.*;
...
//create a reader to read the input file
//specifying the correct encoding of the input file
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream(file_in), enc_in));

//create a writer to write the output file
//in any encoding of choice
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(file_out), enc_out));

//read from the input file and write to the output file
char[] buffer = new char[BUFFER_SIZE];
int charsRead = -1;
while ((charsRead = reader.read(buffer)) != -1) {
    writer.write(buffer, 0, charsRead);
}

//close the streams so that the buffered output is flushed to the file
reader.close();
writer.close();

The above program doesn’t have to do anything except create the readers and writers correctly. The encoding conversion is done automatically by the IO classes.

Listing 2:

import java.io.*;
...
//create a reader to read the input file
//specifying the correct encoding of the input file
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream(file_in), enc_in));

//create an output stream to write to the output file
//without any encoding conversion
FileOutputStream fos = new FileOutputStream(file_out);

//read from the input file and write to the output file
char[] in_buffer = new char[BUFFER_SIZE];
byte[] out_buffer = null;
int charsRead = -1;
String str = null;
while ((charsRead = reader.read(in_buffer)) != -1) {
    str = new String(in_buffer, 0, charsRead);
    //do the conversion
    out_buffer = str.getBytes(enc_out);
    fos.write(out_buffer);
}

//close the streams
reader.close();
fos.close();

The above program explicitly does the conversion using the encoding before writing the bytes to the output file.

Note: The default encoding of the JVM is the same as the default encoding of the system, unless it is changed explicitly as follows:

java -Dfile.encoding=UTF-8 mypack.myapp

Java vs. IANA encoding names

For whatever reason, Java defines its own names for many of the encodings it supports in the java.lang and java.io packages, rather than using the standard names registered with the Internet Assigned Numbers Authority (IANA). The list of IANA registered encoding names and their aliases can be found here. Although some of the encoding names supported by Java are also registered with IANA, this is not true for all of them. For example, “UTF8” (without the hyphen) is one such encoding name supported by Java which is not registered with IANA. The complete list of the encoding names supported by Java can be found here.

It is always better to use standard encoding names rather than Java encoding names for interoperability reasons. For example, when a Java program is written to generate an XML file, the Java program cannot use the Java encoding name in the XML declaration prolog of the output XML. This is because XML supports only standard encoding names registered with IANA in the xml declaration prolog. It would be for the good of developers, if Java provides a map implementation of Java encoding names to IANA encoding names and vice-versa for such interoperability reasons.

Java modified UTF-8

Talking about UTF-8, the implementation that supports the Unicode standard UTF-8 in Java is a slightly modified version of the standard UTF-8. The Java version is called modified UTF-8 (or UTF8). Modified UTF-8 differs from the standard UTF-8 as follows:

1. The null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string.

2. The way characters outside the BMP are encoded. Standard UTF-8 doesn’t differentiate, when encoding characters from BMP or Supplementary planes. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence for backward compatibility reasons.

3. When decoding a UTF-8 stream with an initial BOM, the Java implementation reads the BOM as yet another character. This BOM is supposed to be verified and then skipped by the decoder before it actually reads the data. A bug (#4508058) has been filed for this at http://bugs.sun.com, and is supposed to be fixed in Java 1.6 (code named: Mustang).
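The first two differences can be observed with java.io.DataOutputStream.writeUTF, which writes modified UTF-8 preceded by a two-byte length. The sketch below compares the byte counts with those of standard UTF-8; it assumes Java 7 or later for StandardCharsets, and the class name ModifiedUtf8Demo and the helper modifiedUtf8 are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Encodes the text with writeUTF and strips the 2-byte length prefix,
    // leaving only the modified UTF-8 bytes.
    static byte[] modifiedUtf8(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(text);
            out.flush();
            byte[] withLength = buffer.toByteArray();
            return Arrays.copyOfRange(withLength, 2, withLength.length);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String nullChar = String.valueOf((char) 0); // U+0000
        String ho = "\uD800\uDF85";                 // Ugaritic Letter Ho, U+10385

        // Standard UTF-8 uses 1 byte for U+0000, modified UTF-8 uses 2.
        System.out.println(nullChar.getBytes(StandardCharsets.UTF_8).length
                + " vs " + modifiedUtf8(nullChar).length); // prints "1 vs 2"

        // Standard UTF-8 uses 4 bytes for U+10385; modified UTF-8 encodes
        // each surrogate separately in 3 bytes, giving 6.
        System.out.println(ho.getBytes(StandardCharsets.UTF_8).length
                + " vs " + modifiedUtf8(ho).length);       // prints "4 vs 6"
    }
}
```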

Localization in Java

Locale

As per the Javadoc, “A Locale object represents a specific geographical, political, or cultural region. An operation that requires a Locale to perform its task is called locale-sensitive and uses the Locale to tailor information for the user. For example, displaying a number is a locale-sensitive operation--the number should be formatted according to the customs/conventions of the user's native country, region, or culture”.

To perform a locale-sensitive operation, a locale object needs to be created first. Java provides pre-created Locales for ease of use. But, one could always explicitly create a locale by specifying either a language code, or a language code and a country code. See the Javadoc for java.util.Locale for more details.
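As a small sketch of the options just described, the snippet below obtains a locale from a pre-created constant and builds equivalent ones from language and country codes. It uses Locale.forLanguageTag, which is an assumption about the runtime (Java 7 or later); the Locale constructors mentioned above work as well on older versions. The class name LocaleDemo and the helper describe are hypothetical.

```java
import java.util.Locale;

public class LocaleDemo {

    // Returns "language_COUNTRY", or just "language" when no country is set.
    static String describe(Locale locale) {
        return locale.getCountry().isEmpty()
                ? locale.getLanguage()
                : locale.getLanguage() + "_" + locale.getCountry();
    }

    public static void main(String[] args) {
        Locale preCreated   = Locale.CANADA_FRENCH;           // pre-created constant
        Locale fromCodes    = Locale.forLanguageTag("fr-CA"); // language + country
        Locale languageOnly = Locale.forLanguageTag("fr");    // language only

        System.out.println(describe(preCreated));         // prints "fr_CA"
        System.out.println(preCreated.equals(fromCodes)); // prints "true"
        System.out.println(describe(languageOnly));       // prints "fr"
    }
}
```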

Listing 3: This listing displays the current system date/time in a specific locale using a pre-defined date format.

import java.util.*;
import java.text.*;
...

//current date
Date date = new Date();

print("Using DateFormat with pre-defined format to format date/time...");
DateFormat df_jp = DateFormat.getDateTimeInstance(
    DateFormat.FULL, DateFormat.FULL, Locale.JAPAN);

DateFormat df_de = DateFormat.getDateTimeInstance(
    DateFormat.FULL, DateFormat.FULL, Locale.GERMANY);

DateFormat df_en = DateFormat.getDateTimeInstance(
    DateFormat.FULL, DateFormat.FULL, Locale.US);

//format and print the localized date/time
print("Date/Time in Japan = " + df_jp.format(date));
print("Date/Time in Germany = " + df_de.format(date));
print("Date/Time in US = " + df_en.format(date));

Output:

Using DateFormat with pre-defined format to format date/time...

Listing 4: This listing displays the current system date/time in a specific locale using a custom date format.

import java.util.*;
import java.text.*;
...

//current date
Date date = new Date();

print("Using SimpleDateFormat with custom format to format date/time...");
SimpleDateFormat sdf_jp = (SimpleDateFormat)DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL, Locale.JAPAN);
SimpleDateFormat sdf_de = (SimpleDateFormat)DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL, Locale.GERMANY);
SimpleDateFormat sdf_en = (SimpleDateFormat)DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL, Locale.US);

//override the pre-defined pattern with a custom pattern
sdf_jp.applyPattern("EEE, d MMM yyyy HH:mm:ss zzzz");
sdf_de.applyPattern("EEE, d MMM yyyy HH:mm:ss zzzz");
sdf_en.applyPattern("EEE, d MMM yyyy HH:mm:ss zzzz");

//format and print the localized date/time
print("Date/Time in Japan = "+sdf_jp.format(date));
print("Date/Time in Germany = "+sdf_de.format(date));
print("Date/Time in US = "+sdf_en.format(date));

Output:
Using SimpleDateFormat with custom format to format date/time...
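A custom pattern can also be applied without the cast. The DateFormat documentation advises guarding that cast, because the factory methods are not guaranteed to return a SimpleDateFormat for every locale. The following is a minimal sketch of the alternative, not a replacement for the listing: the class name CustomPatternSketch, the helper format(), and the choice of UTC and of the epoch as the sample date are assumptions made for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical class name; not part of the original listings.
class CustomPatternSketch {

    // Formats a date with a custom pattern for the given locale.
    // The formatter is pinned to UTC so the result does not depend
    // on the time zone of the machine running the code.
    static String format(Date date, String pattern, Locale locale) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, locale);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        return sdf.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1 January 1970, 00:00:00 UTC
        String pattern = "EEE, d MMM yyyy HH:mm:ss";
        for (Locale locale : new Locale[] { Locale.JAPAN, Locale.GERMANY, Locale.US }) {
            System.out.println(locale + " -> " + format(epoch, pattern, locale));
        }
    }
}
```

The pattern letters stay the same for every locale; only the day and month names change. The exact abbreviations printed for a locale can differ between JDK versions, because they come from the locale data shipped with the runtime.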


Listing 5: This listing parses a locale-specific string representing a date/time back into a Java Date object.

//current date
Date date = new Date();

DateFormat df_jp = DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL, Locale.JAPAN);

String localeSpecificDateStr = df_jp.format(date);
//parse() throws java.text.ParseException if the text does not match the format
Date newDate = df_jp.parse(localeSpecificDateStr);
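The round trip in Listing 5 can be sketched as a self-contained program that also handles the checked ParseException. This is an illustration under stated assumptions: the class name DateRoundTripSketch and its helpers are hypothetical, the MEDIUM style is used instead of FULL to keep time zone names out of the text, and the formatter is pinned to UTC.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical class name; not part of the original listings.
class DateRoundTripSketch {

    // Builds a MEDIUM date/time formatter for the locale, pinned to UTC.
    static DateFormat formatterFor(Locale locale) {
        DateFormat df = DateFormat.getDateTimeInstance(DateFormat.MEDIUM, DateFormat.MEDIUM, locale);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df;
    }

    // Formats the date for the locale and parses the text back.
    // The checked ParseException is wrapped so callers need no try block.
    static Date roundTrip(Date date, Locale locale) {
        DateFormat df = formatterFor(locale);
        String text = df.format(date);
        try {
            return df.parse(text);
        } catch (ParseException e) {
            throw new IllegalStateException("Could not parse: " + text, e);
        }
    }

    // Returns true when the locale's formatter can parse the text.
    static boolean isParsable(String text, Locale locale) {
        try {
            formatterFor(locale).parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A value with no millisecond part, because the text form drops milliseconds.
        Date original = new Date(1_000_000_000_000L);
        System.out.println(roundTrip(original, Locale.GERMANY).equals(original));
        System.out.println(isParsable("not a date", Locale.GERMANY));
    }
}
```

Note that the formatted text does not carry milliseconds, so a date created with new Date() will usually differ from its round-tripped copy by a fraction of a second.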

Listing 6: Displays a currency value in different locales.

import java.util.*;
import java.text.*;
...

//formatting numbers and currencies
long value = 123456789;

//Formatting currency using NumberFormat
print("Using NumberFormat to format Currency...");
NumberFormat nf_curr_jp = NumberFormat.getCurrencyInstance(Locale.JAPAN);
NumberFormat nf_curr_de = NumberFormat.getCurrencyInstance(Locale.GERMANY);
NumberFormat nf_curr_en = NumberFormat.getCurrencyInstance(Locale.US);

print("Currency in Japan = "+nf_curr_jp.format(value));
print("Currency in Germany = "+nf_curr_de.format(value));
print("Currency in US = "+nf_curr_en.format(value));

Output (note the position of the commas and dots):
Using NumberFormat to format Currency...
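Part of what differs between the three currency outputs comes from the currency itself, not only from the separators. The sketch below looks up the ISO 4217 code and the usual number of decimal places for each locale's currency. The class name CurrencySketch and its helper methods are assumptions for illustration.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Hypothetical class name; not part of the original listings.
class CurrencySketch {

    // ISO 4217 code of the currency used in the locale's country, e.g. "EUR".
    static String currencyCode(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    // Number of decimal places the currency normally uses.
    static int fractionDigits(Locale locale) {
        return Currency.getInstance(locale).getDefaultFractionDigits();
    }

    public static void main(String[] args) {
        long value = 123456789;
        for (Locale locale : new Locale[] { Locale.JAPAN, Locale.GERMANY, Locale.US }) {
            NumberFormat nf = NumberFormat.getCurrencyInstance(locale);
            System.out.println(locale + ": " + currencyCode(locale)
                    + ", " + fractionDigits(locale) + " fraction digits, "
                    + nf.format(value));
        }
    }
}
```

The yen has no minor unit, so the Japanese output carries no decimal places, while the euro and the dollar are shown with two. The formatter only changes how the number is presented; it does not convert the amount between currencies.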

Listing 7: Displays a numeric value in different locales.

import java.util.*;
import java.text.*;
...

//formatting numbers and currencies
long value = 123456789;


//NumberFormat
print("Using NumberFormat to format Numbers...");
NumberFormat nf_jp = NumberFormat.getNumberInstance(Locale.JAPAN);
NumberFormat nf_de = NumberFormat.getNumberInstance(Locale.GERMANY);
NumberFormat nf_en = NumberFormat.getNumberInstance(Locale.US);

print("Number in Japan = "+nf_jp.format(value));
print("Number in Germany = "+nf_de.format(value));
print("Number in US = "+nf_en.format(value));

Output (note the position of the commas and dots):
Using NumberFormat to format Numbers...
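The swap of comma and dot matters in both directions: the same NumberFormat that formats a value is also needed to read locale-formatted text back. A minimal sketch, assuming a hypothetical class NumberSketch with helpers format() and parse():

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical class name; not part of the original listings.
class NumberSketch {

    // Formats a whole number using the locale's grouping separator.
    static String format(long value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Parses locale-formatted text into a double.
    // The checked ParseException is wrapped in an unchecked exception.
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        long value = 123456789;
        System.out.println("Germany: " + format(value, Locale.GERMANY)); // 123.456.789
        System.out.println("US:      " + format(value, Locale.US));      // 123,456,789
        // The same quantity, written the German way and the US way.
        System.out.println(parse("1.234,5", Locale.GERMANY));
        System.out.println(parse("1,234.5", Locale.US));
    }
}
```

Parsing user input with the wrong locale does not necessarily fail; it can silently produce a different number, which is why the locale should always be passed explicitly when reading such text.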

Resource Bundles

So far we have seen that date, time, currency, and numeric values can be formatted automatically for a specific locale.

But what about the following pieces of code, when they are executed in different locales?

String str = "Hello";
JLabel label = new JLabel();
label.setText(str);

OR

throw new Exception("run dude, run");

At first glance the code looks fine, because we rarely stop to consider what happens when the program is executed in a different locale. For example, what would be displayed when the above program is executed in a Japanese locale? Shouldn’t the string “Hello” be displayed in Japanese? Yes, it should. But would it be? No. Sentences cannot be translated automatically because, as mentioned earlier, translating a sentence for a specific locale requires that the grammar of the translated sentence be correct, and that takes a human translator.

A resource bundle is a collection of resource files. Each resource file contains resources (e.g. display messages, exception messages, etc.) in one language, and every resource in the file is identified by a unique key. A message may be either static or parameterized. Resource bundles are used to display these messages in various languages.


The way a resource bundle works is simple. Here are the steps to follow when using a resource bundle:

Design Time

1. Identify all the displayable string literals in your code that need to be localized, remove them from the code, and write each one to a text file under a unique key. For example:
MSG_1 = Hello.

2. Parameterize the messages where required. For example:
MSG_2 = May I speak to {0}.

3. The NLS (National Language Support) group then translates this single file into the other languages. The convention for naming these resource files is <file name>_<lang>_<country>, for example MyMessages_ja_JP.properties. This makes it easy to tell which language a resource file represents just by looking at its file name.

4. The developers then refer to the keys defined in the resource files from their code wherever a locale-sensitive message is needed.

Runtime

1. The program loads the appropriate resource file based on the locale of the JVM, and then uses the key to fetch the correct localized message.

2. If the message is parameterized, its placeholders must be replaced with the appropriate values before the message is used.

3. The message is now ready to be used, and is returned to be displayed in the UI, in an exception message, or anywhere else.
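The design-time and runtime steps above can be sketched without any files on disk. In this illustration the resource files are held in memory as strings and read with PropertyResourceBundle. The class name BundleStepsSketch, the FILES map, the helpers load() and message(), and the German wording are all assumptions made for the example. Real code would normally let ResourceBundle.getBundle() locate the files, and its search strategy is richer than the simple language lookup used here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical class name; an in-memory stand-in for resource files on disk.
class BundleStepsSketch {

    // Design time: one "file" per language, keyed by language code.
    // The empty key plays the role of the default, untranslated file.
    private static final Map<String, String> FILES = new HashMap<>();
    static {
        FILES.put("", "MSG_1 = Hello.\nMSG_2 = May I speak to {0}.\n");
        FILES.put("de", "MSG_1 = Hallo.\nMSG_2 = Kann ich bitte mit {0} sprechen?\n");
    }

    // Runtime step 1: pick the resource file for the locale,
    // falling back to the default file when no translation exists.
    static ResourceBundle load(Locale locale) {
        String text = FILES.getOrDefault(locale.getLanguage(), FILES.get(""));
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runtime steps 2 and 3: fetch the message by key, substitute
    // the parameters, and return the text that is ready for display.
    static String message(Locale locale, String key, Object... args) {
        String pattern = load(locale).getString(key);
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        System.out.println(message(Locale.GERMANY, "MSG_1"));
        System.out.println(message(Locale.US, "MSG_2", "Rahul"));
        System.out.println(message(Locale.JAPAN, "MSG_1")); // no Japanese file, so the default is used
    }
}
```

The calling code only ever mentions the key and the locale. Adding a new language means adding one more resource file, with no change to the program logic.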

Listing 8: Using resource bundle

import java.util.*;
import java.text.*;
...
//resource bundle file name
String resourceBundleBaseFileName = "MyMessages";

//locale to use
Locale myLocale = Locale.US;

//get the correct resource bundle using the appropriate locale
//See the javadoc of getBundle(), for a complete description
//of the search and instantiation strategy.
ResourceBundle localizedMessages = ResourceBundle.getBundle(resourceBundleBaseFileName, myLocale);

//get the localized message
String localizedMessage = localizedMessages.getString("MyMessage");

//create parameter values for the localized message
Object[] messageArguments = { "Rahul", new Date(), "Pizza Hut", new Date() };


//substitute the parameters, if any, in the localized message
//with the parameter values created above
MessageFormat formatter = new MessageFormat(localizedMessage, myLocale);
String finalMessage = formatter.format(messageArguments);

//This final localized string now can be used wherever required.
//For example as exception messages or messages in UI, etc.
System.out.println(finalMessage);

MyMessages.properties
MyMessage = Mary is going out on a date with {0} on {1, date} meeting at {2} at {3, time}.

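Output (the date and time shown follow German conventions; with Locale.US, as set in the listing, they would appear in US style instead):
Mary is going out on a date with Rahul on 15.01.2006 meeting at Pizza Hut at 16:47:03.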

For details on parameterized messages, see: http://java.sun.com/j2se/1.4.2/docs/api/java/text/MessageFormat.html
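Two details of parameterized messages are easy to miss: a placeholder can carry a format type, which is then rendered according to the locale, and a literal apostrophe in the pattern must be doubled. The following sketch shows both. The class name MessageFormatSketch, the helper render(), and the message texts are assumptions made for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical class name; not part of the original listings.
class MessageFormatSketch {

    // Substitutes the arguments into the pattern using the locale's conventions.
    static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // {1,number,integer} is formatted with the locale's grouping separator.
        String pattern = "{0} has {1,number,integer} unread messages.";
        System.out.println(render(pattern, Locale.US, "Rahul", 1234567));
        System.out.println(render(pattern, Locale.GERMANY, "Rahul", 1234567));

        // A literal apostrophe must be written as two apostrophes in the pattern.
        System.out.println(render("It''s {0}''s turn.", Locale.US, "Mary"));
    }
}
```

Translators editing resource files need to know about the apostrophe rule as well, since a single apostrophe in a translated message quietly switches off the placeholders that follow it.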

Note: When the JVM starts, its default locale is the same as the default locale of the system, unless it is changed explicitly as follows:

java -Duser.language=ja -Duser.country=JP mypak.myapp
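The default locale can also be inspected and changed from within a running program. The sketch below shows how formatters created without an explicit locale pick up whatever the default is at that moment. The class name DefaultLocaleSketch and its helpers are assumptions for illustration; in application code it is usually safer to pass the locale explicitly than to change the JVM-wide default.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical class name; not part of the original listings.
class DefaultLocaleSketch {

    // Formats with whatever the JVM's default locale currently is.
    static String formatWithDefault(long value) {
        return NumberFormat.getNumberInstance().format(value);
    }

    // Temporarily switches the default locale, formats the value,
    // then restores the previous default even if formatting fails.
    static String formatAs(Locale locale, long value) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(locale);
        try {
            return formatWithDefault(value);
        } finally {
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        System.out.println("JVM default locale: " + Locale.getDefault());
        System.out.println("As Germany: " + formatAs(Locale.GERMANY, 1234567));
        System.out.println("As US:      " + formatAs(Locale.US, 1234567));
    }
}
```

Because Locale.setDefault() affects the whole JVM, a change made in one thread is visible to all others, which is one more reason to reserve it for startup code or experiments like this one.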

Conclusion

Basic concepts of characters and encodings are essential for every developer. Globalization is an art of writing programs. And, Unicode is the future.