Introduction to Character Encodings, Java and You

Upload: silvester-floyd

Post on 26-Dec-2015


TRANSCRIPT

Page 1: Introduction to Character Encodings, Java and You

Introduction to Character Encodings, Java and You


Private and Confidential

Agenda

Defining the problem
– Where webMethods products encounter character set problems.
– What the symptoms look like.

Understand core concepts
– What is a character set? What’s an encoding?
– What is Unicode, really?

Code examples to avoid problems


Confusion Reigns

Character encoding is generally the most confusing aspect of internationalization.

1. Many, many standards to choose from.
2. Arcane terminology.
3. American programmers rarely (seem to) encounter it head-on.

– We’re presenting this because many of our products are encountering this problem now.


Problem Domain

webMethods products interface with:
– non-Java systems (for example, in the adapters)
– non-Java environments (file systems, databases, libraries, email, FTP, HTTP, etc.).


Java’s Text Representation

Java provides a convenient text processing architecture centered on the Java String object.
– A Java String is basically an array of Java characters (char values).


Java Characters

Each Java Character object represents a Unicode character.
– (Currently) a 16-bit unsigned integer value between 0 and 65,535.
– Character class provides access to character properties:
   UPPER, lower, and Titlecase mapping
   Comparison
   Directionality
   Compatibility
   C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumeric-ness’


Non-Java Text

Non-Java files, applications, file systems, databases, et al. typically do not use Unicode. Java sees their text as an array of bytes (byte[]).


Three Problems

? Bad Conversion

□ No glyph

ƒÃƒ\ǂكÃ\ǂÙ Random-seeming trash characters


Bad Conversion

Target character set doesn’t have this character in it. Java replaces each character with a “?”.

Input String: 日本語
Output String: ???

Typically:
– Using the default encoding when we meant to specify one.
– Writing on a device (such as System.out) whose legacy encoding doesn’t support the characters.
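
The “?” substitution can be reproduced in a few lines. This is a minimal sketch, not code from the presentation: the class and method names are made up for illustration, and the string literals use \u escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

// Sketch: encoding a String into a charset that lacks its characters.
final class BadConversionDemo {

    // Encode to ISO-8859-1 and decode again to see what was actually stored.
    static String roundTripLatin1(String text) {
        // getBytes(Charset) silently substitutes the charset's replacement
        // byte ('?' for ISO-8859-1) for every character it cannot map.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The same round trip through UTF-8, which can represent every character.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u65E5\u672C\u8A9E"; // 日本語
        System.out.println(roundTripLatin1(input));             // prints ???
        System.out.println(roundTripUtf8(input).equals(input)); // prints true
    }
}
```

Once the “?” bytes have been written, the original characters cannot be recovered; the fix has to happen at the point of conversion.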


“No Glyph”

Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the currently selected Font).

Input String: 日本語
Output String: □□□

Typically:
– Nothing is wrong, just using the wrong Font.


Random Trash

A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters.

Input String: 日本語
Output String: “ú–{Œê (Shift-JIS bytes read as Windows-1252)

Typically:
– Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.


Examples

Same byte sequences, different results:

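Shift JIS byte[] = 0xE0, 0x41, 0x83, 0x70 = “漓パ”

Latin-1 byte[] = 0xE0, 0x41, 0x83, 0x70 = “àA□p” (0x83 is a C1 control character with no glyph)

Java String = 0xE041, 0x8370 = “□荰” (the same bytes read as two UTF-16 code units; U+E041 is a private-use character)

Java String = “漓パ” = U+6F13 U+30D1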
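
The byte sequence from this slide can be decoded three ways to see the effect directly. This is a sketch under one assumption: the JDK in use ships the Shift_JIS charset (standard JDK builds do, but the Java specification does not require it). The class and helper names are illustrative.

```java
import java.nio.charset.Charset;

// Sketch: one byte[] decoded with three different charsets.
final class SameBytesDemo {
    static final byte[] BYTES = {(byte) 0xE0, 0x41, (byte) 0x83, 0x70};

    // Decode the fixed byte sequence using the named charset.
    static String decode(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    // List each char of the String as U+XXXX so invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints(decode("Shift_JIS")));  // prints U+6F13 U+30D1
        System.out.println(codePoints(decode("ISO-8859-1"))); // prints U+00E0 U+0041 U+0083 U+0070
        System.out.println(codePoints(decode("UTF-16BE")));   // prints U+E041 U+8370
    }
}
```

Printing code points rather than the decoded text avoids a second conversion problem on the console.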


Character Set Terminology


What is a Character?

A character is a single, atomic unit of text.

The definition has a different meaning according to the writing system and context.


Abstract characters

Some abstract characters include:

A Roman Letter Capital A
` Combining Accent Grave
に Hiragana character “ni”
語 CJK Ideograph
ي Arabic letter
앚 Hangul syllable
Ａ Fullwidth compatibility letter A


What is a Character Set?

A character set is a “set”: a collection of characters, usually organized in some fashion.

You’re probably most familiar with ASCII:
– 0x41 ‘A’
– 0x42 ‘B’
– Etc.


What is a Character Encoding?

Character set: a collection of characters, basically, a bucket.

Character encoding: the specific ones and zeroes assigned to a character set.

Character Set: ‘A’ == 0x41

Character Encoding: ‘A’ == 0x41


Eight Bit Encodings

8-bit encodings allow for 256 characters.

128 ASCII

32 ‘C1’ controls

96 extended


Latin-1

The standard for Western Europe is generally ISO-8859-1, AKA “Latin-1”.
Used by UNIX systems and the Web.
Extended version used by Microsoft for Windows.


Let a Thousand Encodings Bloom…

Each language has its own character set…
– Everywhere: ASCII*
– Western European (like German or French): Latin-1
– Eastern European (like Polish or Slovak): Latin-2
– Simplified Chinese: GB2312


Actually, many for each language…


Other Writing Systems

Writing systems vary around the world (in order of increasing complexity, more or less):
– Latin-based alphabets (ABCDEFG…) English
– Cyrillic and Greek-based alphabets (АБВГДЕЖЩ...) Russian
– Ideographic writing systems have thousands of characters (一丁勺両亀困 ...) Japanese
– Bi-directional (RTL) languages go right to left: Hebrew (זוהדגבא...)
– Complex scripts (everything else): (ऋऌऍऎ) Devanagari


Expanded Character Sets

Most languages have alphabetic or phonetic writing systems:
– Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
– Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems
– 8 bits is enough for all of the above (with some tricks)

Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”):
– Chinese
– Korean
– Vietnamese (traditional)
– Japanese Kanji


“Double-Byte”

8-bit character encodings use eight bits per character.
– 2^8 = 256 characters

“Double-byte” character sets must be 2 bytes per character?
– 2^16 = 65,536 characters

Should actually be called “multi-byte” (MBCS).
– Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.
– MAY involve shift states.


Multibyte Encodings

A typical Japanese character set: JIS X 0208 (漢字)

Character encodings of JIS X 0208:
Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
EUC-JP: 0xB4 0xC1 0xBB 0xFA
ISO-2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A

Non-legacy:
UTF-16: 0x6F22 0x5B57
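
The byte sequences listed above can be reproduced by encoding 漢字 under each charset name. A rough sketch, assuming the JDK provides the Shift_JIS, EUC-JP and ISO-2022-JP charsets (typical, but not guaranteed by the Java specification); the class and helper names are illustrative.

```java
import java.nio.charset.Charset;

// Sketch: one String, several byte sequences depending on the encoding.
final class EncodingBytesDemo {

    // Encode text with the named charset and render the bytes as hex pairs.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // 漢字
        String[] names = {"Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE", "UTF-8"};
        for (String name : names) {
            System.out.println(name + ": " + hex(kanji, name));
        }
    }
}
```

The ISO-2022-JP output begins with the escape sequence 1B 24 42; the closing escape sequence a given JDK emits may differ from the one shown on the slide.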


An MBCS Example: Shift-JIS

Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.
– Code Page 932
– JIS X 0208:1997


Shift-JIS

In order to reach more characters, double-byte values start with a limited range of “lead bytes”.
These can be followed by a “trail byte” with a value of 0x40 or above.


Shift-JIS

Each “lead byte” provides a “window” onto additional characters.


Shift-JIS

Problems:
– Lead byte values are also valid as trail bytes.
– Common special characters (“\”!!) are valid trail bytes.
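
The backslash problem is easy to demonstrate with the character 表 (U+8868), whose Shift-JIS encoding is 0x95 0x5C. The sketch below assumes the JDK ships the Shift_JIS charset; the class and method names are illustrative. It compares a naive byte scan with a scan of the decoded characters.

```java
import java.nio.charset.Charset;

// Sketch: a Shift-JIS trail byte that looks like an ASCII backslash.
final class TrailByteDemo {
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Naive approach: count 0x5C bytes in the encoded form.
    static int countBackslashBytes(String text) {
        int count = 0;
        for (byte b : text.getBytes(SJIS)) {
            if (b == 0x5C) count++;
        }
        return count;
    }

    // Safe approach: count backslash characters in the decoded String.
    static int countBackslashChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\\') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String hyou = "\u8868"; // 表
        System.out.println(countBackslashBytes(hyou)); // prints 1 (a false positive)
        System.out.println(countBackslashChars(hyou)); // prints 0
    }
}
```

Code that parses paths or escape sequences byte by byte hits exactly this false positive, which is one more reason to decode to a String before inspecting text.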


Han

CJK scripts require up to 100,000 unique characters for complete representation.
– Four major variants:
   Traditional Chinese
   Simplified Chinese
   Japanese Kanji
   Korean (non-Hangul)


“Kanji”

Sometimes you hear Japanese called “kanji”.
– Kanji is actually one of four writing systems used in Japan.
– Kanji should be avoided as a generic term for DBCS.

Kanji (“Han” or Chinese writing): 日本語
Hiragana (phonetic for Japanese words): にほんご
Katakana (phonetic for “foreign” words): ニホンゴ
Romaji (“Roman script”): nihongo


Chinese

Upper two are Traditional.
Lower character is the Simplified variant.


Hangul

Korean Hangul is a syllabic phonetic system, which has thousands of combinations.
– Hangul is not related to Han ideographic writing.


Code Page Hell

With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was “hellish”.
Internationalization folks referred to this as “code page hell”.


Unicode and Java

To the Rescue


Unicode (ISO 10646-2)

Unicode is a character set that supports all of the world’s languages and writing systems.*

Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters.

Extended to allow 1.1 million characters.

Unicode is maintained by an industry consortium.

ISO 10646-2 is maintained by WG2.

The two are exactly identical.


It’s a character set?

Unicode is a character set. It has these encodings:
– UTF-32 (BE/LE)
   A 32-bit encoding. All characters are 32 bits.
– UTF-16 (BE/LE)
   A 16-bit encoding. Characters in the “Basic Multilingual Plane” (up to 0xFFFF) are 16 bits.
   Characters above 0xFFFF require two special “surrogate” characters.
– UTF-8
   An 8-bit variable width encoding.
   Characters are 1, 2, 3 or 4 bytes long.
   Always non-endian.
   ASCII == ASCII
   All other characters have a special bit pattern
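
The surrogate mechanism shows up in Java as a difference between a String's length in chars and its length in code points. A minimal sketch using only standard java.lang methods; the class name and helpers are illustrative, and U+20000 is simply a convenient character outside the Basic Multilingual Plane.

```java
// Sketch: one character above 0xFFFF occupies two Java chars.
final class SurrogateDemo {

    // Build a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Render each 16-bit char (UTF-16 code unit) as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x8A9E);   // 語, inside the BMP
        String supp = fromCodePoint(0x20000); // outside the BMP
        System.out.println(codeUnits(bmp));   // prints 8A9E
        System.out.println(codeUnits(supp));  // prints D840 DC00
        System.out.println(supp.length());    // prints 2
        System.out.println(supp.codePointCount(0, supp.length())); // prints 1
    }
}
```

Code that walks a String with charAt and treats each char as a whole character will split such pairs; codePointAt and codePointCount are the code-point-aware alternatives.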


UTF-8 Bit Pattern

ASCII == ASCII
– 0x41 == ‘A’

All other characters are multibyte.
– 110xxxxx == two bytes
– 1110xxxx == three bytes
– 11110xxx == four bytes
– 10xxxxxx == trail byte

– U+00C0 == À == 0xC3 0x80 (11000011 10000000)
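
The bit patterns above translate into a short hand-written encoder. This is an illustrative sketch only (real code should use the JDK's own UTF-8 support): the class name is made up, and the method does no validation, so surrogate values and out-of-range input are the caller's problem.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: encoding one code point to UTF-8 by hand, following the bit patterns.
final class Utf8BitPatternDemo {

    // Encode a single code point (assumed valid, 0..0x10FFFF, not a surrogate).
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx: ASCII stays ASCII
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                    // 11110xxx + three trail bytes
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0xC0); // À
        byte[] byJdk = "\u00C0".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, byJdk)); // prints true
    }
}
```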


Convenience Method for UTF8

Almost True: readUTF and writeUTF allow direct access to UTF-8 DataInput/DataOutputStreams.
– This is not really UTF-8, but a Sun specialized version.
– Use InputStreamReader/OutputStreamWriter to do proper conversions.
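
The difference between writeUTF output and real UTF-8 can be seen by comparing bytes. A small sketch with illustrative class and method names: writeUTF prepends a two-byte length, encodes U+0000 as two bytes, and writes each surrogate as its own three-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch: DataOutputStream.writeUTF versus a real UTF-8 conversion.
final class ModifiedUtf8Demo {

    // Bytes produced by writeUTF (length prefix plus "modified UTF-8").
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return buffer.toByteArray();
    }

    // Bytes produced by the standard UTF-8 charset.
    static byte[] viaCharset(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";
        String supplementary = new String(Character.toChars(0x20000));
        System.out.println(viaWriteUTF(nul).length);           // prints 4 (00 02 C0 80)
        System.out.println(viaCharset(nul).length);            // prints 1
        System.out.println(viaWriteUTF(supplementary).length); // prints 8
        System.out.println(viaCharset(supplementary).length);  // prints 4
    }
}
```

Data written with writeUTF should therefore only be read back with readUTF, not handed to a generic UTF-8 decoder.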


Java Uses Unicode

Every character in every Java String object is encoded as UTF-16 Unicode.
– Every string is converted from a legacy encoding, either by the compiler or by the String class.
– This is the reason for native2ascii and -encoding switches.

Once you have a String object, everything is Unicode UTF-16.


“Special” encodings

There are two encodings that the system treats as special:
– file.encoding
– ISO-8859-1

All basic conversion functions use your system default encoding.
Most servlet conversion functions use ISO-8859-1 as the default.
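
To check what the default actually is on a given machine, and to see that the no-argument conversions depend on it, something like the following can be used. A minimal sketch with illustrative names; the values printed vary by platform and JVM settings, so no expected output is shown for them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: implicit (default) versus explicit encoding choices.
final class DefaultCharsetDemo {

    // getBytes() with no argument is the same as asking for the default charset.
    static boolean implicitMatchesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    // Naming the charset makes the result independent of the machine's settings.
    static byte[] explicitUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println(implicitMatchesDefault("caf\u00E9")); // prints true
        System.out.println(explicitUtf8("\u00E9").length);       // prints 2
    }
}
```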


Two File Encodings

Windows systems generally have two different file encodings:
– “ANSI” encoding is the Windows default code page for GUI applications.
– “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.


Stream Readers and Writers

InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.
– Always pass the encoding as a variable.
– Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/)
– Prefer UTF8 for on-the-wire transport.


Code Sample

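import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;

// use with any type of InputStream class
InputStream is = new FileInputStream(file);
// decode bytes to characters using an explicitly named encoding
InputStreamReader isr = new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;

// read() returns -1 at end of stream
while ((chr = br.read()) > -1) {
    // cast to char: append(int) would append the number as decimal digits
    sb.append((char) chr);
}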

* Note: Try blocks eliminated for clarity.


OutputStreamWriter Code Sample

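import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;

// use with any type of OutputStream class
OutputStream os = new FileOutputStream(file);
// encode characters to bytes using an explicitly named encoding
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
osw.flush();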

* Note: Try blocks eliminated for clarity.


Character Class

Provides access to Unicode character properties.
– UnicodeBlock inside class
– Character getType (defined types)
– isDigit
– isLetter
– isLetterOrDigit
– isUpperCase/isLowerCase/isTitleCase
– toUpperCase/toLowerCase/toTitleCase
– isSpace/isWhitespace
– isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
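
A few of these methods applied to non-ASCII characters show that the properties come from Unicode rather than from ASCII ranges. A minimal sketch; the class and helper names are illustrative, and the sample characters (an Arabic-Indic digit, a CJK ideograph, and the digraph ǆ) are arbitrary choices.

```java
// Sketch: querying Unicode character properties through java.lang.Character.
final class CharacterPropertiesDemo {

    // Summarize a few properties of one char.
    static String describe(char c) {
        return String.format("U+%04X letter=%b digit=%b upper=%b",
                (int) c, Character.isLetter(c), Character.isDigit(c),
                Character.isUpperCase(c));
    }

    // The Unicode block the char belongs to.
    static Character.UnicodeBlock block(char c) {
        return Character.UnicodeBlock.of(c);
    }

    // Titlecase mapping, which can differ from the uppercase mapping.
    static char title(char c) {
        return Character.toTitleCase(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));      // prints U+0041 letter=true digit=false upper=true
        System.out.println(describe('\u0663')); // prints U+0663 letter=false digit=true upper=false
        System.out.println(block('\u8A9E') == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS); // prints true
        System.out.println(title('\u01C6') == '\u01C5'); // prints true
    }
}
```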


Normalization

Many characters have two (or more) representations in Unicode.
– Normalization makes the sequences the same.
– Simplifies user input parsing and validation.


ICUj Normalizer Class

Four forms of Normalization:
– Form C (canonical composed)
– Form D (canonical decomposed)
– Form KC (compatibility composed)
– Form KD (compatibility decomposed)

– Special handling for Hangul characters!
– Note that there is a private class java.text.Normalizer in the JDK (public API since Java 6).
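
The four forms can be tried without ICU4J. The sketch below swaps in the JDK's own java.text.Normalizer, which has been public API since Java 6, rather than the ICU class this slide is about; the wrapper class and method names are illustrative.

```java
import java.text.Normalizer;

// Sketch: Unicode normalization with the JDK's java.text.Normalizer.
final class NormalizationDemo {

    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        String decomposed = "A\u0300"; // A + combining grave accent
        String composed = "\u00C0";    // À as a single character

        System.out.println(decomposed.equals(composed));      // prints false
        System.out.println(nfc(decomposed).equals(composed)); // prints true
        System.out.println(nfd(composed).equals(decomposed)); // prints true

        // Compatibility forms also fold variants such as fullwidth letters.
        System.out.println(nfkc("\uFF21")); // prints A

        // Hangul syllables decompose algorithmically into conjoining jamo.
        System.out.println(nfd("\uD55C").length()); // prints 3
    }
}
```

Normalizing both sides to the same form before comparing or validating is what makes user input such as “À” match regardless of how it was typed.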


Demo Programs

UnicodeDemo – a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows ISR and OSW in action.
Charsets – a Windows program by my buddy Bill Hall for playing with encodings.
http://www.inter-locale.com – my personal website, with examples and demos of certain Java I18n things.