IBM Globalization Center of Competency IUC 29, Burlingame, CA March 2006 © 2006 IBM Corporation Automatic Character Set Recognition Eric Mader, IBM Andy Heninger, IBM

IBM Globalization Center of Competency

IUC 29, Burlingame, CA March 2006 © 2006 IBM Corporation

Automatic Character Set Recognition

Eric Mader, IBM

Andy Heninger, IBM

Overview

What is character set detection?

How is it used?

Character set detection libraries

How ICU’s library is implemented

Conclusion

What is Character Set Detection?

Tower of Babel

– Dozens of character encodings in common use

– Web pages, emails, plain text files

– Protocols specify character encoding

Encoding information may be missing or incorrect

– May be missing entirely

– Server may have incorrectly overridden it

– Translator may have failed to update it

Character set detection to the rescue!

How is Character Set Detection Used?

Web browsers, search engines, email

– Web pages, email have character encoding information

– This information may be missing or incorrect

File indexing

– Must handle plain text files

– Character encoding information may be incorrect

Character Set Detection Libraries

Mozilla

– C++ and Java versions

– Incremental operation

Windows API

– IMultiLanguage2::DetectInputCodepage

– IMultiLanguage2::DetectCodepageInIStream

ICU

– C and Java versions

ICU’s Character Set Detection Library

Detection function

– Returns character set, confidence

Conversion function

– Converts data to Unicode

Convenience functions to do both

Three Classes of Character Sets

Single Byte

– Each byte corresponds to one Unicode character

Multi-Byte

– Two or more bytes represent a single Unicode character

Algorithmic

– Encoding scheme produces distinctive byte patterns

Detecting Single Byte Character Sets

Can’t use byte patterns

– Any byte legal in any position

Use statistical method

– Have statistics for each language

– Match statistics of input to each language

– Assumes input is natural language plain text

Language Statistics

Trigrams

– Groups of three adjacent letters

– Treat runs of punctuation, spaces as single space

Data is list of most common trigrams

– Computed from large, varied sample of text

Compute trigrams for input, compare

– Confidence based on number of common trigrams
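
The trigram comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it works on text that has already been decoded with a candidate character set, the function names `trigram_counts` and `trigram_confidence` are made up for this example, and the scoring (share of input trigrams found in the common list) is one plausible reading of "number of common trigrams", not necessarily the formula ICU uses.

```python
import re
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count three-letter windows after collapsing non-letter runs to one space."""
    # Runs of punctuation, digits, and whitespace are treated as a single space.
    collapsed = re.sub(r"[\W\d_]+", " ", text.lower()).strip()
    padded = " " + collapsed + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def trigram_confidence(text: str, common_trigrams: set) -> int:
    """Return 0-100: the share of input trigrams that appear in the common list."""
    counts = trigram_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0
    hits = sum(n for tri, n in counts.items() if tri in common_trigrams)
    return round(100 * hits / total)


# Hypothetical, tiny stand-in for a real per-language trigram list.
ENGLISH_SAMPLE = {" th", "the", "he ", " an", "and", "nd "}
score = trigram_confidence("The cat and the hat", ENGLISH_SAMPLE)
```

A detector would run this once per candidate character set and language, decoding the input bytes with each candidate, and keep the highest score.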

Single Byte Character Sets Detected By ICU

Name           Languages
ISO-8859-1     Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish
ISO-8859-2     Czech, Hungarian, Polish, Romanian
ISO-8859-5     Russian
ISO-8859-6     Arabic
ISO-8859-7     Greek
ISO-8859-8     Hebrew
ISO-8859-9     Turkish
Windows-1251   Russian
Windows-1256   Arabic
KOI8-R         Russian

Multi-Byte Character Set Detection

Used for Chinese, Japanese, Korean

Can use byte patterns

– Rules for which bytes can be in each position

– Can reject data that breaks the rules

Must also use statistics

– List of most commonly used characters

– Confidence based on percentage of common characters
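
As a rough sketch of the two steps above, the following Python checks EUC-KR byte rules and then scores the input by its percentage of common characters. The assumptions: EUC-KR text is either ASCII bytes (00-7F) or two-byte characters whose lead and trail bytes are both in A1-FE; `common_chars` is a hypothetical set of two-byte sequences standing in for a real frequency list; and the function name is invented for this example.

```python
def euc_kr_confidence(data: bytes, common_chars: set) -> int:
    """Return 0 if the byte rules are broken, else the percentage of
    two-byte characters that appear in common_chars."""
    i = 0
    double_byte = 0
    common = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            i += 1  # ASCII byte, carries no evidence either way
            continue
        # A two-byte character needs a lead and a trail byte in A1-FE.
        if not 0xA1 <= lead <= 0xFE:
            return 0
        if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
            return 0
        double_byte += 1
        if data[i:i + 2] in common_chars:
            common += 1
        i += 2
    if double_byte == 0:
        return 0  # pure ASCII: nothing to base a match on
    return round(100 * common / double_byte)


# Hypothetical one-entry "common characters" list.
COMMON = {"한".encode("euc_kr")}
sample = "한국 ABC".encode("euc_kr")
confidence = euc_kr_confidence(sample, COMMON)
```

This sketch rejects a buffer that ends in the middle of a character; a production detector would more likely tolerate a truncated final character.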

Chinese GB-2312, GBK, GB18030

GB-2312 (1980)

– 6,763 Han characters

GBK (1995)

– Extends GB-2312

– Adds all Han characters from Unicode 2.0

GB18030 (2000)

– Extends GBK

– Adds all of Unicode

ICU always matches GB18030

– Common characters are from GB-2312

– GB18030 to Unicode converter will handle all three
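
The point that one GB18030 converter covers all three encodings can be illustrated with Python's standard codecs (used here instead of ICU's converters, purely as a stand-in). The helper name `decode_chinese` is made up for this example.

```python
def decode_chinese(data: bytes) -> str:
    """Decode GB-2312, GBK, or GB18030 bytes with the GB18030 codec."""
    return data.decode("gb18030")


text = "中文编码"
# The same text encoded with either older standard decodes correctly
# through the GB18030 codec, because each standard extends the previous one.
from_gb2312 = decode_chinese(text.encode("gb2312"))
from_gbk = decode_chinese(text.encode("gbk"))
```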

Multi-Byte Character Sets Detected By ICU

Name        Language
Shift-JIS   Japanese
EUC-JP      Japanese
EUC-KR      Korean
GB18030     Chinese
Big5        Chinese

Algorithmic Character Sets

Identified by distinctive byte sequences

– Don’t need language statistics

UTF-8, UTF-16, UTF-32

ISO-2022-CN, ISO-2022-JP, ISO-2022-KR

Algorithmic Character Sets: UTF-8

Unicode encoding

Represents characters as sequence of one to four bytes

Can start with Byte Order Mark (BOM):

– EF BB BF

Very distinctive byte pattern

# of Bytes   Allowable Values at Each Position
1            [00-7F]
2            [C0-DF] [80-BF]
3            [E0-EF] [80-BF] [80-BF]
4            [F0-F7] [80-BF] [80-BF] [80-BF]
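
The byte-pattern table translates almost directly into a checker. The Python below is a sketch that uses exactly the ranges listed in the table; the function names are made up for this example. Note that these ranges are looser than strictly well-formed UTF-8, in which the bytes C0, C1, and F5-F7 never occur.

```python
UTF8_BOM = b"\xEF\xBB\xBF"


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(UTF8_BOM)


def matches_utf8_pattern(data: bytes) -> bool:
    """True if every byte sequence fits the lead/trail ranges in the table."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            trail_count = 0
        elif 0xC0 <= lead <= 0xDF:
            trail_count = 1
        elif 0xE0 <= lead <= 0xEF:
            trail_count = 2
        elif 0xF0 <= lead <= 0xF7:
            trail_count = 3
        else:
            return False  # a trail byte or F8-FF where a lead byte is expected
        trail = data[i + 1:i + 1 + trail_count]
        if len(trail) < trail_count:
            return False  # sequence cut off at the end of the buffer
        if any(not 0x80 <= b <= 0xBF for b in trail):
            return False
        i += 1 + trail_count
    return True
```

Text in a legacy single-byte encoding rarely passes this check once it contains accented letters: the Latin-1 bytes for "été" (E9 74 E9) fail because 74 is not a valid trail byte.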

Algorithmic Character Sets: UTF-16

Unicode encoding

Represents characters as sequence of 16-bit words

Starts with Byte Order Mark (BOM):

– FE FF (big-endian)

– FF FE (little-endian)

Confidence based on presence of BOM

– Could check for defined characters, script runs, etc.
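
A BOM check for UTF-16 might look like the Python sketch below. The function name is invented for this example, and one design choice is an assumption of the sketch rather than something the slides state: input starting with FF FE 00 00 is left to the UTF-32 check, since that is also the little-endian UTF-32 BOM.

```python
def utf16_from_bom(data: bytes):
    """Return 'UTF-16BE', 'UTF-16LE', or None, based only on the BOM."""
    if data[:2] == b"\xFE\xFF":
        return "UTF-16BE"
    # FF FE followed by 00 00 is more likely the UTF-32LE BOM.
    if data[:2] == b"\xFF\xFE" and data[2:4] != b"\x00\x00":
        return "UTF-16LE"
    return None
```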

Algorithmic Character Sets: UTF-32

Unicode encoding

Represents characters as 32-bit words

Can start with Byte Order Mark (BOM):

– 00 00 FE FF (big-endian)

– FF FE 00 00 (little-endian)

Confidence based on presence of characters in Unicode range

Byte pattern is fairly distinctive

– Lots of zero bytes
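
The range check described above can be sketched as follows. This is an illustrative Python version with made-up function names: it reads the buffer as 32-bit words and accepts a byte order only if every word is a valid Unicode code point (at most 10FFFF and not a surrogate). When no BOM is present it simply tries both byte orders, which is an assumption of this sketch.

```python
BOM_BE = b"\x00\x00\xFE\xFF"
BOM_LE = b"\xFF\xFE\x00\x00"


def looks_like_utf32(data: bytes, byteorder: str) -> bool:
    """True if every 32-bit word is a code point in the Unicode range."""
    if len(data) == 0 or len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        code_point = int.from_bytes(data[i:i + 4], byteorder)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return False
    return True


def detect_utf32(data: bytes):
    """Return 'UTF-32BE', 'UTF-32LE', or None."""
    if data.startswith(BOM_BE):
        candidates = [("UTF-32BE", "big")]
    elif data.startswith(BOM_LE):
        candidates = [("UTF-32LE", "little")]
    else:
        candidates = [("UTF-32BE", "big"), ("UTF-32LE", "little")]
    for name, byteorder in candidates:
        if looks_like_utf32(data, byteorder):
            return name
    return None
```

Ordinary single-byte text fails quickly: four ASCII letters read as one 32-bit word give a value far above 10FFFF.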

Algorithmic Character Sets: ISO-2022

Used for Chinese, Japanese, Korean

– Widely used in email

Uses embedded escape sequences, shift codes

– e.g. 1B 24 29 43 (ESC $ ) C) is the ISO-2022-KR escape sequence

Confidence based on escape sequences:

– Presence of known sequences, absence of unknown

– No overlap for Chinese, Japanese, Korean sequences
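
A minimal Python sketch of escape-sequence scoring follows. The escape sequences listed are standard designations for each encoding, but the list is not exhaustive, and the scoring formula (known sequences minus unknown ones, as a percentage) is an assumption made for illustration, not ICU's actual calculation. The names `ESCAPES` and `iso2022_scores` are invented for this example.

```python
ESC = b"\x1b"

# Known escape sequences per encoding (a partial list).
ESCAPES = {
    "ISO-2022-JP": [b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J"],
    "ISO-2022-KR": [b"\x1b$)C"],
    "ISO-2022-CN": [b"\x1b$)A", b"\x1b$)G", b"\x1b$*H"],
}


def iso2022_scores(data: bytes) -> dict:
    """Score each ISO-2022 variant from the escape sequences in the data."""
    scores = {}
    for name, known in ESCAPES.items():
        hits = 0
        misses = 0
        pos = data.find(ESC)
        while pos != -1:
            # An escape counts for this variant only if it is a known sequence.
            if any(data.startswith(seq, pos) for seq in known):
                hits += 1
            else:
                misses += 1
            pos = data.find(ESC, pos + 1)
        if hits == 0:
            scores[name] = 0
        else:
            scores[name] = max(0, round(100 * (hits - misses) / (hits + misses)))
    return scores


# Japanese text in ISO-2022-JP: switch to JIS X 0208, then back to ASCII.
japanese = b"\x1b$BF|K\\8l\x1b(B"
best = max(iso2022_scores(japanese).items(), key=lambda item: item[1])[0]
```

Because the three variants use disjoint sequences, an escape that scores as a hit for one variant counts as a miss for the other two.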

Character Set Detection and Markup

HTML documents contain headers, markup, JavaScript

Can interfere with language-based detection

– Not part of text content

– Uses Latin alphabet

ICU provides a basic markup filter

– Use if text known to contain markup

– Use for languages written in Latin alphabet
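
To show the idea of a basic markup filter, here is a small Python sketch that drops everything between "<" and ">" before detection. It is not ICU's filter: the function name is made up, and it assumes an encoding in which "<" and ">" are the single bytes 3C and 3E. It also leaves script and style content between tags in place, so it is only a first approximation.

```python
def strip_markup(data: bytes) -> bytes:
    """Remove <...> tags from raw bytes, leaving one space per tag."""
    out = bytearray()
    in_tag = False
    for b in data:
        if b == 0x3C:  # '<' opens a tag
            in_tag = True
        elif b == 0x3E and in_tag:  # '>' closes it
            in_tag = False
            out.append(0x20)  # keep words on either side of the tag apart
        elif not in_tag:
            out.append(b)
    return bytes(out)


filtered = strip_markup(b"<p>Hello <b>world</b></p>")
```

The filtered bytes, not the original page, would then be handed to the statistical detector.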

How Much Text is Required?

Good results with a few hundred bytes of plain text

Complex web sites can have kilobytes of markup

– Usually at the beginning

– Our experience: 6 kilobytes is enough

Trade-off between speed and accuracy

Test results:

[Chart: Charset Detection. Successful detection (0% to 100%) plotted against buffer length (20 to 397 bytes). Series: 8859-2-pl, Shift-jis, euc-jp, 8859-6-ar, 8859-1-de, 8859-1-en, 8859-1-es, Big5, Average.]

Language Detection

Language detected as side effect

No language for UTF encodings

– We could adapt single-byte data

Closely related languages may be confused

– e.g. French, Spanish, Portuguese

Use linguistic analysis libraries for more accuracy

Test results:

[Chart: Language Detection. Successful detection (0% to 100%) plotted against buffer length (20 to 397 bytes). Series: 8859-2-pl, Shift-jis, euc-jp, 8859-6-ar, 8859-1-de, 8859-1-en, 8859-1-es, Big5, Average.]

Cautions

Character set detection is not 100% reliable

– Based on statistics

– Assumes data is natural language text

– Doesn’t have data for all encodings

Designed to work on plain text

– Markup, etc. will confuse it

– Won’t work on binary formats, like word processing documents

Conclusions

Can read and understand text in unknown encoding

Any program that reads text from uncontrolled sources can benefit

Freely available implementations make character set detection easy to use

Questions and Answers

Character Sets Detected by ICU

Name          Type          Languages
ISO-8859-1    Single Byte   English, German, French, Spanish, Danish
ISO-8859-2    Single Byte   Czech, Hungarian, Polish
ISO-8859-5    Single Byte   Russian
ISO-8859-6    Single Byte   Arabic
ISO-8859-7    Single Byte   Greek
ISO-8859-8    Single Byte   Hebrew
ISO-8859-9    Single Byte   Turkish
KOI8-R        Single Byte   Russian
Shift-JIS     Multi-Byte    Japanese
EUC-JP        Multi-Byte    Japanese
ISO-2022-JP   Algorithmic   Japanese
GB18030       Multi-Byte    Chinese
ISO-2022-CN   Algorithmic   Chinese
Big5          Multi-Byte    Chinese
EUC-KR        Multi-Byte    Korean
ISO-2022-KR   Algorithmic   Korean
UTF-8/16/32   Algorithmic   All (Unicode)