29th Internationalization and Unicode Conference © 2006 IBM Corporation

ICU Overview: The Open Source Unicode Library

Markus Scherer
IBM Globalization Center of Competency

San Francisco, California, March 2006

Agenda

Background Information

What is ICU?

Architecture Overview

– Significant New ICU Features

References

Q and A

Why Globalization?

Unicode

Handles all modern world languages

Efficient and effective processing

Lossless data exchange

Enables single-binary global software

But… all languages ⇒ large, complex standard

– 1,400 pages + Annexes + additional standards

– Almost 100,000 characters

– Major update every 3 years; minor update about once a year

– 70 character properties, many multi-valued

– Affects many processes: display, line-break, regular expressions, …

Internationalization, Localization & Locales

Requirements vary widely across languages & countries

– Sorting

– Text searching

– Line breaks

– Date/time/number/currency formatting

– Codepage conversion

– …and so on

Performance is key

– It is easy to do the right thing

– It is hard to do it fast

What is ICU?

International Components for Unicode

Globalization / Unicode / Locales

Mature, widely used set of C/C++ and Java libraries

– Basis for Java 1.1 internationalization, but goes far beyond Java 1.1

Very portable – identical results on all platforms / programming languages

– C/C++ (ICU4C): 30+ platforms/compilers

– Java (ICU4J): IBM & Sun JDK

– C/C++ with Java (ICU4JNI)

Full threading model

Customizable & Modular

Open source – but non-restrictive

Who uses ICU?

Products Within IBM

– All 5 major software brands

– Many other related software applications

– Used on all IBM operating systems

Other Companies and Organizations

– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo!...and many more

ICU Features

Unicode text handling

Charset conversions (700+)

Collation & Searching

Locales from CLDR (250+)

Resource Bundles

Calendar & Time zones

Complex-text layout engine

Unicode Regular Expressions

Breaks: word, line, …

Formatting

– Date & time

– Messages

– Numbers & currencies

Transforms

– Normalization

– Casing

– Transliterations

Architecture Overview 1

Locale Based Services

– Locale is an identifier, not a container

– Keywords for variants: de@collation=phonebook

Resource inheritance: shared resources

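– Diagram: locale tree rooted at "root"

• Language level: en, de, zh

• Script level (under zh): Hant, Hans

• Region level: US, IE (under en); DE, CH (under de); TW, CN (under zh)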
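
The identifier-plus-inheritance idea on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not ICU's API: `parse_locale_id`, `fallback_chain`, `lookup` and the `BUNDLES` table are hypothetical names, and the resource values are made up for the example.

```python
def parse_locale_id(locale_id):
    """Split 'de@collation=phonebook' into ('de', {'collation': 'phonebook'})."""
    base, _, keyword_part = locale_id.partition("@")
    keywords = {}
    for item in filter(None, keyword_part.split(";")):
        key, _, value = item.partition("=")
        keywords[key] = value
    return base, keywords


def fallback_chain(base):
    """Return the inheritance chain, e.g. zh_Hant_TW, zh_Hant, zh, root."""
    parts = base.split("_") if base else []
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + ["root"]


# Hypothetical resource data: each bundle stores only what differs from its parent.
BUNDLES = {
    "root": {"greeting": "Hello", "decimal": "."},
    "de": {"greeting": "Hallo", "decimal": ","},
    "de_CH": {"decimal": "."},
}


def lookup(locale_id, key):
    """Walk the chain until some bundle defines the key."""
    base, _ = parse_locale_id(locale_id)
    for name in fallback_chain(base):
        if key in BUNDLES.get(name, {}):
            return BUNDLES[name][key]
    raise KeyError(key)
```

Here `lookup("de_CH", "greeting")` finds nothing in `de_CH` and falls back to the shared `de` value, which is the point of storing shared resources once in a parent.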

Architecture Overview 2

Open and Close Service Model

– Open a service object, use it many times, close it when done

– Better performance by avoiding setup costs per operation

ICU Threading Model

– Multiple service objects in use simultaneously with same or different attributes

– Large resources shared in read-only cache

– Compatible with Java threading model
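
A rough sketch of the open/use/close pattern with a shared read-only cache, assuming nothing about ICU's actual implementation: `load_order_table` and `Comparer` are hypothetical names, and the "expensive setup" is only a stand-in.

```python
import functools
import types


@functools.lru_cache(maxsize=None)
def load_order_table(alphabet):
    """Hypothetical costly setup: build a read-only rank table once per alphabet."""
    return types.MappingProxyType({ch: rank for rank, ch in enumerate(alphabet)})


class Comparer:
    """Hypothetical service object: open once, use many times, close when done."""

    def __init__(self, alphabet):
        self._table = load_order_table(alphabet)  # setup cost is paid here

    def key(self, text):
        unknown = len(self._table)                # unknown characters sort last
        return [self._table.get(ch, unknown) for ch in text]

    def close(self):
        self._table = None                        # the cached table stays shared


first = Comparer("abc")
second = Comparer("abc")                          # reuses the same cached table
ordered = sorted(["cab", "abc", "bca"], key=first.key)
```

Each `Comparer` holds only a reference to the immutable table, so several objects can be used side by side without copying the large data.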

Architecture Overview 3

Data Driven Services

– Customize at build-time or run-time

– Interchange with other platforms;

• same results on each

– Rule-based

• Collation, Word-breaks, Transforms

– Pattern-based

• Date/Time/Number/Message formatting

– Table-based

• Character Conversion

Architecture Overview – ICU4C

Simple Error Handling

– Thread safe

– Works in C and C++

C/C++ subset for portability

Version Management

– Multiple versions of ICU4C in the same process memory space

– Data and library versioning

String Buffer Management

– Preflighting and overflow protection

Flexible

– Allows Loading and Unloading ICU4C libraries

– Runtime settable memory allocation and mutex functions
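
Preflighting can be illustrated with a toy function that never writes past the caller's buffer and always reports the length it needs. This is only a sketch of the calling pattern in Python; `to_upper_into` is a hypothetical name and a list stands in for a fixed-size character buffer.

```python
def to_upper_into(dest, text):
    """Write text.upper() into the fixed-size list `dest` if it fits.

    Returns (required_length, overflowed); nothing is written on overflow.
    """
    result = text.upper()
    needed = len(result)
    if needed > len(dest):
        return needed, True
    dest[:needed] = result
    return needed, False


# Preflight with an empty buffer to learn the required size, then call again.
needed, overflowed = to_upper_into([], "straße")
buffer = [""] * needed
written, overflowed_again = to_upper_into(buffer, "straße")
```

The example input shows why guessing the size is unsafe: uppercasing "straße" yields a longer string than the input.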

Architecture Overview – ICU4J

Supplement for Java

Core globalization (no character conversion or regular expressions)

– We do supply complex text support for Sun

Modularized: products may add just needed functionality

Usually drop-in replacement for JDK functionality

– Changing the import statements is usually all that is needed

ICU4J: Supplement for Java

CLDR (Common Locale Data Repository)

– More fully supported locales than Java

Up-to-date globalization: standards-compliant; latest Unicode

– Supplementary characters (GB 18030, JIS X 0213, HKSCS)

• Java 5 adds handling of supplementary characters

– Full properties – JDK has only a fraction

– Unicode Collation Algorithm

– Local calendars (Islamic, Japanese, …); more time zone localizations

– Currencies, String Search, Internationalized Domain Names

– Transforms: Case, Scripts, Normalization

Much shorter release cycle and quicker support for Unicode standard

Recent Additions for ICU4J

ICU4J in Eclipse Rich Client Platform

– Take advantage of advanced globalization and support for more locales and more recent Unicode

– Consistent behavior between application code and framework code

– ICU4J 3.4+ in Eclipse 3.2

– No-op jar for JDK behavior and low-memory devices

Improved portability & JRE independence

– Back-ported to Java 1.3 (via preprocessor)

– Carries own Time Zone data

Unicode Text Handling

All UTF-16 processing

C

– UChar*: null-terminated or with length

C++

– UnicodeString: full featured string class

Java

– Uses java.lang.String and adds utilities

All handle supplementary characters

– Required for GB 18030, HKSCS and JIS X 0213 repertoires
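
What "handling supplementary characters" means in UTF-16 can be shown with plain Python, used here only as an illustration rather than ICU itself. The helper names are hypothetical, and the counting function assumes well-formed UTF-16.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def count_code_points(units):
    """Count code points by skipping trail surrogates (0xDC00..0xDFFF)."""
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)


# U+20000 is a supplementary CJK ideograph; it needs a surrogate pair in UTF-16.
units = utf16_code_units("a\U00020000")
```

The string has three code units but two characters, so code that indexes by code unit has to treat the pair D840 DC00 as one unit of text.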

Abstract Text Access

Work with non-UTF-16 text

– UTF-16 text in discontinuous buffers

– UTF-8/32 APIs (char/wchar_t compatibility)

– Non-Unicode APIs, SBCS or MBCS

UText API: Efficient interface

– Provider implementation converts to UTF-16

– Fast inline code for text access within blocks

– Storage-native indexes avoid index translation

– Writable

For now only C/C++, used in few services

– Diagram: LineBreak reads UTF-8, SBCS, … text through UText
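
The idea of storage-native indexes can be sketched as an iterator that walks UTF-8 bytes and reports each code point together with its byte offset, so callers never translate between byte and UTF-16 indexes. This is a Python illustration, not the UText API; `iter_code_points_utf8` is a hypothetical name and the input is assumed to be well-formed UTF-8.

```python
def iter_code_points_utf8(data):
    """Yield (byte_index, code_point) pairs for well-formed UTF-8 bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1      # ASCII
        elif lead < 0xE0:
            length = 2
        elif lead < 0xF0:
            length = 3
        else:
            length = 4      # supplementary code point
        yield i, ord(data[i:i + length].decode("utf-8"))
        i += length


pairs = list(iter_code_points_utf8("a\u00e9\U00020000".encode("utf-8")))
```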

Unicode Character Handling

All Unicode properties (except Unihan)

– Direct API

• Values, names, enumerations

– UnicodeSet

• Fast, compact set operations (union, intersection, …)

• Pattern-based (both Perl & POSIX syntax for properties)

– \p{greek} vs. [:greek:]

• All properties:

– [\p{lowercase}-[a-z]]

– [\p{greek} & \p{uppercase}]

– Diagram: overlapping Alphabetic, Ideographic and Uppercase sets, with sample characters a, ξ, A, Ξ, 不, 与
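
The two set patterns on this slide can be approximated with Python's `unicodedata` module and ordinary sets. This is a sketch under stated assumptions, not UnicodeSet: the General_Category values Ll and Lu stand in for the Lowercase and Uppercase properties, the character name stands in for the Greek script property, and only a small code point range is scanned.

```python
import unicodedata

SAMPLE_RANGE = range(0x0000, 0x0530)  # Latin, Greek, Cyrillic; small on purpose


def chars_where(predicate):
    """Collect the characters in SAMPLE_RANGE that satisfy the predicate."""
    return frozenset(chr(cp) for cp in SAMPLE_RANGE if predicate(chr(cp)))


lowercase = chars_where(lambda ch: unicodedata.category(ch) == "Ll")
uppercase = chars_where(lambda ch: unicodedata.category(ch) == "Lu")
greek = chars_where(lambda ch: unicodedata.name(ch, "").startswith("GREEK"))
ascii_lower = frozenset("abcdefghijklmnopqrstuvwxyz")

lowercase_not_ascii = lowercase - ascii_lower  # like [\p{lowercase}-[a-z]]
greek_uppercase = greek & uppercase            # like [\p{greek} & \p{uppercase}]
```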

Recent Additions

Conforms to CLDR 1.3/1.4

– Adds and confirms many translated terms for languages, scripts, regions, currencies, and time zones

– Access to more CLDR items

Support for Unicode interpretation of POSIX properties

Charset detection API (ICU4J only)

Load Java resource bundles from ULocale, support zh_Hant etc.

Better modularization for memory constrained environments (ICU4C only)

New icupkg tool for data customization

– Easily update pre-built or installed data packages (.dat files)

Character Set Conversion

Precise alias information:

– When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, … )

Buffer management

– API automatically handles characters that cross buffers

– Can provide offset mappings between byte buffer and UChar buffer

Runtime customizations allowed for:

– Illegal sequences

– Undefined characters

Unicode Text Compression – SCSU, BOCU-1

Consistent conversion results across platforms

You can use more character sets at runtime or build time
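
Two of the points above, characters that cross buffer boundaries and caller-selected handling of illegal sequences, can be demonstrated with Python's standard `codecs` machinery. Python's decoders are swapped in here purely for illustration; they are not the ICU converter API.

```python
import codecs

# A character split across two input buffers: "é" is the bytes C3 A9 in UTF-8.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"caf\xc3")            # the lone lead byte is held back
second = decoder.decode(b"\xa9", final=True)  # the character completes here

# Caller-selected handling of an illegal sequence (0xFF never occurs in UTF-8).
replaced = b"ok\xff".decode("utf-8", errors="replace")
skipped = b"ok\xff".decode("utf-8", errors="ignore")
```

The incremental decoder keeps the partial character as internal state, which is the behavior the slide describes as handled automatically by the API.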

Collation: Sorting, Searching and Matching

Fast international comparison for string search; fully UCA compliant

– Compressed sort keys, optimized string comparison, sublinear string search

– Incremental sort keys used for radix sorting

Precise binary sort key stability over time (library versioning)

Fully data driven

– Many common rules provided

Runtime and build time rule customizations

– Strength, normalization, upper vs. lowercase first, ignore punctuation, numeric, …

– Only delta from UCA is needed for rule customization
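
The strength attribute can be pictured with a toy multi-level sort key: base letters first, then accents, then case. This is a deliberately simplified sketch and not the Unicode Collation Algorithm or ICU's collator; `sort_key` is a hypothetical helper that ignores tailoring, contractions and accent position.

```python
import unicodedata


def sort_key(text, strength=3):
    """Toy key: primary = base letters, secondary = accents, tertiary = case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    primary = tuple(ch.casefold() for ch in base)
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, secondary, tertiary)[:strength]


words = ["zebra", "\u00c4pfel", "apple"]
by_code_point = sorted(words)               # binary order puts "Äpfel" last
by_toy_key = sorted(words, key=sort_key)    # base letters decide first
```

With `strength=1` the keys for "cote" and "CÔTE" are equal, which is the kind of loose matching that searching typically wants; higher strengths bring accents and case back in as tie-breakers.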

Calendar & Time Zones

International Calendars – Islamic, Buddhist, Hebrew, Japanese

– Required for correct presentation of dates in some countries

Olson timezone support with localizations

Recent Additions:

– Many more time zone localizations

Formatting

Date & time: 8 formats per locale by default

Messages

– Completely localizable, plural support

Numbers & currencies

– Scientific Notation, Spelled-out (checks, etc.)

– Full Orthogonal Currency support (any currency × any language)

• INR In Hindi: रु१,२३४.५७

• INR In English: Rs. 1,234.57

• INR In German: Rs. 1.234,57

Recent Additions

– Flexible date/time formatting: Select format for desired set of fields

– List available currencies API

– Short and stand-alone month/day names
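
"Any currency × any language" means the number shape comes from the language and the symbol from the currency, looked up independently. The sketch below reproduces the three INR examples from the slide; the tables and `format_currency` are hypothetical stand-ins, not ICU's NumberFormat or its locale data.

```python
ASCII_DIGITS = "0123456789"
DEVANAGARI_DIGITS = "".join(chr(0x0966 + d) for d in range(10))

# Hypothetical per-language number data, modeled on the slide's examples.
LANGUAGE_DATA = {
    "en": {"swap_separators": False, "digits": ASCII_DIGITS},
    "de": {"swap_separators": True, "digits": ASCII_DIGITS},
    "hi": {"swap_separators": False, "digits": DEVANAGARI_DIGITS},
}

# Hypothetical (currency, language) prefix table; spacing follows the slide.
CURRENCY_PREFIX = {
    ("INR", "en"): "Rs. ",
    ("INR", "de"): "Rs. ",
    ("INR", "hi"): "\u0930\u0941",
}


def format_currency(amount, currency, language):
    data = LANGUAGE_DATA[language]
    text = f"{amount:,.2f}"                      # grouping "," and decimal "."
    if data["swap_separators"]:
        text = text.translate(str.maketrans(",.", ".,"))
    text = text.translate(str.maketrans(ASCII_DIGITS, data["digits"]))
    return CURRENCY_PREFIX[(currency, language)] + text
```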

Globalization Preferences

ICU: Convenient container for globalization preference values

Provides defaults for missing values

Convenient instantiation of related formatters

Preference    Example                          Standard
Language      en_US (or en-US)                 RFC 3066 (or successor)
Region        AU                               ISO 3166
Currency      EUR                              ISO 4217
Time zone     Australia/Melbourne              TZ DB
Calendar      islamic-civil                    CLDR Calendar ID
Custom Date   yyyy-MMM-dd                      CLDR Pattern Format
VAT           08.23% (books), 15.73% (food)    App/Country-Specific
…             Exact Composition Depends on System Requirements!

Transforms

Unicode Normalization

– Highly optimized for performance

– Performance utilities: concatenation, detection, comparison

Casing (upper, lower, title, folding)

General Transforms

– Script transliterations

– Half-width/Full-width, Hex, etc.

– Chain transforms together, filter source characters

– Rule-based, customizable at runtime.

StringPrep: Internationalized Domain Names (IDN), NFS
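
Several of these transforms have close analogues in Python's standard library, which is used below purely as an illustration and is not the ICU API. NFKC is shown as a stand-in for a full-width to half-width transform, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
import unicodedata

decomposed = "a\u0301"                               # a + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # one code point, U+00E1
round_trip = unicodedata.normalize("NFD", composed)  # back to two code points

folded = "Stra\u00dfe".casefold()                    # case folding for matching

# Compatibility normalization maps full-width Latin letters to ASCII.
narrow = unicodedata.normalize("NFKC", "\uff21\uff22\uff23")

# Internationalized domain name converted to its ASCII form.
ascii_host = "b\u00fccher.example".encode("idna")
```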

Segmentation: word, line & sentence

Fast state-table implementation

Customizable

– Rule-based – customizable at runtime

– Special customizations, e.g. Thai

Recent Additions:

– Uses new UText API

– ICU4J rule syntax aligned with ICU4C

– Tracks Unicode Line Break changes
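
A table-driven segmenter with rules that can be swapped at run time can be sketched as follows. This toy uses four character classes and a pair table; it is not the Unicode text segmentation rules or ICU's break iterator, and all names in it are hypothetical.

```python
def char_class(ch):
    """Map a character to one of four toy classes."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "N"
    if ch.isspace():
        return "S"
    return "O"


# Pairs of (previous class, current class) between which no break is allowed.
NO_BREAK = frozenset({("L", "L"), ("N", "N"), ("S", "S"), ("L", "N"), ("N", "L")})


def segments(text, rules=NO_BREAK):
    """Split text at every position whose class pair is not in `rules`."""
    if not text:
        return []
    pieces, start = [], 0
    for i in range(1, len(text)):
        if (char_class(text[i - 1]), char_class(text[i])) not in rules:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


default_split = segments("Hello, world 42!")
custom_split = segments("abc123", rules=NO_BREAK - {("L", "N")})
```

Passing a different `rules` set changes the behavior without touching the loop, which mirrors the slide's point about rule-based customization.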

Unicode Regular Expressions

Full Regex Implementation

– C/C++ only: Java 1.4 has own package (though not as powerful)

All Unicode Properties

– Supported through UnicodeSet

Good performance

– Competitive with non-Unicode regex

Complex-text layout engine

Glyph processing, positioning & adjustment

– Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.

Support for:

– Information for drawing

– Caret Display

– Hit Testing

– Selection Highlighting

– Caret Movement

– Layout Metrics

– Line Break

– Canonical Equivalence: a + ´ or á

Recent Additions:

– Tibetan, Sinhala, Indic ZWJ/ZWNJ, Kerning

References

ICU main site:

– http://www.ibm.com/software/globalization/icu/

– Links to

• Download ICU

• User Guide, Technical FAQ, Support, Bug Reports, Demonstrations

ICU support site:

– http://icu.sourceforge.net/

Unicode Consortium

– http://www.unicode.org/

• Unicode glossary, Unicode character database

Questions and Answers