29th Internationalization and Unicode Conference © 2006 IBM Corporation
ICU Overview: The Open Source Unicode Library
Markus Scherer
IBM Globalization Center of Competency
San Francisco, California, March 2006
Agenda
Background Information
What is ICU?
Architecture Overview
– Significant New ICU Features
References
Q and A
Why Globalization?
Unicode
Handles all modern world languages
Efficient and effective processing
Lossless data exchange
Enables single-binary global software
But… all languages ⇒ large, complex standard
– 1,400 pages + Annexes + additional standards
– Almost 100,000 characters
– Major update every 3 years; minor update about once a year
– 70 character properties, many multi-valued
– Affects many processes: display, line-break, regular expressions, …
Internationalization, Localization & Locales
Requirements vary widely across languages & countries
– Sorting
– Text searching
– Line breaks
– Date/time/number/currency formatting
– Codepage conversion
– …and so on
Performance is key
– It is easy to do the right thing
– It is hard to do it fast
What is ICU?
International Components for Unicode
Globalization / Unicode / Locales
Mature, widely used set of C/C++ and Java libraries
– Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
Very portable – identical results on all platforms / programming languages
– C/C++ (ICU4C): 30+ platforms/compilers
– Java (ICU4J): IBM & Sun JDK
– C/C++ with Java (ICU4JNI)
Full threading model
Customizable & Modular
Open source – but non-restrictive
Who uses ICU?
Products Within IBM
– All 5 major software brands
– Many other related software applications
– Used on all IBM operating systems
Other Companies and Organizations
– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC, Xerox, Yahoo! … and many more
ICU Features
Unicode text handling
Charset conversions (700+)
Collation & Searching
Locales from CLDR (250+)
Resource Bundles
Calendar & Time zones
Complex-text layout engine
Unicode Regular Expressions
Breaks: word, line, …
Formatting
– Date & time
– Messages
– Numbers & currencies
Transforms
– Normalization
– Casing
– Transliterations
Architecture Overview 1
Locale Based Services
– Locale is an identifier, not a container
– Keywords for variants: de@collation=phonebook
Resource inheritance: shared resources
– Levels: root → Language → Script → Region
– root
• en → US, IE
• de → DE, CH
• zh → Hant → TW; Hans → CN
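The inheritance idea can be sketched in a few lines of Java. This is only an illustration under simple assumptions: `LocaleFallbackSketch` and `fallbackChain` are hypothetical names, not ICU APIs, and the sketch simply strips keywords and truncates the identifier at each underscore until it reaches root.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an ICU API): lists the bundles that would be
// consulted for a locale ID, from most specific down to the shared root.
final class LocaleFallbackSketch {
    static List<String> fallbackChain(String localeId) {
        List<String> chain = new ArrayList<>();
        // Assumption of this sketch: keywords such as "@collation=phonebook"
        // are stripped before the lookup chain is built.
        int at = localeId.indexOf('@');
        String current = at >= 0 ? localeId.substring(0, at) : localeId;
        if (current.equals("root")) {
            current = "";
        }
        while (!current.isEmpty()) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = cut >= 0 ? current.substring(0, cut) : "";
        }
        chain.add("root");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("zh_Hant_TW"));
        System.out.println(fallbackChain("de@collation=phonebook"));
    }
}
```

A resource missing from zh_Hant_TW would then be looked up in zh_Hant, then zh, then root, so shared data is stored only once.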
Architecture Overview 2
Open and Close Service Model
– Open a service object, use it many times, close it when done
– Better performance by avoiding setup costs per operation
ICU Threading Model
– Multiple service objects in use simultaneously with same or different attributes
– Large resources shared in read-only cache
– Compatible with Java threading model
Architecture Overview 3
Data Driven Services
– Customize at build-time or run-time
– Interchange with other platforms; same results on each
– Rule-based
• Collation, Word-breaks, Transforms
– Pattern-based
• Date/Time/Number/Message formatting
– Table-based
• Character Conversion
Architecture Overview – ICU4C
Simple Error Handling
– Thread safe
– Works in C and C++
C/C++ subset for portability
Version Management
– Multiple versions of ICU4C in the same process memory space
– Data and library versioning
String Buffer Management
– Preflighting and overflow protection
Flexible
– Allows Loading and Unloading ICU4C libraries
– Runtime settable memory allocation and mutex functions
Architecture Overview – ICU4J
Supplement for Java
Core globalization (no character conversion or regular expressions)
– We do supply complex text support for Sun
Modularized: products may add just needed functionality
Usually drop-in replacement for JDK functionality
– Changing the import statements is usually all that is needed
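A minimal sketch of what "drop-in" could look like in practice. It is written against the JDK classes so that it runs without extra jars; with ICU4J on the classpath, the assumption is that only the `Collator` import would change to `com.ibm.icu.text.Collator`. The class name `DropInSketch` is made up for this example.

```java
// With ICU4J, this import would become: import com.ibm.icu.text.Collator;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class DropInSketch {
    // Returns a locale-aware sorted copy; the body is the same for either import.
    static String[] sorted(Locale locale, String... words) {
        Collator collator = Collator.getInstance(locale);
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted(Locale.ENGLISH, "banana", "apple", "Cherry")));
    }
}
```

A plain `Arrays.sort` without the collator would order by UTF-16 code unit and put "Cherry" first, which is the kind of difference a locale-aware service hides from the application.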
ICU4J: Supplement for Java
CLDR (Common Locale Data Repository)
– More fully supported locales than Java
Up-to-date globalization: standards-compliant; latest Unicode
– Supplementary characters (GB 18030, JIS X 0213, HKSCS)
• Java 5 adds handling of supplementary characters
– Full properties – JDK has only a fraction
– Unicode Collation Algorithm
– Local calendars (Islamic, Japanese, …); more time zone localizations
– Currencies, String Search, Internationalized Domain Names
– Transforms: Case, Scripts, Normalization
Much shorter release cycle and quicker support for Unicode standard
Recent Additions for ICU4J
ICU4J in Eclipse Rich Client Platform
– Take advantage of advanced globalization and support for more locales and more recent Unicode
– Consistent behavior between application code and framework code
– ICU4J 3.4+ in Eclipse 3.2
– No-op jar for JDK behavior and low-memory devices
Improved portability & JRE independence
– Back-ported to Java 1.3 (via preprocessor)
– Carries own Time Zone data
Unicode Text Handling
All UTF-16 processing
C
– UChar*: null-terminated or with length
C++
– UnicodeString: full featured string class
Java
– Uses java.lang.String and adds utilities
All handle supplementary characters
– Required for GB 18030, HKSCS and JIS X 0213 repertoires
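To illustrate why supplementary characters need care in UTF-16, here is a small sketch that uses only the code point methods that Java 5 added to java.lang.String and Character. The class name `SupplementarySketch` is an assumption for the example; ICU4J's own utilities are not shown.

```java
final class SupplementarySketch {
    // Walks a UTF-16 string by code point rather than by 16-bit code unit.
    static int[] codePointsOf(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        // U+20000 is stored as the surrogate pair D840 DC00.
        String text = "a\uD840\uDC00b";
        System.out.println("code units:  " + text.length());
        System.out.println("code points: " + codePointsOf(text).length);
    }
}
```

Code that indexes such a string by `char` would split the surrogate pair, which is why all three string forms on this slide are described as handling supplementary characters.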
Abstract Text Access
Work with non-UTF-16 text
– UTF-16 text in discontinuous buffers
– UTF-8/32 APIs (char/wchar_t compatibility)
– Non-Unicode APIs, SBCS or MBCS
UText API: Efficient interface
– Provider implementation converts to UTF-16
– Fast inline code for text access within blocks
– Storage-native indexes avoid index translation
– Writable
For now only C/C++, used in few services
[Diagram: LineBreak accessing UTF-8, SBCS, … text through UText]
Unicode Character Handling
All Unicode properties (except Unihan)
– Direct API
• Values, names, enumerations
– UnicodeSet
• Fast, compact set operations (union, intersection, …)
• Pattern-based (both Perl & POSIX syntax for properties)
– \p{greek} vs. [:greek:]
• All properties:
– [\p{lowercase}-[a-z]]
– [\p{greek} & \p{uppercase}]
[Diagram: overlapping sets Alphabetic, Uppercase and Ideographic; sample characters: a ξ, A Ξ, 不 与]
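The two set patterns above can be approximated with the JDK alone. Note the swap: this sketch uses java.util.regex character classes as a stand-in, not ICU's UnicodeSet (com.ibm.icu.text.UnicodeSet in ICU4J), and the JDK syntax differs (`&&` for intersection, `[^…]` for subtraction). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {
    // Stand-in for [\p{lowercase}-[a-z]]: lowercase characters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Stand-in for [\p{greek} & \p{uppercase}]: Greek-script uppercase characters.
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&[\\p{IsUppercase}]]");

    // True if the single character in oneChar belongs to the set.
    static boolean contains(Pattern set, String oneChar) {
        return set.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPER, "\u039E"));     // Greek capital xi
        System.out.println(contains(LOWER_NOT_ASCII, "\u03BE")); // Greek small xi
    }
}
```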
Recent Additions
Conforms to CLDR 1.3/1.4
– Adds and confirms many translated terms for languages, scripts, regions, currencies, and time zones
– Access to more CLDR items
Support for Unicode interpretation of POSIX properties
Charset detection API (ICU4J only)
Load Java resource bundles from ULocale, support zh_Hant etc.
Better modularization for memory constrained environments (ICU4C only)
New icupkg tool for data customization
– Easily update pre-built or installed data packages (.dat files)
Character Set Conversion
Precise alias information:
– When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, … )
Buffer management
– API automatically handles characters that cross buffers
– Can provide offset mappings between byte buffer and UChar buffer
Runtime customizations allowed for:
– Illegal sequences
– Undefined characters
Unicode Text Compression – SCSU, BOCU-1
Consistent conversion results across platforms
You can use more character sets at runtime or build time
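The idea of choosing at runtime how illegal sequences are handled can be sketched with the JDK's java.nio.charset API. This is a stand-in, not the ICU converter API; `ConversionErrorSketch` and `decodeUtf8` are names invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConversionErrorSketch {
    // Decodes UTF-8 bytes; the caller picks the policy for bad input:
    // REPLACE (substitute U+FFFD), IGNORE (skip), or REPORT (fail).
    static String decodeUtf8(byte[] bytes, CodingErrorAction policy) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(policy)
                .onUnmappableCharacter(policy);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = { 0x41, (byte) 0xFF, 0x42 }; // 0xFF never occurs in UTF-8
        System.out.println(decodeUtf8(bad, CodingErrorAction.REPLACE));
        System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE));
    }
}
```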
Collation: Sorting, Searching and Matching
Fast international comparison for string search; fully UCA compliant
– Compressed sort keys, optimized string comparison, sublinear string search
– Incremental sortkeys used for radix sorting
Precise binary sortkey stability over time (library versioning)
Fully data driven
– Many common rules provided
Runtime and build time rule customizations
– Strength, normalization, upper vs. lowercase first, ignore punctuation, numeric, …
– Only delta from UCA is needed for rule customization
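A short sketch of the strength attribute and of sort keys, again using java.text.Collator so it runs on a plain JDK; the assumption is that ICU4J's com.ibm.icu.text.Collator would be used the same way. `CollationSketch` is a hypothetical name.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationSketch {
    // Opens a collator once; callers are expected to reuse it for many comparisons.
    static Collator english(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    // Sort keys are precomputed byte sequences whose binary order matches compare().
    static boolean sortKeysAgree(Collator collator, String a, String b) {
        CollationKey ka = collator.getCollationKey(a);
        CollationKey kb = collator.getCollationKey(b);
        return Integer.signum(ka.compareTo(kb)) == Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        Collator primary = english(Collator.PRIMARY);   // ignores accents and case
        Collator tertiary = english(Collator.TERTIARY); // distinguishes both
        System.out.println(primary.compare("resume", "R\u00E9sum\u00E9") == 0);
        System.out.println(tertiary.compare("resume", "R\u00E9sum\u00E9") == 0);
    }
}
```

Sort keys pay off when the same strings are compared many times, for example as a database index, because the comparison becomes a plain byte comparison.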
Calendar & Time Zones
International Calendars – Islamic, Buddhist, Hebrew, Japanese
– Required for correct presentation of dates in some countries
Olson timezone support with localizations
Recent Additions:
– Many more time zone localizations
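As a rough illustration of non-Gregorian calendars and Olson time zone IDs, the sketch below uses the JDK's java.time classes as a stand-in; ICU4J's own calendar classes (for example com.ibm.icu.util.BuddhistCalendar) are not shown. The class name and the sample dates are arbitrary choices for the example.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarSketch {
    // Year in the Buddhist era for a Gregorian (ISO) date.
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the Japanese imperial era for a Gregorian (ISO) date.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    // UTC offset in whole hours for a local time in an Olson (TZ DB) zone.
    static int utcOffsetHours(String olsonId, LocalDateTime local) {
        return ZoneId.of(olsonId).getRules().getOffset(local).getTotalSeconds() / 3600;
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2006, 3, 1);
        System.out.println(buddhistYear(day));
        System.out.println(japaneseYearOfEra(day));
        System.out.println(utcOffsetHours("Australia/Melbourne", LocalDateTime.of(2006, 7, 1, 12, 0)));
    }
}
```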
Formatting
Date & time: 8 formats per locale by default
Messages
– Completely localizable, plural support
Numbers & currencies
– Scientific Notation, Spelled-out (checks, etc.)
– Full Orthogonal Currency support (any currency × any language)
• INR In Hindi: रु१,२३४.५७
• INR In English: Rs. 1,234.57
• INR In German: Rs. 1.234,57
Recent Additions
– Flexible date/time formatting: Select format for desired set of fields
– List available currencies API
– Short and stand-alone month/day names
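The pattern-based formatters can be sketched with the JDK's java.text classes, which ICU4J mirrors in com.ibm.icu.text. The sketch is illustrative only: `FormattingSketch` is a made-up name, the currency symbol printed depends on the locale data installed, and the plural handling shown is the simple choice-format style rather than anything ICU-specific.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

final class FormattingSketch {
    // Same value, locale-specific grouping and decimal separators.
    static String number(Locale locale, double value) {
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        format.setMinimumFractionDigits(2);
        format.setMaximumFractionDigits(2);
        return format.format(value);
    }

    // "Orthogonal" currency: the currency is chosen independently of the locale.
    static String money(Locale locale, String isoCurrency, double amount) {
        NumberFormat format = NumberFormat.getCurrencyInstance(locale);
        format.setCurrency(Currency.getInstance(isoCurrency));
        return format.format(amount);
    }

    // A localizable message whose wording depends on the number.
    static String fileCount(int n) {
        MessageFormat message = new MessageFormat(
                "{0,choice,0#no files|1#one file|1<{0,number,integer} files}", Locale.US);
        return message.format(new Object[] { n });
    }

    public static void main(String[] args) {
        System.out.println(number(Locale.US, 1234.57));
        System.out.println(number(Locale.GERMANY, 1234.57));
        System.out.println(money(Locale.GERMANY, "INR", 1234.57));
        System.out.println(fileCount(5));
    }
}
```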
Globalization Preferences
ICU: Convenient container for globalization preference values
Provides defaults for missing values
Convenient instantiation of related formatters
Preference    Example                          Standard
Language      en_US (or en-US)                 RFC 3066 (or successor)
Region        AU                               ISO 3166
Currency      EUR                              ISO 4217
Time zone     Australia/Melbourne              TZ DB
Calendar      islamic-civil                    CLDR Calendar ID
Custom Date   yyyy-mmm-dd                      CLDR Pattern Format
VAT           08.23% (books), 15.73% (food)    App/Country-Specific
…
Exact Composition Depends on System Requirements!
Transforms
Unicode Normalization
– Highly optimized for performance
– Performance utilities: concatenation, detection, comparison
Casing (upper, lower, title, folding)
General Transforms
– Script transliterations
– Half-width/Full-width, Hex, etc.
– Chain transforms together, filter source characters
– Rule-based, customizable at runtime.
StringPrep: Internationalized Domain Names (IDN), NFS
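Three of the transforms on this slide can be tried with JDK classes alone. This is a sketch with stand-ins (java.text.Normalizer, String case mapping, java.net.IDN), not the ICU transform APIs, and `TransformSketch` is a hypothetical name.

```java
import java.net.IDN;
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {
    // Composes base letter + combining mark into the precomposed form.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decomposes precomposed characters into base letter + combining marks.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Lowercasing is language-sensitive, for example the Turkish dotless i.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    // Converts an internationalized domain name to its ASCII (Punycode) form.
    static String toAsciiDomain(String domain) {
        return IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        System.out.println(nfc("a\u0301").length());
        System.out.println(lower("I", "tr").equals("\u0131"));
        System.out.println(toAsciiDomain("b\u00FCcher.example"));
    }
}
```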
Segmentation: word, line & sentence
Fast state-table implementation
Customizable
– Rule-based – customizable at runtime
– Special customizations, e.g. Thai
Recent Additions:
– Uses new UText API
– ICU4J rule syntax aligned with ICU4C
– Tracks Unicode Line Break changes
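Word segmentation can be sketched with java.text.BreakIterator, whose ICU4J counterpart is com.ibm.icu.text.BreakIterator. The helper below is an assumption-laden illustration: `WordBreakSketch` is a made-up name, and treating a segment as a word when it starts with a letter or digit is a simplification chosen for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    // Splits text at word boundaries and keeps only segments that look like words.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Spaces and punctuation are segments too; skip them here.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
    }
}
```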
Unicode Regular Expressions
Full Regex Implementation
– C/C++ only: Java 1.4 has own package (though not as powerful)
All Unicode Properties
– Supported through UnicodeSet
Good performance
– Competitive with non-Unicode regex
Complex-text layout engine
Glyph processing, positioning & adjustment
– Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.
Support for:
– Information for drawing
– Caret Display
– Hit Testing
– Selection Highlighting
– Caret Movement
– Layout Metrics
– Line Break
– Canonical Equivalence: a + ´ or á
Recent Additions:
– Tibetan, Sinhala, Indic ZWJ/ZWNJ, Kerning
References
ICU main site:
– http://www.ibm.com/software/globalization/icu/
– Links to
• Download ICU
• User Guide, Technical FAQ, Support, Bug Reports, Demonstrations
ICU support site:
– http://icu.sourceforge.net/
Unicode Consortium
– http://www.unicode.org/
• Unicode glossary, Unicode character database