![Page 1: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/1.jpg)
What’s New in Globalization?
Mark Davis
President & Cofounder
The Unicode Consortium
![Page 2: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/2.jpg)
The Unicode Standard, Version 5.0
“Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.”
— Donald E. Knuth
“For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.”
— Bill Gates
“The path W3C follows to making text on the Web truly global is Unicode.”
— Sir Tim Berners-Lee, KBE
“Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.”
— James Gosling
![Page 3: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/3.jpg)
The Unicode Standard, Version 5.0
Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.
Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8, …
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code
![Page 4: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/4.jpg)
U5.0 Unicode Character Database
Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
Scripts: Unassigned code points → Zzzz
Casing Stability: Upper → folded
BIDI: Consistent Bidi_Mirrored
Now Normative: kIICore
Line Break: SE Asian → Complex_Context
New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
General: 99,089
Private Use: 137,468
Surrogate: 2,048
Noncharacter: 66
Reserved: 875,441
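Several of the code point counts on this slide follow from the fixed layout of the code space and can be checked with plain arithmetic. The Python sketch below is illustrative only; the constant names are not from the talk, and the split of the remainder into assigned ("General") and reserved code points is version-specific data taken from the slide, not computed.

```python
# Version-independent arithmetic over the Unicode code space.
TOTAL = 0x110000                      # U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1      # U+D800..U+DFFF

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
NONCHARACTERS = (0xFDEF - 0xFDD0 + 1) + 2 * 17

# Private use: U+E000..U+F8FF plus planes 15 and 16, minus two noncharacters per plane.
PRIVATE_USE = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)

# Everything else is either assigned ("General") or reserved for future assignment.
REMAINDER = TOTAL - SURROGATES - NONCHARACTERS - PRIVATE_USE

print(TOTAL, SURROGATES, NONCHARACTERS, PRIVATE_USE, REMAINDER)
```

The remainder comes to 974,530, which matches the slide's General (99,089) and Reserved (875,441) figures added together.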
![Page 5: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/5.jpg)
U5.0 Conformance
Stable Case-Folded ≈ Upper → Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics
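Case folding, mentioned at the top of this slide, is what makes caseless comparison dependable. As a rough illustration, Python's built-in `str.casefold()` applies Unicode full case folding using whatever Unicode version the interpreter ships with (not necessarily 5.0). The helper name `caseless_equal` is an assumption made for this sketch.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()

# Simple lowercasing is not enough: the sharp s has no one-to-one lowercase match.
lower_match = "Straße".lower() == "STRASSE".lower()
fold_match = caseless_equal("Straße", "STRASSE")

# Capital sigma and final small sigma fold to the same letter.
sigma_match = caseless_equal("\u03a3", "\u03c2")

print(lower_match, fold_match, sigma_match)
```

Folding maps every case variant to one canonical form, so the comparison does not depend on which direction (upper or lower) a particular implementation happens to convert.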
![Page 6: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/6.jpg)
5.0 Annexes: Core
UAX #9: Bidirectional Algorithm
Tightened conformance requirements
UAX #15: Unicode Normalization Forms
New Stream-Safe Text Format
Appendix of characters requiring special handling
Expanded info on stability guarantees
Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
Added profiles & information on usage
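To make the normalization forms of UAX #15 concrete, here is a small Python sketch using the standard `unicodedata` module. It uses the Unicode data bundled with the interpreter; the helper name `nfc_equal` is illustrative, not part of any standard API.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence by normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two spellings differ code point by code point, but are canonically equivalent.
raw_equal = composed == decomposed
canonical_equal = nfc_equal(composed, decomposed)

# NFKC additionally folds compatibility characters such as the "fi" ligature.
ligature = unicodedata.normalize("NFKC", "\ufb01")

print(raw_equal, canonical_equal, ligature)
```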
![Page 7: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/7.jpg)
U5.0 Annexes: Boundaries
UAX #14: Line Breaking Properties
Rules modified to improve behavior
Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
Edge cases improved
Tailorings for text boundaries now in Unicode CLDR
Format of the rules changed to ease implementation
Additional guidelines on regex, identifiers,…
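The idea behind text boundaries can be sketched with a deliberately simplified rule: never break between a base character and the combining marks that follow it. The Python below is only that one rule, not the UAX #29 algorithm; it ignores Hangul syllables, CR LF, joiners, and every tailoring, and the function name `rough_clusters` is an assumption made for this example.

```python
import unicodedata

def rough_clusters(text: str) -> list[str]:
    """Group each character with any combining marks (category M*) that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

sample = "e\u0301a"                 # e + combining acute, then a
print(len(sample), len(rough_clusters(sample)))
```

A production implementation would use a library that follows the full rule set and data, such as ICU's break iterators.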
![Page 8: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/8.jpg)
U5.0 Characters by Script
Phags Pa
Phoenician
Devanagari
Hebrew
Greek
Kannada
Nko
Common
Latin
Inherited
Cyrillic
Cuneiform
Balinese
![Page 9: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/9.jpg)
Unicode Character Timeline
[Chart: logarithmic axis from 1 to 1,000,000; Unicode versions 2.0.0, 2.1.2, 3.0.0, 3.1.0, 3.2.0, 4.0.0, 4.1.0, 5.0.0; series: Letter, Symbol, Mark, Number, Punctuation, Control/Format, Separator]
![Page 10: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/10.jpg)
Unicode Guide for Programmers
Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key “Gotchas”
Recognize and Avoid
Details on
Encoding & conversions: UTF-8, 16, 32 & BOM
Using character properties
Text Operations
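As an illustration of the encoding points above, the following Python sketch encodes one supplementary-plane character in UTF-8, UTF-16, and UTF-32, and shows how a UTF-8 byte order mark behaves on decoding. The sample character and variable names are chosen for this example and do not come from the guide itself.

```python
import codecs

clef = "\U0001D11E"                  # MUSICAL SYMBOL G CLEF, outside the BMP

utf8 = clef.encode("utf-8")          # four bytes
utf16 = clef.encode("utf-16-be")     # a surrogate pair, two 16-bit units
utf32 = clef.encode("utf-32-be")     # one 32-bit unit

# A UTF-8 BOM is the three bytes EF BB BF at the start of the data.
with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")
stripped = with_bom.decode("utf-8-sig")   # the "-sig" codec removes a leading BOM
kept = with_bom.decode("utf-8")           # plain UTF-8 keeps it as U+FEFF

print(utf8.hex(), utf16.hex(), utf32.hex(), repr(stripped), repr(kept))
```

The "gotcha" here is that a string decoded without BOM handling silently starts with an invisible U+FEFF, which then breaks comparisons and parsing.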
![Page 11: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/11.jpg)
Unicode Common Locale Data Repository: CLDR
Key locale data for world languages
Most extensive standard repository of locale data
XML format
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
¥ 1,234.57
1 234,57руб.
Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…
Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
Z < Å
![Page 12: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/12.jpg)
Unicode CLDR 1.4
121 languages and 142 territories – 360 locales in all
25% more locale data; over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks…)
Transliterations (eg Ελληνικά ↔ Ellēniká)
Data for lenient date/time formatting and parsing
Programmer asks for “numeric day” + “abbreviated month”
Best format pattern returned, eg “dd.MMM”
+ Quarters in dates (eg 2006Q1)
BCP 47 compatibility + extensions
![Page 13: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/13.jpg)
BCP 47 Language Tags
Usage: HTTP, HTML, XML; CLDR Locale IDs…
RFC 4646; Obsoletes RFCs 1766, 3066
Addresses problems in RFC 3066
ISO standards: stability / accessibility / ambiguity
Parseability, Extensibility; Registration speed
Identification of script (where necessary):
Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
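The parseability point can be shown with a small sketch: because subtags have fixed shapes (a 2 or 3 letter language, a 4 letter script, a 2 letter or 3 digit region), a tag can be split without consulting a registry. The Python below handles only the language-script-region subset; variants, extensions, private use, and grandfathered tags are deliberately left out, and the names `LangTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]

# Subset of the language tag grammar: language[-script][-region] only.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def parse_tag(tag: str) -> LangTag:
    """Split a simple language tag and apply the conventional casing."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = m.group("script")
    region = m.group("region")
    return LangTag(
        m.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )

for example in ("zh-Hant", "sr-Latn", "az-Cyrl"):
    print(parse_tag(example))
```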
![Page 14: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/14.jpg)
Unicode Security
Examples: Visual Confusables: “paypal.com” with Cyrillic ‘a’…
Non-visual problems: buffer overflows, non-shortest form,…
UTR #36: Unicode Security Considerations
Guidelines & Recommendations
UTS #39: Unicode Security Mechanisms
Algorithms & Data
Limitations on Repertoire
Testing for Confusables
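Both kinds of problem named on this slide can be demonstrated in a few lines of Python. The sketch below is not the UTS #39 confusable test: it guesses a character's script from the first word of its Unicode name, which is a crude stand-in for real script data, and the helper names are assumptions made for this example.

```python
import unicodedata

genuine = "paypal.com"
spoof = "p\u0430ypal.com"    # second letter is U+0430 CYRILLIC SMALL LETTER A

def rough_scripts(text: str) -> set[str]:
    """Crude script guess: first word of each letter's character name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

def is_well_formed_utf8(data: bytes) -> bool:
    """Return False for ill-formed input, including non-shortest forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(genuine == spoof, rough_scripts(genuine), rough_scripts(spoof))

# C0 AF is a non-shortest-form encoding of "/", which a strict decoder must reject.
print(is_well_formed_utf8(b"/"), is_well_formed_utf8(b"\xc0\xaf"))
```

The two domain strings look alike in many fonts but compare unequal, and the mixed-script result is the kind of signal a repertoire restriction would act on.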
![Page 15: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/15.jpg)
Internationalized Domain Names
One instance of broad problem
Many RFCs use Nameprep – limited to Unicode 3.2
Unicode recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
Some positive developments, but misreads Unicode, needs more work
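The Unicode 3.2 limitation is visible in implementations of the original IDNA. For example, Python's built-in `idna` codec implements RFC 3490 with Nameprep, and the standard library exposes the frozen 3.2 database it relies on as `unicodedata.ucd_3_2_0`. The domain name below is a made-up example, not one from the talk.

```python
import unicodedata

name = "bücher.example"

# ToASCII: each non-ASCII label becomes an "xn--" Punycode label after Nameprep.
ace = name.encode("idna")

# ToUnicode: the reverse mapping for display.
back = ace.decode("idna")

# Nameprep is pinned to this Unicode version, regardless of the interpreter's own data.
pinned_version = unicodedata.ucd_3_2_0.unidata_version

print(ace, back, pinned_version, unicodedata.unidata_version)
```

Characters assigned after Unicode 3.2 are therefore outside what such a codec can validate, which is the coverage gap the recommendations address.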
![Page 16: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/16.jpg)
URL → IRI
Internationalized Resource Identifier (IRI)
UTF-8, %-escaped
Example:
http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html
http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D... %E8%B1%86.html
See http://ietf.org/rfc/rfc3987.txt
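The mapping from the IRI form to the escaped form can be reproduced for a single path segment: encode the text as UTF-8, then percent-escape each non-ASCII byte. This Python sketch uses `urllib.parse.quote`, which encodes as UTF-8 by default; it covers only the path segment `JP納豆` from the example and does not attempt the full IRI-to-URI conversion of RFC 3987, which treats the host name differently.

```python
from urllib.parse import quote, unquote

segment = "JP納豆"

# Each ideograph becomes three UTF-8 bytes, each written as %XX.
escaped = quote(segment)

# Unescaping and decoding as UTF-8 restores the original text.
restored = unquote(escaped)

print(escaped, restored)
```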
![Page 17: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/17.jpg)
Ideographic Variation Database
U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants
![Page 18: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/18.jpg)
Ideographic Variation Database
Variation Selector
Identifies a restriction on the appearance of a character
Character + Variation Selector = Variation Sequence
Han ideographs
Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…
Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
A given variation sequence is used in at most one collection
Makes interchange of variation sequences reliable.
Registration, not Assessment
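In code, a variation sequence is simply two code points in a row: the base ideograph followed by a selector from the range U+E0100 to U+E01EF. The Python sketch below pairs U+82A6 with VARIATION SELECTOR-17 purely for illustration; whether that particular sequence is registered, and which glyph it selects, depends on the database and is not asserted here. The helper names are hypothetical.

```python
import unicodedata

BASE = "\u82a6"             # the ideograph used in the talk's example
SELECTOR = "\U000E0100"     # first ideographic variation selector, chosen for illustration
sequence = BASE + SELECTOR

def is_ideographic_vs(ch: str) -> bool:
    """True for the variation selectors used with ideographs (U+E0100..U+E01EF)."""
    return 0xE0100 <= ord(ch) <= 0xE01EF

def strip_variation_selectors(text: str) -> str:
    """Drop ideographic variation selectors, leaving the unrestricted base characters."""
    return "".join(ch for ch in text if not is_ideographic_vs(ch))

print(len(sequence), unicodedata.name(SELECTOR), strip_variation_selectors(sequence) == BASE)
```

Software that does not recognize a sequence can ignore the selector and still display the base character, which is why interchange degrades gracefully.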
![Page 19: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/19.jpg)
ICU 3.6
Mature, portable C/C++/Java int’l libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
Charset Detection
Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…
ICU4J
Globalization Preferences
Flexible date/time formats*, Charset conversion*
![Page 20: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/20.jpg)
Near-Term Issues
Unicode 5.0.1, Unicode 5.1
CLDR / BCP 47bis
LDAP
Collation Registry
IANA Charset Registry
![Page 21: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/21.jpg)
Unicode 5.1 - possibilities
Characters
CJK Unified Ideographs Extension C
Minority Scripts: Cham and Lanna
Malayalam chillu
…
Properties/Behavior
Normalization process for stable strings
…
![Page 22: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/22.jpg)
CLDR 1.5 / BCP 47bis
CLDR 1.5
Data Submission Starting November
New structures / data
BCP 47
Adding ~7,000 (!) new language subtags
Possibly other changes…
![Page 23: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/23.jpg)
LDAP
Now has definitive comparison
(good)
Stuck at Unicode 3.2
(bad)
http://www.ietf.org/rfc/rfc4518.txt
![Page 24: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/24.jpg)
Collation Registry
Nearing approval
Adds ability to register comparisons
Workable for basic cases
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt
![Page 25: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/25.jpg)
IANA Charset registry
Currently limited usefulness
Ill-defined
Missing mapping tables
Incomplete
Inaccurate
Regime Change
Hope for future improvements!
![Page 26: Whats New in Globalization? Mark Davis President & Cofounder The Unicode Consortium](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f45503462d4e8b5c17/html5/thumbnails/26.jpg)
What’s New in Globalization?
Mark Davis
President & Cofounder
The Unicode Consortium