![Page 1: New in Unicode Mark Davis, John Jenkins. Agenda Unicode 4.1.0 UCA 4.1.0 Regular Expressions Security Considerations Character Mapping Common Locale Data](https://reader033.vdocuments.us/reader033/viewer/2022061306/551467f1550346494e8b5c29/html5/thumbnails/1.jpg)
New in Unicode
Mark Davis, John Jenkins
Agenda
- Unicode 4.1.0
- UCA 4.1.0
- Regular Expressions
- Security Considerations
- Character Mapping
- Common Locale Data Repository
- Expanded Role for Consortium
Unicode 4.1.0
- Released 2005 March 31
- New Characters
- New Unicode Character Database
- New Specifications
1,273 New Characters
- Roundtripping for HKSCS and GB 18030
- Five new currency signs
- Additional characters for Indic and Korean
- Eight new scripts
Changes in the Standard
- Conformance Changes
  - Modifications to Default Case Operations
  - Clarification of Decomposition Mappings
- Other Changes
  - SPACE not recommended as base for nonspacing marks
  - Use of CGJ to prevent reordering, prevent contractions in sorting/matching (UCA)
  - Positioning of Meteg
  - Rendering of Thai Combining Marks
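The CGJ point above can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not taken from the slides: it assumes the example marks U+0308 (combining class 230) and U+0323 (class 220), and the helper name `code_points` is made up for display purposes.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# Diaeresis (class 230) followed by dot below (class 220):
# this sequence is not in canonical order, so normalization reorders it.
plain = "a\u0308\u0323"
# The same marks with CGJ between them; CGJ blocks the reordering.
guarded = "a\u0308" + CGJ + "\u0323"

def code_points(text):
    """Render a string as space-separated hex code points (illustrative helper)."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

plain_nfd = unicodedata.normalize("NFD", plain)
guarded_nfd = unicodedata.normalize("NFD", guarded)

print(code_points(plain), "->", code_points(plain_nfd))
print(code_points(guarded), "->", code_points(guarded_nfd))
```

Without CGJ the two marks are swapped into canonical order; with CGJ the original order survives normalization.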
Unicode Character Database
- Determines the behavior of characters in modern software: Alphabetics, Letters, Numbers, Identifiers, Scripts, …
- New properties
  - Grapheme_Cluster_Break, Sentence_Break, Word_Break, Pattern_Syntax, and Pattern_White_Space
- Revised Property Values
  - E.g. Alphabetic ⊃ (Lowercase ∪ Uppercase)
- Expanded documentation
  - Each release now complete, not delta
New Specifications
- UAX #31: Identifier and Pattern Syntax
  - Basis for Backwards-Compatible Identifiers
    - Programming Languages
    - Resources and Services
  - Basis for Stable Syntax characters
    - Whitespace
    - Operators
- UAX #34: Unicode Named Character Sequences
  - Mechanism for identifying/naming significant sequences
  - Standardized list
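As one concrete case of identifiers in a programming language: Python's language reference defines its identifier syntax in terms of UAX #31, and `str.isidentifier()` exposes that check. The sketch below is only an illustration; the candidate names and the helper `classify` are made up for the example.

```python
def classify(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}

# Hypothetical candidates: Latin, Cyrillic, and three invalid forms.
candidates = ["total", "руб", "item_2", "2fast", "a-b", "x y"]
results = classify(candidates)

for name, ok in results.items():
    print(f"{name!r}: {'identifier' if ok else 'not an identifier'}")
```

Letters from any script may start an identifier, while digits may only continue one and operators or white space never belong to it.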
Major Revisions in Annexes
- UAX #15: Unicode Normalization Forms
  - Correction for Idempotency Problem
  - Enhanced discussion of Hangul
- UAX #14: Line Breaking Properties
  - Modifications for Hangul
  - Changes because SPACE not recommended as base for nonspacing marks
  - Separated all suggested tailorings into separate section
- UAX #29: Text Boundaries
  - Using new properties, adding Joiner/Non-Joiner
  - Modifications to Word-Break
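The two UAX #15 items can be made concrete with a short sketch. It assumes the standard arithmetic Hangul syllable decomposition (constants as published in the Unicode Standard) and compares it with Python's `unicodedata`. The function name `decompose_hangul` and the sample strings are illustrative choices, not part of the slides.

```python
import unicodedata

# Constants of the arithmetic Hangul decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""  # trailing consonant is optional
    return lead + vowel + trail

def is_idempotent(form, text):
    """Check that normalizing twice gives the same result as normalizing once."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once

samples = ["한글", "가", "a\u0308\u0323", "\u00e4t.com"]
hangul_ok = all(
    decompose_hangul(ch) == unicodedata.normalize("NFD", ch) for ch in "한글가"
)
idempotent_ok = all(
    is_idempotent(form, s) for form in ("NFC", "NFD", "NFKC", "NFKD") for s in samples
)
print(hangul_ok, idempotent_ok)
```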
UTS #10: Unicode Collation Algorithm
- Basis for language-sensitive sorting, searching, and matching
- Synchronized with Unicode 4.1.0
- New:
  - Characters
  - Revised Weights
  - Specification: matching, ignorables, Thai, …
UTS #18: Unicode Regular Expressions
- Regular expressions are used widely in programs for matching patterns (e.g. wildcards)
- Unicode expands the scope drastically
- Explicit Conformance Clauses
- POSIX-Conformance
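How much Unicode widens the scope shows up even in a simple word match. The sketch below uses Python's standard `re` module, where `\w` is Unicode-aware by default and `re.ASCII` restricts it to ASCII. The sample text is an assumption made for the example, and the snippet does not claim that `re` meets any particular UTS #18 conformance level.

```python
import re

# Sample text mixing Latin, Cyrillic, ASCII digits, and Bengali digits.
text = "Montag руб 2005 ৪২"

# Default str patterns: \w matches word characters from any script.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```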
UTR #36: Unicode Security
Incorrect usage of Unicode can expose programs or systems to possible security attacks! Examples:
- Numbers: ৪୨ = 42! Bengali { ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ }, Oriya { ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ }.
- Domain Names:

| | String | UTF-16 | Internal - IDNA |
|---|---|---|---|
| 1a | ät.com | 0061 0308 0074 002E 0063 006F 006D | xn--t-zfa.com |
| 1b | ät.com | 00E4 0074 002E 0063 006F 006D | xn--t-zfa.com |
| 2a | tοp.com | 0074 03BF 0070 002E 0063 006F 006D | xn--tp-jbc.com |
| 2b | top.com | 0074 006F 0070 002E 0063 006F 006D | top.com |
| 4a | so̷s.com | 0073 006F 0337 0073 002E 0063 006F 006D | xn--sos-rjc.com |
| 4b | søs.com | 0073 00F8 0073 002E 0063 006F 006D | xn--ss-lka.com |
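Both examples above can be reproduced with Python's standard library. This is a minimal sketch under stated assumptions: the built-in `idna` codec implements IDNA 2003, which may differ from what a given registry or browser applies today, and the helpers `digit_scripts` and `to_ascii` are hypothetical names. The script detection is a rough heuristic based on character names.

```python
import unicodedata

def digit_scripts(text):
    """Heuristic: collect the script prefix of each decimal digit's character name."""
    scripts = set()
    for ch in text:
        if ch.isdecimal():
            # Names look like 'BENGALI DIGIT FOUR'; ASCII digits are just 'DIGIT FOUR'.
            name = unicodedata.name(ch)
            scripts.add(name.split(" DIGIT ")[0] if " DIGIT " in name else "ASCII")
    return scripts

def to_ascii(domain):
    """Convert a domain name to its ASCII form with the built-in IDNA 2003 codec."""
    return domain.encode("idna")

# Numbers: a Bengali 4 followed by an Oriya 2 still parses as 42.
mixed = "\u09ea\u0b68"
value = int(mixed)
scripts = digit_scripts(mixed)
print(value, sorted(scripts))

# Domain names: rows 1a/1b and 2a/2b of the table.
decomposed = to_ascii("a\u0308t.com")  # 1a: a + combining diaeresis
precomposed = to_ascii("\u00e4t.com")  # 1b: precomposed a-diaeresis
greek = to_ascii("t\u03bfp.com")       # 2a: Greek omicron
latin = to_ascii("top.com")            # 2b: Latin o
print(decomposed, precomposed, greek, latin)
```

A string whose digits come from more than one script, or a label that mixes Greek and Latin letters, is a reasonable candidate for extra scrutiny before it is shown to a user.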
Character Mapping ML
XML format for the interchange of mapping data for character encodings and aliases.
Promoted to Unicode Technical Standard, with new Conformance section (2).
Added explicit text about multi-character mappings.
Common Locale Data Repository
Common, necessary software locale data for world languages
XML format for effective interchange
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
¥ 1,234.57
1 234,57 руб.
Arabic – arabski; Bulgarian – bułgarski; Czech – czeski; …
Africa – 非洲; Central America – 中美洲; Eastern Africa – 东非; Northern Africa – 北非; …
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
Z < Å
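The number examples on this slide differ only in their locale data. The sketch below shows that idea with a tiny hand-written table. The `NUMBER_SYMBOLS` dictionary and `format_number` function are hypothetical stand-ins, not CLDR's actual XML structure or any library's API.

```python
# Hypothetical stand-in for locale number symbols; real CLDR data is
# distributed as XML and is far richer than this table.
NUMBER_SYMBOLS = {
    "en": {"decimal": ".", "group": ","},
    "de": {"decimal": ",", "group": "."},
    "ru": {"decimal": ",", "group": "\u00a0"},  # no-break space
}

def format_number(value, locale_id, digits=2):
    """Format a number using the separators recorded for the given locale."""
    symbols = NUMBER_SYMBOLS[locale_id]
    formatted = f"{value:,.{digits}f}"  # locale-neutral form, e.g. '1,234.57'
    integer, _, fraction = formatted.partition(".")
    result = integer.replace(",", symbols["group"])
    if fraction:
        result += symbols["decimal"] + fraction
    return result

for locale_id in NUMBER_SYMBOLS:
    print(locale_id, format_number(1234.571, locale_id))
```

Keeping the separators in data rather than in code is what lets one formatting routine serve every locale.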
Typical Locale Data
- Dates/time formats
- Number/Currency formats
- Measurement Systems
- Collation Specifications (UCA-based)
  - Used for sorting, searching, matching
- Tailorings of translated names for language, territory, script, timezones, currencies, …
- ...
Latest Release: CLDR 1.3
- 296 locales: 96 languages, 130 territories
  - Languages: Afar [Qafar]; Afrikaans; Albanian [shqipe]; Amharic [አማርኛ]; Arabic [العربية]; Armenian [Հայերէն]; …
  - Territories: Afghanistan [افغانستان]; Albania [Shqipëria]; Algeria [الجزائر]; Argentina; Armenia [Հայաստանի Հանրապետութիւն]; Australia; Austria [Österreich]; Azerbaijan [Azərbaycan, Азәрбајҹан]; …
- Complete set of generated POSIX-format data
  - Plus tool to generate versions tuned for different platforms.
- Expanded locale data
  - Timezone localizations
  - Including UN M.49 continents and regions
  - Many other revisions and additions of data
- New Tests & Tools
Expanded Role for Consortium
Dedicated to the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.
Providing the fundamental specifications for full software globalization, full interoperability
Full Members
Institutional & Supporting Members (New Membership Categories)
Associate Members
Liaison Members
- Center of Computer and Information Development (CCID), Beijing, China
- High Council of Informatics (HCI), Iran
- Information and Communication Technology Agency of Sri Lanka (ICTA)
- The International Forum for Information Technology in Tamil (INFITT)
- The Internet Engineering Task Force (IETF)
- ISO/IEC JTC1/SC2 and WG2
- Linguistic Society of America (LSA)
- National Endowment for the Humanities (NEH)
- National Information Standards Organization (NISO)
- NSAI/ICTSCC/SC4: Irish standardization: Codes, Character Sets, and Int’lization
- OpenI18N.org: The Free Standards Group Open Internationalization Initiative
- Research Institute for ILCAA, Tokyo University of Foreign Studies
- Research Institute for the Languages of Finland (RILF)
- Special Libraries Association (SLA)
- Technical Committee on Information Technology (TCVN/TC1), Hanoi, Viet Nam
- United Nations Group of Experts on Geographical Names (UNGEGN)
- World Wide Web Consortium - W3C I18N Core Working Group
Unicode Technical Committee
- Multiple Globalization Standards
  - The Unicode Standard, including UAXes
  - Unicode Technical Standards: Collation, …
  - Unicode Technical Notes: Best Practices, Background Information
- Quarterly F2F Meetings
- Email Discussion
CLDR Technical Committee
- Meetings
  - Short, frequent: Telecon + Instant Messaging
  - Email Discussion
- Data
  - All additions / revisions in bug database
  - Anyone can file; committee assesses, vets
Why Join?
- Support the technology
  - That enables your success in international, technical, and emerging markets.
- Protect your investment
  - The stability you need
  - The extensions you require
  - The developments you call for: security, …
- Demonstrate your leadership
  - For the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.