![Page 1: Collation in ICU 1.8 Mark Davis Chief SW Globalization Architect IBM](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1bc57f8b9ab0599d480d/html5/thumbnails/1.jpg)
Collation in ICU 1.8
Mark Davis
Chief SW Globalization Architect
IBM
Agenda
• What is Collation?
  • Features
  • Mechanisms
  • Warnings
• ICU 1.8 Collation
Note: Slides differ from printouts
Collation = Sorting Order
• How hard can it be?
  • A < B < C < …
• Complications
  • Languages are complex and varied
  • Unicode is a big set of characters
  • Performance is crucial
Varies By:
• Language
  • Swedish: z < ö
  • German: ö < z
• Usage
  • Dictionary: öf < of
  • Telephone: of < öf
• Customizations
  • A < a
  • a < A
• Versioning
  • Fixes
  • New Gov. Stds
  • New Characters
Levels
1. Base characters: a < b
2. Accents: as < às < at
  • ignored if there is an L1 difference
3. Case: ao < Ao < aò
  • ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
  • ignored* if there is an L1, L2, or L3 difference
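The first three levels can be sketched with a toy key function. This is not ICU code: `level_keys` and `collation_key` are hypothetical names, the weights are simply the letters, accent code points, and a case flag, and level 4 (punctuation) is not modeled.

```python
import unicodedata

def level_keys(text):
    """Split text into toy primary/secondary/tertiary weight lists."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent only contributes to level 2 of the preceding letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())     # level 1: base letter
        secondary.append(())           # level 2: accents (none so far)
        tertiary.append(ch.isupper())  # level 3: case, lowercase first
    return primary, secondary, tertiary

def collation_key(text, strength=3):
    """Keep the first `strength` levels; tuples compare level by level."""
    return level_keys(text)[:strength]

print(sorted(["at", "às", "as"], key=collation_key))
print(sorted(["aò", "Ao", "ao"], key=collation_key))
```

Because the whole primary list is compared before any secondary weight, an accent difference only matters when the base letters are identical, which is the "ignored if there is an L1 difference" rule.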
Context Sensitivity
• Contractions
  • H < Z, but CZ < CH
• Expansions
  • OE < Œ < OF
• Both
  • カー < カイ
  • キー > キイ
Canonical Equivalence
• Å ≡ Å ≡ A + º
• x + . + ^ ≡ x + ^ + .
• ự ≡ ư + . ≡ ụ + ’ ≡ u + . + ’ ≡ u + ’ + .
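These equivalences can be checked with Python's standard `unicodedata` module (a sketch, not part of ICU). The helper name `canonically_equivalent` is hypothetical; it assumes two strings are canonically equivalent exactly when their NFD forms match.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A-ring, Angstrom sign, A + combining ring above
a_ring_forms = ["\u00C5", "\u212B", "A\u030A"]

# u with horn and dot below, spelled five ways
u_horn_dot_forms = [
    "\u1EF1",         # precomposed
    "\u01B0\u0323",   # u-horn + dot below
    "\u1EE5\u031B",   # u-dot-below + horn
    "u\u0323\u031B",  # u + dot below + horn
    "u\u031B\u0323",  # u + horn + dot below
]

print(all(canonically_equivalent(a_ring_forms[0], s) for s in a_ring_forms))
print(all(canonically_equivalent(u_horn_dot_forms[0], s) for s in u_horn_dot_forms))
print(canonically_equivalent("x\u0323\u0302", "x\u0302\u0323"))
```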
Oddities
• Normal accents
  • cote < coté < côte < côté
    • first accent difference determines order
• French accents
  • cote < côte < coté < côté
    • last accent difference determines order
• Il-logical Order (Thai, Lao)
  • เ ก sorts like ก เ
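The two accent orderings differ only in the direction in which level-2 weights are compared. A minimal sketch, assuming toy weights (the accent's code point) and a hypothetical `accent_key` helper:

```python
import unicodedata

def accent_key(word, french=False):
    """Toy two-level key; French ordering compares accents from the end."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] = ord(ch)  # toy accent weight
            continue
        primary.append(ch)
        secondary.append(0)              # 0 = no accent
    if french:
        secondary.reverse()              # last accent difference wins
    return primary, secondary

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=accent_key))
print(sorted(words, key=lambda w: accent_key(w, french=True)))
```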
Merging Database Fields
• F1 = LastName, F2 = FirstName

| Sequential: F1, then F2 | Weak 1st: F1 (L1), F2 | Merged: L1, L2, L3 |
|---|---|---|
| diSilva, John | diSilva, John | diSilva, John |
| diSilva, Fred | dísilva, John | di Silva, John |
| di Silva, John | di Silva, John | dísilva, John |
| di Silva, Fred | di Silva, Fred | diSilva, Fred |
| dísilva, John | diSilva, Fred | di Silva, Fred |
| dísilva, Fred | dísilva, Fred | dísilva, Fred |
Customizations
• Parameters that change collation behavior
  • Choice of language (locale)
  • Runtime choices
• Examples to follow
Parametric Customizations
• Strength
  • Base
  • Base + Accent
  • Base + Accent + Case
• Case
  • A < a
  • a < A
• Punctuation
  • di Silva < diSilva
  • diSilva < di Silva
Punctuation (Alternates)

| Base Character | Ignorable |
|---|---|
| di silva | Dickens |
| di Silva | di silva |
| Di silva | disilva |
| Di Silva | di Silva |
| Dickens | diSilva |
| disilva | Di silva |
| diSilva | Disilva |
| Disilva | Di Silva |
| DiSilva | DiSilva |
Extended Customizations
• User-defined
  • “&” ≡ “ampersand”
• Merging tailorings
  • Iranian + French
• Script Order
  • b < ב < β < б
  • β < b < б < ב
• Numbers
  • A-1 < A-234
  • A-234 < A-1
Collation also used for:
• Searching
  • ignore case, accent options
• Selection
  • Return all records where
    • Jones ≤ name < Smith
• Graphemes
  • What a user considers a “character”
  • Regular expressions (Level 3)
    • UTR #18
UCA
• UTS #10: Unicode Collation Algorithm
  • Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  • Default ordering: all Unicode code points
  • Provides for tailoring to given languages
  • Also see: The Unicode Standard, §5.17: Sorting and Searching
• Aligned with ISO 14651
APIs
• String Compare
• Sort Keys
• String Search
Sort Keys
• Transform a string into a series of bytes that binary-compare in collation order
a: 06 C3 01 20 01 02 00
A: 06 C3 01 20 01 08 00
á: 06 C3 01 20 32 01 02 02 00
ab: 06 C3 06 D7 01 20 20 01 02 02 00
b: 06 D7 01 20 01 02 00
Layout of each key: Level 1 bytes, 01, Level 2 bytes, 01, Level 3 bytes, 00
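The keys listed above can be reproduced with a small sketch. The weight values are read off those keys; the `WEIGHTS` table and `sort_key` function are hypothetical names, and treating á as a + combining acute with an ignorable primary weight is an assumption that happens to match the listed bytes. Real ICU keys come from the full collation tables.

```python
import unicodedata

# (primary, secondary, tertiary) per character, read off the listed keys.
WEIGHTS = {
    "a": (0x06C3, 0x20, 0x02),
    "A": (0x06C3, 0x20, 0x08),
    "b": (0x06D7, 0x20, 0x02),
    "\u0301": (0x0000, 0x32, 0x02),  # combining acute: no primary weight
}

def sort_key(text):
    """Concatenate all level-1 weights, then level 2, then level 3."""
    elements = [WEIGHTS[ch] for ch in unicodedata.normalize("NFD", text)]
    key = bytearray()
    for primary, _, _ in elements:
        if primary:                       # skip ignorable primaries
            key += primary.to_bytes(2, "big")
    key.append(0x01)                      # level separator
    key += bytes(secondary for _, secondary, _ in elements)
    key.append(0x01)
    key += bytes(tertiary for _, _, tertiary in elements)
    key.append(0x00)                      # terminator
    return bytes(key)

for s in sorted(["b", "ab", "á", "A", "a"], key=sort_key):
    print(s, " ".join(f"{byte:02X}" for byte in sort_key(s)))
```

Plain byte comparison of the keys gives a < A < á < ab < b, since all primary weights come before the first 01 separator.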
String Compare vs. Sort Keys
• Same results in either case
• String compare (SC) faster for single comparisons
  • average 5 to 10 times!
• Sort keys (SK) faster for multiple comparisons
  • index once, binary compare many times
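The same trade-off exists in the C library functions exposed by Python's `locale` module, which stand in for ICU here: `locale.strcoll` is a string compare and `locale.strxfrm` builds a sort key. A sketch using whatever locale is currently active; the word list is illustrative only.

```python
import locale
from functools import cmp_to_key

words = ["pear", "Apple", "fig", "apple", "Fig"]

# String compare: the comparison is redone for every pair the sort examines.
by_compare = sorted(words, key=cmp_to_key(locale.strcoll))

# Sort keys: each string is transformed once, then keys are compared directly.
by_sort_key = sorted(words, key=locale.strxfrm)

print(by_compare == by_sort_key)
```

Both calls give the same order; which is faster depends on how many comparisons each string takes part in.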
String Search
• Naïve Approach
  • key matches in target at <x, y> iff target.substring(x, y) ≡ key
• Boundary Complications
  • Ignorables: “a” matches in “(a)”?
    • at <0,2> & <1, 2> & <0,3> & <1,3>?
  • Contractions: “c” matches in “churo”?
  • Normalization: “å” matches in “a¸˚”?
WARNING 1: Basics
• Not aligned with character set or repertoire
  • Latin-1: Swedish and German sorting differ
• Not code point (binary) order
  • Binary: Z < a < v < w
  • English: Z > a
  • Swedish: v ≡ w
• Not a property of strings
  • With same database
    • Swedish user: view/select
    • German user: view/select
WARNING 2: Operations
• Order not preserved under concatenation / substringing
  • x < y ↛ xz < yz
  • x < y ↛ zx < zy
  • xz < yz ↛ x < y
  • zx < zy ↛ x < y
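A contraction is enough to break the first implication. The sketch below assumes a toy alphabet in which "ch" is a single letter sorting between "c" and "d" (as in traditional Spanish); `WEIGHTS` and `key` are hypothetical names, not ICU API.

```python
import string

# Toy primary weights with gaps; "ch" is one letter between "c" and "d".
WEIGHTS = {letter: 2 * i for i, letter in enumerate(string.ascii_lowercase)}
WEIGHTS["ch"] = WEIGHTS["c"] + 1

def key(word):
    """Map a lowercase ASCII word to weights, treating "ch" as a contraction."""
    out, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            out.append(WEIGHTS["ch"])
            i += 2
        else:
            out.append(WEIGHTS[word[i]])
            i += 1
    return out

x, y, z = "c", "cg", "h"
print(key(x) < key(y))          # x sorts before y
print(key(x + z) < key(y + z))  # "ch" versus "cgh"
```

Appending "h" turns "c" into the single letter "ch", which sorts after anything starting with plain "c", so the order of the two strings flips.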
WARNING 3: Dependence
• Collation is a relation over strings
  • Sort keys embody part of that relation
  • Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.
  • Example: tailoring C < CH < D may move the binary value for D
WARNING 4: Stability
• Stable Sort
  • Records with equal comparison come out in original order
  • Property of algorithm, not comparison
• Semi-Stable Comparison
  • x ≠ y → x ≢ y
  • Property of comparison, not algorithm
  • Degrades performance
  • Doesn’t do what people think (or really want)!
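The difference can be seen with Python's built-in sort, which is stable. The sketch assumes a toy level-1 key (lowercased name) and models the semi-stable comparison as a code point tie-break; the record data is made up.

```python
records = [("smith", 3), ("Smith", 1), ("smith", 2)]

def primary(name):
    """Toy level-1 key: case is ignored."""
    return name.lower()

# Stable sort: records that compare equal keep their input order.
stable = sorted(records, key=lambda r: primary(r[0]))

# Semi-stable comparison: ties are broken by the code points of the name,
# so unequal strings never compare equal. Input order is not what decides.
semi_stable = sorted(records, key=lambda r: (primary(r[0]), r[0]))

print(stable)
print(semi_stable)
```

In the second result "Smith" moves to the front because "S" has a lower code point than "s", regardless of where it appeared in the input.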
ICU (Int’l Components for Unicode)
• Open-source: C, C++, Java, JNI
• Charset Conversions, Locales, Resources, Collation, Calendars, Time zones (daylight), Transliteration, Normalization, Boundaries (grapheme, word, line, sentence), Format/Parse (numbers, currencies, dates, times, messages)
• Cross-Platform: Windows, Unix, 390, …
• Architecture ≡ Java
• http://oss.software.ibm.com/icu/
ICU/Java Collation Architecture
• L1-3, contractions, expansions, …
• Locale tailorings
• Fully rule-based specification
• Arbitrary runtime user customizations
  • & ‘?’ = ‘question mark’
  • & ‘$’ = ‘dollar sign’
  • & z < ‘george’
ICU 1.8.1 Collation Revision
• full UCA compliance
• full supplementary character support
• much better performance
• much smaller sort-keys
• smaller memory footprint
• smaller disk footprint
• additional parametric control
• additional tailoring control
Coding Style for Performance
• Avoided unnecessary function calls.
  • Example: strlen too expensive!
• Avoided use of objects
  • Rewrote core code in C
  • C++ API wraps the C core code.
• Fast-pathed common cases
• Used stack memory buffers
  • (with expansion if necessary)
• Made inner loops as tight as possible
Fractional UCA
• Fractional weights for compression
• Gaps for tailoring, future UCA additions
• Only stores differences in tailoring file
• Reduces memory footprint

| Character | UCA primary | UCA secondary | UCA tertiary | Frac. UCA primary | Frac. UCA secondary | Frac. UCA tertiary |
|---|---|---|---|---|---|---|
| a | 0861 | 20 | 02 | 17 | 03 | 03 |
| æ | 0865 | 20 | 02 | 18 60 | 03 | 03 |
| ɒ | 0871 | 20 | 02 | 18 66 | 03 | 03 |
| b | 0875 | 20 | 02 | 19 | 03 | 03 |
Flat-File I
• Flat-file (memory mapped)
  • speeds initialization
  • reduces memory footprint
  • (next slide)
Flat-File II
• Old: separate allocations
• New: offsets within mem-map
Delta Tailoring II
[Diagram: lookup of “a”. FR tailoring: found, or not found → UCA: found, or not found → code: synthesized]
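The lookup chain in the diagram can be sketched as a three-step fallback. Everything here is illustrative: the table contents, the `lookup` name, and the formula for synthesized weights are assumptions, not ICU's actual data or implicit-weight scheme.

```python
# Hypothetical tables: the tailoring stores only entries that differ from UCA.
UCA_TABLE = {"a": (0x17,), "b": (0x19,)}
FR_TAILORING = {"b": (0x18, 0x80)}  # made-up override for illustration

def lookup(ch, tailoring, base):
    """Return (source, weight): tailoring first, then UCA, then synthesized."""
    if ch in tailoring:
        return "tailoring", tailoring[ch]
    if ch in base:
        return "UCA", base[ch]
    # Not in any table: derive a weight from the code point (toy formula).
    return "synthesized", (0xE0, ord(ch))

for ch in ["a", "b", "\u4e00"]:
    print(repr(ch), lookup(ch, FR_TAILORING, UCA_TABLE))
```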
Processing Overview
• Checks for identical prefixes
• Tolerant of most unnormalized text
  • invokes normalization rarely
• Uses “exceptional values”
• Compresses sort keys
• Incremental length/normalization
Identical Prefixes
• Sorting / Searching Databases
  • Many comparisons to “close” strings
• Check initial prefixes with binary compare
• Drop into collation loop at first difference
• Complication…
Initial Prefix Complication
• Need to back up if in “bad” position:

| Type | Example |
|---|---|
| Contraction (Spanish) | c h |
| Normalization | a ° |
| Surrogate Pair | <L> <T> |
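A minimal sketch of the prefix check with back-up, under stated assumptions: the set `CONTRACTION_TAILS` and both function names are hypothetical, and a position counts as "bad" when the next character is a combining mark or could be the second half of a contraction. Python strings are sequences of code points, so the surrogate-pair case does not arise here.

```python
import unicodedata

CONTRACTION_TAILS = frozenset("h")  # hypothetical: "h" may complete "ch"

def is_unsafe_start(text, index):
    """True if collation cannot safely start at text[index]."""
    if index >= len(text):
        return False
    ch = text[index]
    return unicodedata.combining(ch) != 0 or ch in CONTRACTION_TAILS

def safe_prefix_length(a, b):
    """Length of the identical prefix that can be skipped before collating."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:   # plain binary comparison
        n += 1
    # Back up while either string would resume in the middle of a unit.
    while n > 0 and (is_unsafe_start(a, n) or is_unsafe_start(b, n)):
        n -= 1
    return n

print(safe_prefix_length("abcd", "abce"))
print(safe_prefix_length("ca", "ch"))
print(safe_prefix_length("xa", "xa\u030a"))
```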
Fast C or D (FCD)
• Accepts all NFD, most NFC, without normalization

| X | FCD | NFC | NFD |
|---|---|---|---|
| A-ring | Y | Y | |
| Angstrom | Y | | |
| A + ring | Y | | Y |
| A + grave | Y | | Y |
| A-ring + grave | Y | | |
| A + cedilla + ring | Y | | Y |
| A + ring + cedilla | | | |
| A-ring + cedilla | | Y | |
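An FCD test can be sketched with `unicodedata`, assuming the usual definition: decompose each character on its own and require that combining classes never decrease across character boundaries, except when the next character starts with class 0. The function names are hypothetical, and this is a sketch rather than ICU's table-driven check.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Combining classes of the first and last code point of ch's NFD form."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(decomposed[0]), unicodedata.combining(decomposed[-1])

def is_fcd(text):
    """True if decomposing each character in place yields canonical order."""
    previous_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and previous_trail > lead:
            return False
        previous_trail = trail
    return True

samples = {
    "A-ring": "\u00C5",
    "Angstrom": "\u212B",
    "A + cedilla + ring": "A\u0327\u030A",
    "A + ring + cedilla": "A\u030A\u0327",
    "A-ring + cedilla": "\u00C5\u0327",
}
for name, text in samples.items():
    print(name, is_fcd(text))
```

The last sample is in NFC yet fails the check, which is why the slide says "most NFC" rather than "all".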
Exceptional Values
• Normal weight storage
  • PPPPPPPP PPPPPPPP SSSSSSSS CCTTTTTT
  • P: 16b, S: 8b, C: 1b + 1b, T: 6b
• Special weight storage
  • FFFFTTTT dddddddd dddddddd dddddddd
  • F: 4b, T: 4b tag, d: 24 bit data
  • NOT_FOUND, EXPANSION, CONTRACTION, THAI, …
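The two 32-bit layouts can be sketched as bit packing. Assumptions are marked in the comments: the numeric tag values are invented (only the tag names come from the slide), the F nibble is taken to be all ones, and the two C bits are treated as opaque flags.

```python
SPECIAL_FLAG = 0xF  # assumed: all four F bits set marks a special value
TAGS = {"NOT_FOUND": 0, "EXPANSION": 1, "CONTRACTION": 2, "THAI": 3}  # hypothetical numbering

def pack_normal(primary, secondary, flags, tertiary):
    """16-bit primary, 8-bit secondary, two flag bits, 6-bit tertiary."""
    # Primaries starting with 0xF are excluded so they cannot look special.
    assert primary < 0xF000 and secondary < 0x100 and flags < 4 and tertiary < 0x40
    return (primary << 16) | (secondary << 8) | (flags << 6) | tertiary

def pack_special(tag, data):
    """4-bit marker, 4-bit tag, 24 bits of tag-specific data."""
    assert data < (1 << 24)
    return (SPECIAL_FLAG << 28) | (TAGS[tag] << 24) | data

def unpack(word):
    if word >> 28 == SPECIAL_FLAG:
        tag = next(name for name, n in TAGS.items() if n == (word >> 24) & 0xF)
        return ("special", tag, word & 0xFFFFFF)
    return ("normal", word >> 16, (word >> 8) & 0xFF, (word >> 6) & 0x3, word & 0x3F)

print(hex(pack_normal(0x06C3, 0x20, 0, 0x02)))
print(unpack(pack_special("EXPANSION", 0x123456)))
```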
Sort Key Compression
• Common weights are 1-byte
  • Primary, secondary, tertiary, quaternary
• Sequences are compressed
• UTF-16 Values for “Märk Davis” (22 bytes)
  • 004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
• Sort Key (L3, ignorable punctuation, 19 bytes)
  • 2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00
ICU 1.8 vs. Windows, glibc
• Full UCA
• Warning: perf. comparisons approx.
  • Depends on data, parameters, features
• glibc - UTF-8 locales
  • String comparison: comparable
    • ≈ 20% worse to 400% better
  • Sort keys: shorter
    • ≈ half as long
More Information
• ICU
  • http://oss.software.ibm.com/icu/
• Design Document
  • http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/
• These Slides
  • http://www.macchiato.com
• Q & A
Backup Slides
WARNING 5: Math. Relation
• S = {Unicode Strings}
• Reflexive
  • ∀a ∊ S: a ≤ a
• Antisymmetric
  • ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
• Transitive
  • ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
• Total
  • ∀a, b ∊ S: a ≤ b ∨ b ≤ a