bits of unicode

Upload: plakalscribd

Post on 10-Apr-2018

TRANSCRIPT

  • 8/8/2019 Bits of Unicode

    1/36

    Bits of Unicode

    Data structures for a large character set

    Mark Davis

    IBM Emerging Technologies

    Caution

    "Character" is ambiguous; it can mean any of:

    Graphemes: x̣ (also ch, ...)

    Code points: 0078 0323

    Code units: 0078 0323 (or UTF-8: 78 CC A3)

    For programmers

    Unicode associates codepoints (or sequences of codepoints) with properties

    See UTR#17
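
    The three counts on this slide can be checked directly in Java. The following is a minimal sketch; the class and method names (CodeUnitDemo, hexUtf8) are made up for illustration and are not from the slides.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitDemo {
    // Hex dump of the UTF-8 encoding of s, e.g. "78 CC A3".
    static String hexUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "x\u0323";  // x + COMBINING DOT BELOW: one grapheme
        System.out.println(s.length());                       // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // code points
        System.out.println(hexUtf8(s));                       // UTF-8 code units
    }
}
```

    One user-visible grapheme here is two code points, two UTF-16 code units, and three UTF-8 code units.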

    The Problem

    Programs often have to do lookups

    Look up properties by codepoint

    Map codepoints to values

    Test codepoints for inclusion in set

    e.g. value == true/false

    Easy with 256 codepoints: just use array

    Size Matters

    Not so easy with Unicode!

    Unicode 3.0: subset (except PUA)

    up to FFFF (hex) = 65,535 (decimal)

    Unicode 3.1: full range

    up to 10FFFF (hex) = 1,114,111 (decimal)

    Array Lookup

    With ASCII

    Simple

    Fast

    Compact

    codepoint → bit: 32 bytes

    codepoint → short: ½ K

    With Unicode

    Simple

    Fast

    Huge (esp. v3.1)

    codepoint → bit: 136 K

    codepoint → short: 2.2 M
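
    The sizes above follow from simple arithmetic: one bit or one 16-bit short per codepoint. A small sketch, assuming 256 codepoints for the small case and the full 0..10FFFF range for Unicode (the class and method names are illustrative only):

```java
final class ArraySizeDemo {
    static final int SMALL_COUNT = 0x100;       // 256 codepoints
    static final int UNICODE_COUNT = 0x110000;  // codepoints 0..10FFFF

    // Bytes needed for one bit per codepoint.
    static long bitArrayBytes(int count) { return (count + 7) / 8; }

    // Bytes needed for one 16-bit short per codepoint.
    static long shortArrayBytes(int count) { return 2L * count; }

    public static void main(String[] args) {
        System.out.println(bitArrayBytes(SMALL_COUNT));      // bytes
        System.out.println(shortArrayBytes(SMALL_COUNT));    // bytes
        System.out.println(bitArrayBytes(UNICODE_COUNT) / 1024);    // in K
        System.out.println(shortArrayBytes(UNICODE_COUNT) / 1024);  // in K
    }
}
```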

    Further complications

    Mappings, tests, properties often must be for sequences of codepoints.

    Human languages don't just use single codepoints.

    "ch" in Spanish, Slovak; etc.

    First step:

    Avoidance

    Properties from libraries often suffice

    Test for (Character.getType(c) == Nd) instead of long list of codepoints

    Easier

    Automatically updated with new versions

    Data structures from libraries often suffice

    Java Hashtable

    ICU (Java or C++) CompactArray

    JavaScript properties

    Consult http://www.unicode.org
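
    In Java, the "Nd" test on this slide is spelled with the constant Character.DECIMAL_DIGIT_NUMBER. A minimal sketch (the wrapper class DigitCheck is hypothetical; the java.lang.Character calls are standard):

```java
final class DigitCheck {
    // True if the codepoint has general category Nd (decimal digit number).
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit('7'));     // ASCII digit
        System.out.println(isDecimalDigit(0x0663));  // ARABIC-INDIC DIGIT THREE
        System.out.println(isDecimalDigit('x'));     // a letter
    }
}
```

    No list of digit codepoints appears in the program, so the test picks up new digits whenever the runtime's Unicode data is updated.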

    Data structures: criteria

    Speed

    Read (static)

    Write (dynamic)

    Startup

    Memory footprint

    RAM

    Disk

    Multi-threading

    Hashtables

    Advantages

    Easy to use out-of-the-box

    Reasonably fast

    General

    Disadvantages

    High overhead

    Discrete (no range lookup)

    Much slower than array lookup

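    Overhead: char1 → char2

    [Figure: one hashtable entry for the mapping char1 → char2. Labels: the entry's hash, key, value, and next fields plus object overhead; a key object holding char1 plus overhead; a value object holding char2 plus overhead.]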

    Trie

    Advantages

    Nearly as fast as array lookup

    Much smaller than arrays or Hashtables

    Take advantage of repetition

    Disadvantages

    Not suited for rapidly changing data

    Best for static, preformed data

    Trie structure

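    [Figure: the codepoint is split into two bit fields, M1 and M2. M1 selects an entry in the Index array, which points to a block in the Data array; M2 selects the value within that block.]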

    Trie code

    5 Operations

    Shift, Lookup, Mask, Add, Lookup

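    v = data[index[c >> S1] + (c & M2)]

    [Figure: the codepoint's bits, with the high field M1 obtained by shifting right by S1 and the low field M2 obtained by masking.]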
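
    The five-operation lookup can be tried out with a small builder that stores each distinct block of values only once. This is a sketch under assumptions: the block size (S1 = 8), the byte value type, and the class name SimpleTrie are chosen for illustration and are not taken from ICU.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SimpleTrie {
    static final int S1 = 8;              // assumed block size: 256 codepoints
    static final int M2 = (1 << S1) - 1;  // mask for the low S1 bits
    final int[] index;                    // block number -> offset into data
    final byte[] data;                    // distinct blocks, stored once each

    // flat holds one value per codepoint; its length must be a multiple of 256.
    SimpleTrie(byte[] flat) {
        int blockSize = 1 << S1;
        index = new int[flat.length / blockSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, Integer> seen = new HashMap<>();
        for (int b = 0; b < index.length; b++) {
            byte[] block = Arrays.copyOfRange(flat, b * blockSize, (b + 1) * blockSize);
            String key = Arrays.toString(block);  // simple, not fast
            Integer offset = seen.get(key);
            if (offset == null) {                 // first time this block is seen
                offset = out.size();
                out.write(block, 0, block.length);
                seen.put(key, offset);
            }
            index[b] = offset;
        }
        data = out.toByteArray();
    }

    // Shift, lookup, mask, add, lookup.
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

    Filling flat with (byte) Character.getType(c) for every codepoint and building the trie should give a data array much smaller than the flat one, because large ranges of codepoints share identical blocks.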

    Trie: double indexed

    Double, for more compaction:

    Slightly slower than single index

    Smaller chunks of data, so more compaction

    Trie: double indexed

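    [Figure: the codepoint is split into three bit fields, M1, M2, and M3. M1 selects an entry in Index1, which points to a block in Index2; M2 selects an entry in that block, which points to a block in Data; M3 selects the value within it.]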

    Trie code: double indexed

    b1 = index1[ c >> S1 ]

    b2 = index2[ b1 + ((c >> S2) & M2)]

    v = data[ b2 + (c & M3) ]

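    [Figure: the codepoint's bits split into the fields M1, M2, M3 from high to low; S1 is the shift that isolates M1 and S2 is the shift that brings M2 to the low end.]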
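
    The double-indexed lookup can be sketched the same way, compacting first the data blocks and then the second-level index blocks. The bit split (S1 = 10, S2 = 4), the int value type, and all names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TwoStageTrie {
    static final int S1 = 10, S2 = 4;            // assumed split: 11 + 6 + 4 bits
    static final int M2 = (1 << (S1 - S2)) - 1;  // mask for the middle field
    static final int M3 = (1 << S2) - 1;         // mask for the low field
    final int[] index1, index2, data;

    // flat holds one value per codepoint; length must be a multiple of 1 << S1.
    TwoStageTrie(int[] flat) {
        List<Integer> dataOut = new ArrayList<>();
        List<Integer> index2Out = new ArrayList<>();
        int[] dataOffsets = compact(flat, 1 << S2, dataOut);
        index1 = compact(dataOffsets, 1 << (S1 - S2), index2Out);
        index2 = toArray(index2Out);
        data = toArray(dataOut);
    }

    // Cuts src into blocks, appends each distinct block to out once,
    // and returns the offset in out at which each block can be found.
    static int[] compact(int[] src, int blockSize, List<Integer> out) {
        Map<List<Integer>, Integer> seen = new HashMap<>();
        int[] offsets = new int[src.length / blockSize];
        for (int b = 0; b < offsets.length; b++) {
            List<Integer> block = new ArrayList<>();
            for (int i = 0; i < blockSize; i++) block.add(src[b * blockSize + i]);
            Integer offset = seen.get(block);
            if (offset == null) {
                offset = out.size();
                out.addAll(block);
                seen.put(block, offset);
            }
            offsets[b] = offset;
        }
        return offsets;
    }

    static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

    The smaller data blocks (16 values here) are more likely to repeat than 256-value blocks, which is where the extra compaction comes from; the price is one more array access per lookup.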

    Inversion List

    Compaction of set of codepoints

    Advantages

    Simple

    Very compact

    Faster write than trie

    Very fast boolean operations

    Disadvantages

    Slower read than trie or hashtable

    Inversion List Structure

    Structure

    Index (optional)

    List of codepoints in ascending order

    Example Set

    [ 0020-0061, 0135, 19A3-201B ]

    Index   Value   Meaning
    0:      0020    in
    1:      0062    out
    2:      0135    in
    3:      0136    out
    4:      19A3    in
    5:      201C    out

    Inversion List Example

    Find smallest i such that c < data[i]

    If no i, i = length

    Then c ∈ List ⇔ odd(i)

    Examples:

    In: 0023, 0135

    Out: 001A, 0136, A357

    Data: 0: 0020   1: 0062   2: 0135   3: 0136   4: 19A3   5: 201C
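
    The membership rule can be written as a plain binary search followed by a parity test. A minimal sketch using the example set from the slides; the class and method names are illustrative.

```java
final class InversionList {
    // The example set [0020-0061, 0135, 19A3-201B] as range boundaries.
    static final int[] EXAMPLE = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};

    // Smallest i such that c < data[i], or data.length if there is none.
    static int findIndex(int[] data, int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    // c is in the set exactly when the index found is odd.
    static boolean contains(int[] data, int c) {
        return (findIndex(data, c) & 1) == 1;
    }
}
```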

    Inversion List Operations

    Fast Boolean Operations

    Example: Negation

    Before:  0: 0020   1: 0062   2: 0135   3: 0136   4: 19A3   5: 201C

    After:   0: 0000   1: 0020   2: 0062   3: 0135   4: 0136   5: 19A3   6: 201C
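
    Negation only has to flip the in/out parity of every boundary, which can be done by adding or removing a leading 0000. A sketch of that rule (the class name is made up for illustration):

```java
import java.util.Arrays;

final class InversionListNegation {
    // Returns the complement of the set described by data.
    static int[] complement(int[] data) {
        if (data.length > 0 && data[0] == 0) {
            // Set started at 0000: drop that boundary.
            return Arrays.copyOfRange(data, 1, data.length);
        }
        // Otherwise prepend 0000; new arrays are zero-filled, so result[0] == 0.
        int[] result = new int[data.length + 1];
        System.arraycopy(data, 0, result, 1, data.length);
        return result;
    }
}
```

    Applying complement twice gives back the original list, and no per-codepoint work is involved.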

    Inversion List: Binary Search

    from Programming Pearls

    Completely unrolled, precalculated parameters

    int index = startIndex;

    if (x >= data[auxStart]) {
        index += auxStart;
    }

    switch (power) {
        case 21: if (x < data[t = index-0x10000])
            index = t;
            // falls through
        case 20: if (x < data[t = index-0x8000])
            index = t;
            // falls through
        // ... remaining cases follow the same pattern with smaller offsets
    }
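
    The same idea can be written as a loop over power-of-two steps, which is what the unrolled switch spells out case by case. This sketch is a loop variant, not the code from the slide: it probes upward rather than downward, and the names are illustrative.

```java
final class PowerOfTwoSearch {
    // Returns the smallest i such that x < data[i], or data.length if none.
    // data must be sorted in ascending order.
    static int search(int[] data, int x) {
        int n = data.length;
        if (n == 0) return 0;
        int step = Integer.highestOneBit(n);  // largest power of two <= n
        // i is the last index known to hold a value <= x (-1 if none yet).
        // The first probe also handles lengths that are not a power of two.
        int i = (data[step - 1] <= x) ? n - step : -1;
        for (step >>= 1; step > 0; step >>= 1) {
            if (data[i + step] <= x) i += step;
        }
        return i + 1;
    }
}
```

    Because the step sizes depend only on the array length, they can be computed ahead of time, and the loop can be replaced by one fall-through case per step.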

    Inversion Map

    Inversion List plus Associated Values

    Lookup index just as in Inversion List

    Take corresponding value

    Keys:    0: 0020   1: 0062   2: 0135   3: 0136   4: 19A3   5: 201C

    Values:  0: 0   1: 5   2: 3   3: 9   4: 8   5: 3   6: 0
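
    A lookup in an inversion map might look like the sketch below. The keys are the boundaries from the example; the values array (one entry more than the keys) is an illustrative reading of the example, and the class name is made up.

```java
import java.util.Arrays;

final class InversionMap {
    static final int[] KEYS = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};
    // VALUES[i] applies to codepoints whose lookup index is i (assumed sample data).
    static final int[] VALUES = {0, 5, 3, 9, 8, 3, 0};

    static int get(int c) {
        // binarySearch returns the position if c is a key,
        // otherwise -(insertionPoint) - 1.
        int i = Arrays.binarySearch(KEYS, c);
        // Convert to "smallest i such that c < KEYS[i]".
        i = (i >= 0) ? i + 1 : -i - 1;
        return VALUES[i];
    }
}
```

    Each run of codepoints with the same value costs one key and one value, however long the run is.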

    Key → String Value

    Problem

    Often almost all values are 1 codepoint

    But, must map to strings in a few cases

    Don't want overhead for strings always

    Solution

    Exception values indicate extra processing

    Can use same solution for UTF-16 code units

    Example

    Get a character ch

    Find its value v

    If v is in [D800..E000], may be string

    check v2 = valueException[v - D800]

    if v2 not null, process it, continue

    Process v
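
    The steps above can be sketched with an uppercase mapping in which a single character needs a string result. Everything here is hypothetical sample data: rawValue stands in for the real table lookup, and VALUE_EXCEPTION holds one made-up exception entry.

```java
final class ExceptionValueDemo {
    static final int EXC_START = 0xD800, EXC_LIMIT = 0xE000;
    // Exception strings, indexed by (v - D800). Sample data only.
    static final String[] VALUE_EXCEPTION = {"SS"};

    // Stand-in for the table lookup: one 16-bit value per character.
    static char rawValue(char ch) {
        if (ch == '\u00DF') return (char) (EXC_START + 0);  // sharp s -> exception 0
        return Character.toUpperCase(ch);
    }

    static String map(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char v = rawValue(in.charAt(i));
            if (v >= EXC_START && v < EXC_LIMIT) {  // may be a string
                int k = v - EXC_START;
                String v2 = (k < VALUE_EXCEPTION.length) ? VALUE_EXCEPTION[k] : null;
                if (v2 != null) {
                    out.append(v2);
                    continue;
                }
            }
            out.append(v);  // ordinary single-unit value
        }
        return out.toString();
    }
}
```

    The common case pays only for one range check; the string table is touched only for the few exceptional values.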

    String Key → Value

    Problem

    Often almost all keys are 1 codepoint

    Must have string keys in a few cases

    Don't want overhead for strings always

    Solution

    Exception values indicate possible follow-on codepoints

    Can use same solution for UTF-16 code units

    Use key closure!

    Closure

    If (X + Y) is a key, then X is a key

    Before          After
    s    → x        s    → x
    sh   → y        sh   → y
                    shc  → yw
    shch → z        shch → z
    c    → w        c    → w

    Why Closure?

    Input: s h c h a

    s → x

    sh → y

    shc → yw

    shch → z

    shcha → not found, use last
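
    With a closed key set, a matcher can extend the key one unit at a time and stop at the first miss. The sketch below uses a HashMap for brevity (the slides suggest exception values in a trie instead); the table is the "After" table, and the names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

final class ClosureDemo {
    static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("s", "x");
        TABLE.put("sh", "y");
        TABLE.put("shc", "yw");  // added by closure: sh -> y, then c -> w
        TABLE.put("shch", "z");
        TABLE.put("c", "w");
    }

    static String map(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            String last = null;
            int lastEnd = i;
            for (int end = i + 1; end <= in.length(); end++) {
                String v = TABLE.get(in.substring(i, end));
                if (v == null) break;  // closure: no longer key can match either
                last = v;
                lastEnd = end;
            }
            if (last == null) {        // unmapped: copy through unchanged
                out.append(in.charAt(i));
                i++;
            } else {                   // use the last value found
                out.append(last);
                i = lastEnd;
            }
        }
        return out.toString();
    }
}
```

    Without the shc entry, the matcher would have to back up after failing on "shca"; with it, the value for the longest match seen so far is always already in hand.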

    Bitpacking

    Squeeze information into value

    Example: Character Properties

    category: 5 bits

    bidi: 4 bits (+ exceptions)

    canonical category: 6 bits + expansion

    compressCanon = (bits >> SHIFT) & MASK;

    canon = expansionArray[compressCanon];
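
    A sketch of the packing described above, using the field widths from the slide. The bit positions, the contents of EXPANSION_ARRAY, and all names are assumptions made for illustration.

```java
final class BitpackDemo {
    static final int CATEGORY_MASK = 0x1F;                // 5 bits, at bit 0
    static final int BIDI_SHIFT = 5, BIDI_MASK = 0xF;     // 4 bits, at bit 5
    static final int CANON_SHIFT = 9, CANON_MASK = 0x3F;  // 6 bits, at bit 9
    // Maps the 6-bit compressed index back to the full value (sample data).
    static final int[] EXPANSION_ARRAY = {0, 1, 7, 9, 202, 220, 230};

    static int pack(int category, int bidi, int compressCanon) {
        return category | (bidi << BIDI_SHIFT) | (compressCanon << CANON_SHIFT);
    }

    static int category(int bits) { return bits & CATEGORY_MASK; }

    static int bidi(int bits) { return (bits >> BIDI_SHIFT) & BIDI_MASK; }

    static int canon(int bits) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return EXPANSION_ARRAY[compressCanon];
    }
}
```

    All three properties fit in 15 bits, so a single 16-bit value per codepoint is enough.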

    Statetables

    Classic:

    do {
        // ch is the next character of the input
        entry = stateTable[state][ch];
        state = entry.state;
        doSomethingWith(entry.action);
    } while (state >= 0);

    Statetables

    Unicode:

    do {
        // ch is the next character of the input
        type = trie[ch];
        entry = stateTable[state][type];
        state = entry.state;
        doSomethingWith(entry.action);
    } while (state >= 0);

    Also, String Key → Value
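
    A small runnable version of the type-driven state table is sketched below. It is simplified: the table holds only next states (no actions), and typeOf uses java.lang.Character in place of a trie lookup. The states, types, and names are invented for this example.

```java
final class StateTableDemo {
    static final int LETTER = 0, DIGIT = 1, OTHER = 2;

    // Stands in for type = trie[ch]: collapses all of Unicode into 3 columns.
    static int typeOf(int cp) {
        if (Character.isLetter(cp)) return LETTER;
        if (Character.isDigit(cp)) return DIGIT;
        return OTHER;
    }

    // NEXT[state][type]; a negative entry means stop.
    static final int[][] NEXT = {
        /* 0: start         */ {1, -1, -1},
        /* 1: in identifier */ {1,  1, -1},
    };

    // Length (in UTF-16 code units) of the identifier at the start of s:
    // a letter followed by any number of letters or digits.
    static int identifierLength(String s) {
        int state = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            state = NEXT[state][typeOf(cp)];
            if (state < 0) break;
            i += Character.charCount(cp);
        }
        return i;
    }
}
```

    The table has three columns instead of one per codepoint, which is what keeps it small enough to be practical for Unicode.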

    Sample Data Structures: ICU

    Trie: CompactArray

    Customized for each datatype

    Automatic expansion

    Compact after setting

    Character Properties

    use CompactArray, Bitpacking

    Inversion List: UnicodeSet

    Boolean Operations

    Sample Usage #1: ICU

    Collation

    Trie lookup

    Expanding character: Key → String Value

    Contracting character: String Key → Value

    Break Iterators

    For grapheme, word, line, sentence break

    Statetable

    Sample Usage #2: ICU

    Transliteration

    Requires

    Mapping codepoints in context to others

    Rearranging codepoints

    Controlling the choice of mapping

    Character Properties

    Inversion List

    Exception values

    Sample Usage #3: ICU

    Character Conversion

    From Unicode to bytes

    Trie

    From bytes to Unicode

    Arrays for simple maps

    Statetables for complex maps

    recognizes valid / invalid mappings

    provides compaction

    Complications

    Invalid vs. Valid mapped vs. Valid unmapped

    Fallbacks

    References

    Unicode Open Source: ICU

    http://oss.software.ibm.com/icu

    ICU4j: Java API

    ICU4c: C and C++ APIs

    Other references: see Mark's website: http://www.macchiato.com

    Q & A