Post on 10-Oct-2020
Sorting Unicode Tibetan using a Multi-Weight Collation Algorithm
Robert R. Chilton
Technical Director, The Asian Classics Input Project (ACIP)
acip@well.com
www.asianclassics.org
Why do we need a sorting algorithm for Unicode Tibetan?
• Tibetan script is encoded in Unicode and ISO/IEC 10646
• Full support of Tibetan within a computer environment also requires:
  – Keyboard(s) or other input methods
  – Rendering: readable, printable display of the encoded Tibetan-script data (fonts, etc.)
  – Collation rules for generating culturally acceptable sorting
Previous efforts to sort Tibetan data
• Utilized a single-weight sorting model
• Generally adequate for sorting native Tibetan orthographies within a specific application/environment
• Not able to robustly handle foreign transcriptions and other non-standard orthographies
• Designed for Romanized or font-encoded Tibetan, not Unicode Tibetan
• Treated Tibetan-script sorting in an exclusive, special-case fashion; such proprietary sorting methods are not likely to be widely implemented
Features of multi-weight sorting methodology
• Well-understood and widely implemented*
• Uses a collation element table to achieve culturally acceptable sorting
• Enables searching at different degrees of precision (e.g., case-sensitive searches)
*References:
• ISO/IEC 14651 (2001-04) Ed. 1.0, Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering.
• Unicode Technical Standard #10: Unicode Collation Algorithm (UCA).
Advantages for implementers and users of Unicode Tibetan
• Collation element table for Tibetan can "plug into" existing sort logic at the operating system level
• Robust searching and sorting of Tibetan data thus becomes automatically available to all compliant applications running within that operating system environment
• The same collation element table can be used across multiple platforms, resulting in consistent sorting of Tibetan data within different operating system environments
Moving from methodology to algorithm: an overview
• A look at dictionary sorting
• Understanding the multi-weight sorting ("international string ordering") model and extending this model to Tibetan
• Determining the collation elements needed for sorting Unicode Tibetan
Dictionary sorting of Tibetan
• General agreement (since 1900) on dictionary order of native orthographies
  – Main exception: treatment of wazur
• Lack of consensus on details of sorting foreign-origin orthographies
• Universal agreement that all entries must appear under one of the 30 letters (ཀ་ to ཨ་)
  – Example: the Tibetan transcription of Sanskrit "skandha" sorts under the collation slot for སྐ
How foreign words are sorted in dictionaries generally
• Words from foreign languages are sorted according to the sort rules of the dictionary's language (and not the sort rules of the origin language)
  – a Danish word beginning with å appears after letter Z in a Danish dictionary
  – the same word is sorted under letter A in an English dictionary
• Extending this convention to Tibetan, all words in a Tibetan dictionary, including foreign words, are sorted under 30 letters
• Extending this convention still further, all vowel signs are treated in terms of the 5 standard Tibetan vowels
  – implicit vowel ཨ་
  – 4 explicit vowel signs
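The Danish example can be illustrated with a small sketch. It assumes two simplified sort rules that are not part of the presentation: a Danish rule that ranks letters by their position in the Danish alphabet (æ, ø, å after z), and an English rule that folds accented letters onto their base letters. The function names and sample words are chosen only for illustration.

```python
import unicodedata

# The Danish alphabet: the 26 basic letters followed by æ, ø, å.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

def danish_key(word):
    """Rank each letter by its position in the Danish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [DANISH_ALPHABET.index(ch) for ch in composed]

def english_key(word):
    """Fold accented letters onto their base letters (å is treated as a)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return [ch for ch in decomposed if not unicodedata.combining(ch)]

words = ["zebra", "ål", "abe"]
print(sorted(words, key=danish_key))   # ['abe', 'zebra', 'ål']
print(sorted(words, key=english_key))  # ['abe', 'ål', 'zebra']
```

The same word lands in a different place depending on whose rules are applied, which is the convention the slide extends to foreign words in a Tibetan dictionary.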
The multi-weight sorting model for international string ordering
• Weights are generally assigned at three (or more) levels
• In Latin scripts these levels correspond to:
  1. alphabetic ordering = primary level
  2. diacritic ordering = secondary level
  3. case ordering = tertiary level
• (Additional levels may be used for tie-breaking between strings not distinguished at the first three levels)
Examples in Latin script
• role and rule differ at the primary (alphabetic) level
• role and rôle differ at the secondary (diacritic) level
• role and ROLE differ at the tertiary (case) level
• role and RÔLE differ at both the secondary level and the tertiary level
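The four Latin examples can be reproduced with a minimal sketch of a three-level key. This is a simplification for illustration, not the UCA or ISO/IEC 14651 procedure: it assumes the primary level is the lowercased base letter, the secondary level is the combining marks attached to each letter, and the tertiary level is a case flag. The names `sort_key` and `differing_levels` are hypothetical.

```python
import unicodedata

def sort_key(word):
    """Build a (primary, secondary, tertiary) key for a Latin-script word."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # A diacritic: attach it to the preceding base letter.
            if secondary:
                secondary[-1] += ch
        else:
            primary.append(ch.lower())     # alphabetic identity
            secondary.append("")           # no diacritic seen yet
            tertiary.append(ch.isupper())  # lowercase sorts before uppercase
    return tuple(primary), tuple(secondary), tuple(tertiary)

def differing_levels(a, b):
    """Name the levels at which two words differ."""
    names = ("primary", "secondary", "tertiary")
    return [n for n, x, y in zip(names, sort_key(a), sort_key(b)) if x != y]

print(differing_levels("role", "rule"))  # ['primary']
print(differing_levels("role", "rôle"))  # ['secondary']
print(differing_levels("role", "ROLE"))  # ['tertiary']
print(differing_levels("role", "RÔLE"))  # ['secondary', 'tertiary']
```

Because the key is a tuple of levels, a lower level is consulted only when all higher levels are equal, so sorting with `key=sort_key` places role, ROLE, rôle and RÔLE together ahead of rule.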
Extending the multi-weight model to Tibetan
• K- and M- differ at the primary level
• K- and Kı- differ at the secondary level
• K- and f- differ at the tertiary level
• K- and gÛ- differ at both the secondary level and the tertiary level
Determining collation elements for Unicode Tibetan: an overview
• Prescripts in Tibetan orthographies
• The Unicode model for encoding Tibetan script
• What is a collation element?
• 167 primary-weighted collation elements
• 9 secondary-weighted collation elements
Prescripts in Tibetan orthographies
• In English, all 26 letters always have primary weight; thus "at" sorts far away from "vat"
• In Tibetan, letters written before the radical letter have less than primary weight; thus ཀན, སྐན, བཀན, and བསྐན sort relatively near to each other, under letter ཀ་
• 11 possible prescripts (or "pre-radicals") might occur before the radical letter:
  – 5 prefix letters: ག ད བ མ འ
  – 3 head letters: ར ལ ས
  – 3 two-letter sequences of བ prefix followed by one of the head letters, i.e.: བར བལ བས
• Grammar rules define which radical letters can take which prescripts
  – For letter ཀ་ there are 7 possible prescripts: དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ
• No radical letter can take all 11 prescript forms (and some take none at all)
The Unicode encoding model for Tibetan script
• 193 distinct characters defined in Unicode
• The 30 letters (along with conjunct and reversed forms) are encoded twice: in nominal position and in orthographic-subjoined position
  – reflects the fact that the Tibetan script is written from top to bottom as well as from left to right
• Example of བརྟེན་ encoded as 6 characters:
  བ + ར + ◌ྟ + ◌ེ + ན + ་
  0F56 0F62 0F9F 0F7A 0F53 0F0B
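The six-character encoding can be inspected with Python's standard `unicodedata` module. The sketch below lists each code point of the example syllable with its Unicode character name; the helper name `describe` is chosen only for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The example syllable from the slide, written as explicit code points.
syllable = "\u0F56\u0F62\u0F9F\u0F7A\u0F53\u0F0B"

for code_point, name in describe(syllable):
    print(code_point, name)
# 0F56 TIBETAN LETTER BA
# 0F62 TIBETAN LETTER RA
# 0F9F TIBETAN SUBJOINED LETTER TA
# 0F7A TIBETAN VOWEL SIGN E
# 0F53 TIBETAN LETTER NA
# 0F0B TIBETAN MARK INTERSYLLABIC TSHEG
```

The third character shows the dual encoding at work: the letter TA appears here in its subjoined form (U+0F9F) rather than its nominal form (U+0F4F).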
What is a collation element?
• A collation element enables clustering of multiple Unicode characters such that they can be treated together as a single item for determining sort weights
• Single characters also function as collation elements
• The weights assigned to the collation elements determine their sort (or collation) order relative to one another
Defining Tibetan pre-scribed radical sequences as collation elements
• For letter ཀ་ we can define each of the 7 prescript + radical clusters (དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ) as a collation element (also called a "collation grapheme")
• We can then assign sort weights to these collation elements such that they sort in a culturally acceptable relative order
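A minimal sketch of how a string might be cut into collation elements, assuming a greedy longest-match rule over a tiny excerpt of the table: the bare letter KA, its 7 prescript + radical clusters, and the letter NA. The table excerpt and the function name `segment` are illustrative assumptions; a full implementation would also have to resolve the ambiguous prefix/suffix cases, which this sketch ignores.

```python
# Hypothetical excerpt of a collation element table (code points spelled out).
ELEMENTS = [
    "\u0F40",              # ཀ   bare radical KA
    "\u0F51\u0F40",        # དཀ  prefix DA + KA
    "\u0F56\u0F40",        # བཀ  prefix BA + KA
    "\u0F62\u0F90",        # རྐ  head RA + subjoined KA
    "\u0F63\u0F90",        # ལྐ  head LA + subjoined KA
    "\u0F66\u0F90",        # སྐ  head SA + subjoined KA
    "\u0F56\u0F62\u0F90",  # བརྐ prefix BA + head RA + subjoined KA
    "\u0F56\u0F66\u0F90",  # བསྐ prefix BA + head SA + subjoined KA
    "\u0F53",              # ན   NA (used here as a suffix letter)
]

def segment(text, elements=ELEMENTS):
    """Split text into collation elements, always taking the longest match."""
    longest_first = sorted(elements, key=len, reverse=True)
    result, i = [], 0
    while i < len(text):
        for element in longest_first:
            if text.startswith(element, i):
                result.append(element)
                i += len(element)
                break
        else:
            # Character not in the table: treat it as its own element.
            result.append(text[i])
            i += 1
    return result

# བསྐན is cut into the cluster བསྐ followed by the suffix ན.
print(segment("\u0F56\u0F66\u0F90\u0F53") == ["\u0F56\u0F66\u0F90", "\u0F53"])  # True
```

Trying longer entries first is what keeps བསྐ together as one element instead of splitting it into smaller pieces.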
Primary-weighted collation elements
• 30 nominal letters: ཀ་ to ཨ་, which may be either radical letters (མིང་གཞི་) or suffix letters
• 103 multi-letter pre-scribed radical forms
  – In many of these 103 forms the prescript is written at the head line (encoded as 1 or 2 nominal characters) and the radical letter is encoded as a subjoined character
  – In the example of བརྟེན་, the radical letter ཏ is encoded in subjoined position
Defining the 4 explicit vowels as collation elements
• As collation elements, suffix letters cannot be distinguished from bare radicals*
• Because a nominal letter serving as a radical letter carries the implicit vowel ཨ་, the 4 explicit vowels must be given primary weights, and must be weighted heavier than the nominal letters, since a radical letter marked with an explicit vowel will sort after the same letter not marked by an explicit vowel
* in a stateless implementation
Defining the 30 post-radical letters as collation elements
• Post-radicals = the 30 letters in subjoined position (when not functioning as the radical letter in a pre-scribed radical form)
  – Requires maximum-length substring matching
• Only 4 post-radical (subscribed) letters occur in native Tibetan orthographies: ◌ྭ ◌ྱ ◌ྲ ◌ླ
• Remaining 26 are required to treat non-native orthographies in a consistent manner
• Must be given primary weights, and heavier than the 4 explicit vowels
Relative order of the 167 primary-weighted collation elements
• First: 30 nominal letters and 103 multi-letter pre-scribed radical forms (= 133 collation elements)
  – given sort weights such that the 103 pre-scribed radical forms are interleaved as appropriate with the 30 nominal letters
• Next: 4 explicit vowels
• Next: 30 post-radical letters (i.e., in orthographic subscribed position)
• Thus, total collation slots at the primary-weight level: 133 + 4 + 30 = 167
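The three-tier ordering can be sketched with a small excerpt of the table. The code below assumes weights are simply consecutive integers assigned in list order, uses only a handful of the 167 elements, and skips any character that is not in the excerpt (such as the tsheg). The names `TIER_1`, `TIER_2`, `TIER_3` and `primary_key` are hypothetical.

```python
# Tier 1 (excerpt): nominal letters interleaved with pre-scribed radical forms.
TIER_1 = [
    "\u0F40", "\u0F51\u0F40", "\u0F56\u0F40", "\u0F62\u0F90",      # ཀ དཀ བཀ རྐ
    "\u0F63\u0F90", "\u0F66\u0F90", "\u0F56\u0F62\u0F90",          # ལྐ སྐ བརྐ
    "\u0F56\u0F66\u0F90",                                          # བསྐ
    "\u0F41", "\u0F58\u0F41", "\u0F60\u0F41",                      # ཁ མཁ འཁ
    "\u0F53",                                                      # ན
]
TIER_2 = ["\u0F72", "\u0F74", "\u0F7A", "\u0F7C"]  # the 4 explicit vowels
TIER_3 = ["\u0FB1", "\u0FB2"]                      # excerpt: subjoined YA, RA

ORDERED = TIER_1 + TIER_2 + TIER_3
WEIGHT = {element: rank + 1 for rank, element in enumerate(ORDERED)}
LONGEST_FIRST = sorted(ORDERED, key=len, reverse=True)

def primary_key(word):
    """Map a word to its sequence of primary weights (longest match first)."""
    key, i = [], 0
    while i < len(word):
        for element in LONGEST_FIRST:
            if word.startswith(element, i):
                key.append(WEIGHT[element])
                i += len(element)
                break
        else:
            i += 1  # no primary weight in this excerpt: ignore the character
    return tuple(key)

words = [
    "\u0F41",                    # ཁ
    "\u0F56\u0F66\u0F90\u0F53",  # བསྐན
    "\u0F40\u0FB1",              # ཀྱ
    "\u0F66\u0F90\u0F53",        # སྐན
    "\u0F40\u0F72",              # ཀི
    "\u0F56\u0F40\u0F53",        # བཀན
    "\u0F40\u0F53",              # ཀན
]
for word in sorted(words, key=primary_key):
    print(word, primary_key(word))
```

With these assumed weights the words come out as ཀན, ཀི, ཀྱ, བཀན, སྐན, བསྐན, ཁ: a suffix letter sorts before an explicit vowel, a vowel before a post-radical, and every form of ཀ before ཁ.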
Secondary-weighted collation elements
• These 9 have no primary weight
  – 4 combining marks: ◌྄ ◌ཱ ◌༹ ◌ཿ
  – 5 signs: ྅ ྈ ྉ ྊ ྋ
The remaining 120 Unicode Tibetan characters
• 30 + 4 + 30 + 9 = 73 of the 193 Unicode Tibetan characters have been treated above, leaving 120
• 59 of these 120 have a primary weight (in addition to a secondary and/or tertiary weight):
  – 19 can be decomposed into simple elements and thus need not be treated in the collation element table
  – 9 are variants (primary and tertiary weighted) of certain of the 30 nominal letters
  – 3 are variants (primary and tertiary weighted) of certain of the 4 explicit vowels
  – 8 are variants (primary and tertiary weighted) of certain of the 30 subscribed letters
  – 20 are the digits and half-digits
• The remaining 61 characters are punctuation marks and other symbols which generally have no impact on dictionary sort order and thus have no primary, secondary or tertiary weight
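The character counts on this slide can be checked with a few lines of arithmetic. All figures are taken from the slides; only the dictionary keys and variable names are invented for the sketch.

```python
TOTAL_TIBETAN_CHARACTERS = 193  # as counted in the presentation

# Characters already covered by the collation elements described earlier.
treated = {
    "nominal letters": 30,
    "explicit vowels": 4,
    "post-radical letters": 30,
    "secondary-weighted marks and signs": 9,
}

# Remaining characters that still carry a primary weight.
primary_weighted_rest = {
    "decomposable characters": 19,
    "nominal letter variants": 9,
    "vowel variants": 3,
    "subjoined letter variants": 8,
    "digits and half-digits": 20,
}

unweighted = 61  # punctuation marks and other symbols

treated_total = sum(treated.values())
remaining = TOTAL_TIBETAN_CHARACTERS - treated_total
primary_rest_total = sum(primary_weighted_rest.values())

print(treated_total)                    # 73
print(remaining)                        # 120
print(primary_rest_total)               # 59
print(primary_rest_total + unweighted)  # 120
```

The last two lines confirm that the 59 primary-weighted characters and the 61 unweighted ones together account for all 120 remaining characters.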
Appendices
• The Unicode (and ISO/IEC 10646) character-encoding chart for Tibetan
  – highlighting characters in example: བརྟེན་
• An ordered list of collation elements for Unicode Tibetan
[Code charts for Tibetan, 0F00 to 0F7F and 0F80 to 0FFF, from The Unicode Standard 3.0, pp. 435-436. Copyright © 1991-2000, Unicode, Inc. All rights reserved.]
An Ordered List of Collation Elements for Unicode Tibetan
[*Note: a comprehensive Collation Element Table for Tibetan script will include additional collation elements, such as , , ྋྙ, ྉྤ, ྉྥ, beyond those listed here.]
A. Primary Weighted Collation Elements
A.1. The 133 radical-initial sequences (also covers the suffix letters):
ཀ དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ ཁ མཁ འཁ ག དག བག མག འག རྒ ལྒ སྒ བརྒ བསྒ ང དང མང རྔ ལྔ སྔ བརྔ བསྔ ཅ གཅ བཅ ལྕ བལྕ ཆ མཆ འཆ ཇ མཇ འཇ རྗ ལྗ བརྗ ཉ གཉ མཉ རྙ སྙ བརྙ བསྙ ཏ གཏ བཏ རྟ ལྟ སྟ བརྟ བལྟ བསྟ ཐ མཐ འཐ ད གད བད མད འད རྡ ལྡ སྡ བརྡ བལྡ བསྡ ན གན མན རྣ སྣ བརྣ བསྣ པ དཔ ལྤ སྤ ཕ འཕ བ དབ འབ རྦ ལྦ སྦ མ དམ རྨ སྨ ཙ གཙ བཙ རྩ སྩ བརྩ བསྩ ཚ མཚ འཚ ཛ མཛ འཛ རྫ བརྫ ཝ ཞ གཞ བཞ ཟ གཟ བཟ འ ཡ གཡ ར བར(seen in བརླ) ལ ཤ གཤ བཤ ས གས བས ཧ ལྷ ཨ
Key:
• Black for the 30 nominal letters. Note that whereas any of these 30 can serve as a bare radical, 10 of these can also appear in suffix position in native orthographies.
• Blue for (relatively) unambiguous cases of prefixed and/or superscribed radical letters. Note that certain unavoidable ambiguities arise between native orthographies and transcriptions from foreign languages.
• Red for ambiguous cases where a 3rd codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a root letter followed by a suffix). Note that certain cases (in Dzongkha) require a 4th codepoint in order to distinguish a case of a prefixed radical letter from a case of a suffix letter followed by a secondary syllable that involves a vowel (i.e., ནི or མོ).
• Magenta for an ambiguous case (in Dzongkha) where a 3rd (or possibly 4th) codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a suffix letter ད followed by a secondary syllable པ or པོ).
A.2. The 4 explicit vowels:
◌ི ◌ུ ◌ེ ◌ོ
A.3. The 30 post-radicals:
◌ྐ ◌ྑ ◌ྒ ◌ྔ ◌ྕ ◌ྖ ◌ྗ ◌ྙ ◌ྟ ◌ྠ ◌ྡ ◌ྣ ◌ྤ ◌ྥ ◌ྦ ◌ྨ ◌ྩ ◌ྪ ◌ྫ ◌ྭ ◌ྮ ◌ྯ ◌ཱ ◌ྱ ◌ྲ ◌ླ ◌ྴ ◌ྶ ◌ྷ ◌ྸ
B. Secondary Weighted Collation Elements (have no primary weight)
B.1. The 4 combining marks:
◌྄ ◌ཱ ◌༹ ◌ཿ
B.2. The 5 signs (used in transliteration):
྅ ྈ ྉ ྊ ྋ
C. The 120 Remaining Unicode Tibetan Characters
The characters listed above (in items A and B) account for 73 of the 193 Tibetan characters defined in Unicode. This leaves 120 characters, of which 19 can be decomposed into simple elements and thus need not be treated in the collation element table. There is also no need to assign primary, secondary or tertiary weights to the 61 characters that function as punctuation marks and other symbols, since these generally have no impact on dictionary sort order. [Note that "Syllable OM" at U+0F00 is here treated as an ornamental symbol rather than as having any lexical value, because no canonical or compatibility decomposition is specified for this character.] The digits and half-digits account for 20 further characters. The remaining 20 characters are variants (i.e., having both primary and tertiary weights) of certain of the 30 nominal letters, 4 vowels, and 30 subjoined post-radical letters listed previously.
9 Nominal letter variants:
ཊ ཋ ཌ ཎ ◌ཾ ◌ྂ ◌ྃ ར(fixed form) ཥ
3 Vowel variants:
◌ྀ ◌ཻ ◌ཽ
8 Subjoined letter variants:
◌ྚ ◌ྛ ◌ྜ ◌ྞ ◌ྺ ◌ྻ ◌ྼ ◌ྵ