Urdu Character Set and Collating Sequence
![Page 1: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/1.jpg)
Urdu Character Set and Collating Sequence
Sarmad Hussain
مرکزتحقیقاتِ اردو
Center for Research in Urdu Language Processing
FAST National University of Computer and Emerging Sciences
![Page 2: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/2.jpg)
Purpose of Presentation
► Indicate the “state of affairs”: character set, collating sequence
► Show what has been done regarding the standardization
► Identify what needs to be done
![Page 3: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/3.jpg)
Sources
► Data from four dictionaries of Urdu
1. Feroz-ul-Lughat Jame (فیروزاللغات جامع), Ferozsons, Lahore (FLJ)
2. Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)
3. Farhang-e-Talaffuz (فرہنگِ تلفظ), National Language Authority (مقتدرہ قومی زبان), Islamabad (FT)
4. Jadeed Urdu Lughat (جدید اردو لغت), National Language Authority (مقتدرہ قومی زبان), Islamabad (JUL)
![Page 4: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/4.jpg)
Character Set
► Alphabet
► Harakat (Aerab)
► Other Symbols
![Page 5: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/5.jpg)
“Typical” Alphabet
آ ا ب پ ت ٹ ث ج چ ح خ
د ڈ ذ ر ڑ ز ژ س ش ص ض
ط ظ ع غ ف ق ک گ
ل م ن و ہ ء ی ے
- Urdu Qaida (اردو قاعدہ), Ferozsons, Lahore
![Page 6: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/6.jpg)
“Familiar” Harakaat (Aerab)
Jazm: دْ
Zabar: دَ
Zer: دِ
Pesh: دُ
Khari zabar: دٰ
Khari zer: دٖ
Ulta pesh: دٗ
Do zabar: دً
Do zer: دٍ
Do pesh: دٌ
Tashdeed: دّ
Noon ghunna: ن
![Page 7: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/7.jpg)
“Common” Other Symbols
Numbers: 0 ۰, 1 ١, 2 ٢, 3 ٣, 4 ۴, 5 ۵, 6 ٦, 7 ۷, 8 ٨, 9 ٩
Punctuation: ؟ ؛ ٬ -
Honorifics
Other Symbols
ס
![Page 8: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/8.jpg)
Urdu Alphabet: State of Affairs
FT, JUL:
آ ا ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ء ی ے
FLJ, STCD:
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے
![Page 9: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/9.jpg)
Current GoP Standard: UZT 1.01
![Page 10: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/10.jpg)
Logical Sections of UZT 1.01
► Alphabet (80 – 122)
► Aerab/diacritics/harakat (66 – 79, 123 – 126)
► Other characters: punctuation and arithmetic symbols (32 – 47, 58 – 65); digits (48 – 57); special symbols (160 – 176, 192 – 199); miscellaneous
► Control characters (0 – 31, 127)
► Reserved control space (128 – 159, 255)
► Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
► Vendor area (208 – 239)
► Toggle character (254)
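The section ranges listed on this slide can be expressed as a small lookup table. The following is a minimal sketch, not part of the UZT 1.01 specification: the names `UZT_SECTIONS` and `uzt_section` are hypothetical, and the only data used are the decimal ranges given above, which together cover 0 to 255.

```python
# Decimal code ranges transcribed from the slide; section names are the slide's labels.
# UZT_SECTIONS and uzt_section are illustrative names, not part of any standard API.
UZT_SECTIONS = [
    ("Control characters", [(0, 31), (127, 127)]),
    ("Punctuation and arithmetic symbols", [(32, 47), (58, 65)]),
    ("Digits", [(48, 57)]),
    ("Aerab/diacritics/harakat", [(66, 79), (123, 126)]),
    ("Alphabet", [(80, 122)]),
    ("Reserved control space", [(128, 159), (255, 255)]),
    ("Special symbols", [(160, 176), (192, 199)]),
    ("Reserved expansion space", [(177, 191), (200, 207), (240, 253)]),
    ("Vendor area", [(208, 239)]),
    ("Toggle character", [(254, 254)]),
]


def uzt_section(code: int) -> str:
    """Return the logical section whose range contains `code` (0-255)."""
    if not 0 <= code <= 255:
        raise ValueError("the ranges on the slide only cover 0-255")
    for name, ranges in UZT_SECTIONS:
        if any(low <= code <= high for low, high in ranges):
            return name
    raise LookupError(f"code {code} is not assigned to any section")


if __name__ == "__main__":
    for code in (10, 50, 70, 100, 254):
        print(code, uzt_section(code))
```

Keeping the alphabet in one contiguous range (80 to 122) and the aerab in the ranges on either side of it means a simple range test is enough to classify a byte before collation.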
![Page 11: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/11.jpg)
Conclusions: Standard Urdu Character Set
► No general agreement on the Urdu character set among dictionary publishers
► Standard character set defined by the National Language Authority (NLA): not well-publicized, not widely adopted
► GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
► Will soon be fully represented in Unicode/ISO 10646
![Page 12: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/12.jpg)
Urdu Collating Sequence: State of Affairs
FT, JUL:
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ہ ء ی ے
FLJFLJ ا ا آ خ آ ح چ ج ث ٹ ت پ خ ب ح چ ج ث ٹ ت پ ژ ب ز ڑ ر ذ ڈ ژ د ز ڑ ر ذ ڈ ض د ص ش ض س ص ش س
غ ع ظ غ ط ع ظ م ط ل گ ک ق م ف ل گ ک ق ن ف ن ں ھ و و ں ھ ہ ے ء ء ہ ے ی ی
STCDSTCD ا ا آ خ آ ح چ ج ث ٹ ت پ خ ب ح چ ج ث ٹ ت پ ژ ب ز ڑ ر ذ ڈ ژ د ز ڑ ر ذ ڈ ض د ص ش ض س ص ش س
غ ع ظ غ ط ع ظ م ط ل گ ک ق م ف ل گ ک ق ں ف ں ن ے ء ء ہہ ھھ و و ن ے ی ی
![Page 13: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/13.jpg)
آ ا Variation
► STCD and FLJ: آب، آپ، اب، ایوان
► FT and JUL: اب، ایوان، آب، آپ
![Page 14: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/14.jpg)
ں ن Variation
► FLJ, FT & STCD: ماں، مان
► JUL: مان، ماں
![Page 15: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/15.jpg)
ھ ہ Variation
► FLJ: باپ، بہن، بہنگی، بھابی، بھنگی، بیٹا
► STCD: باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا
► FT & JUL: باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی
بانو، بانھ، بانی
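The difference between the FLJ and the FT/JUL orderings above can be reproduced with two sort keys. This is only a sketch under stated assumptions: `BASE` is a hypothetical partial alphabet covering just the letters in the example words, FLJ is modelled as treating ھ as an independent letter after ہ, and FT/JUL as treating a consonant followed by ھ as a single letter placed right after its base consonant.

```python
# Assumed partial alphabet (only the letters used in the example words).
BASE = ["ا", "ب", "پ", "ٹ", "گ", "ن", "ہ", "ھ", "ی"]
DO_CHASHMI_HE = "ھ"

WORDS = ["بھنگی", "بیٹا", "بہن", "باپ", "بھابی", "بہنگی"]


def key_flj(word):
    """FLJ-style key (assumption): every character is its own letter."""
    return [BASE.index(ch) for ch in word]


def key_ft_jul(word):
    """FT/JUL-style key (assumption): consonant + ھ is one letter after its base."""
    key = []
    i = 0
    while i < len(word):
        rank = 2 * BASE.index(word[i])
        if i + 1 < len(word) and word[i + 1] == DO_CHASHMI_HE:
            key.append(rank + 1)  # aspirated form sorts just after the base letter
            i += 2
        else:
            key.append(rank)
            i += 1
    return key


flj_order = sorted(WORDS, key=key_flj)
ft_jul_order = sorted(WORDS, key=key_ft_jul)

if __name__ == "__main__":
    print("FLJ:   ", "، ".join(flj_order))
    print("FT/JUL:", "، ".join(ft_jul_order))
```

Under these assumptions the first key keeps the بھ words between the بہ words and بیٹا, while the second moves them after every plain ب word, which matches the two lists on the slide.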
![Page 16: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/16.jpg)
ے ی Variation
► FLJ, FT & JUL: بی، بی بی، بے، بیابان
► STCD: بی، بے، بیابان، بی بی
► Middle “yay” predicament: ے or ی? بیکار = ب ے کار; ٹیلیوژن = ٹیل ی وژن
![Page 17: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/17.jpg)
Role of Aerab in Sorting
► Aerab are ignored in the first (primary) pass of sorting an Urdu string: بہار (= بِہار)، بَہانہ، بہائی (= بِہائی)
► However, aerab are relevant in the second pass, when the first pass gives an exact match: بَن، بِن، بُن; سَن، سِن، سُن
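The two passes described above can be sketched as a two-part sort key: base letters first, aerab only as a tie-breaker. The weights are assumptions made for illustration (no aerab < zabar < zer < pesh, and a three-letter partial alphabet); the names `LETTERS`, `AERAB_WEIGHT` and `two_pass_key` are hypothetical.

```python
ZABAR, PESH, ZER = "\u064e", "\u064f", "\u0650"

# Assumed weights: a letter without aerab is lightest, then zabar, zer, pesh.
AERAB_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3}
# Assumed partial alphabet, in order, for the example words only.
LETTERS = ["ب", "س", "ن"]

BAN, BIN, BUN = "ب" + ZABAR + "ن", "ب" + ZER + "ن", "ب" + PESH + "ن"
SAN, SIN, SUN = "س" + ZABAR + "ن", "س" + ZER + "ن", "س" + PESH + "ن"


def two_pass_key(word):
    """Return (primary, secondary): letters first, aerab only to break ties."""
    primary, secondary = [], []
    for ch in word:
        if ch in AERAB_WEIGHT:
            if not secondary:
                raise ValueError("aerab must follow a base letter")
            secondary[-1] = AERAB_WEIGHT[ch]  # attach the mark to the preceding letter
        else:
            primary.append(LETTERS.index(ch))
            secondary.append(0)
    return (primary, secondary)


shuffled = [SUN, BIN, SAN, BUN, SIN, BAN]
ordered = sorted(shuffled, key=two_pass_key)

if __name__ == "__main__":
    print("، ".join(ordered))
```

Because Python compares the tuple element by element, the secondary list is consulted only when the primary lists are identical, which is the "exact match in the first pass" condition on the slide.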
![Page 18: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/18.jpg)
Vocalic Aerab - Zabar, Zer, Pesh
► FT, FLJ, JUL: بَن، بِن، بُن; بَیر، بِیر، بیر
► STCD: بَن، بُن، بِن; سَن، سِن، سُن; بِیر، بیر
![Page 19: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/19.jpg)
Vocalic Aerab – Khari Zabar
► No effect at primary level sorting: مَوسی، مُوسی; اعلا، اعلان، اعلم، اعلی
► No minimal pairs found, so involvement at the secondary level could not be determined
![Page 20: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/20.jpg)
Consonantal Aerab - Hamza
► Ignored at primary level
► Minimal pairs not found to determine secondary level effect
مرا، مرٲت، مراتب، مرام، مرآت
باوا، باٶٹا، باون
![Page 21: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/21.jpg)
Consonantal Aerab - Tashdeed
► Ignored at primary level
► Affects secondary level sorting: “heavier than null”
► Interacts with vocalic aerab
بَرانا، برّانا، بَرایا
بدی، بّدی، بّدیا
بدو، بّدوُ، بّدیا
(all examples from FT)
![Page 22: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/22.jpg)
Ligature-Break (Half Space)
► Ignored at primary level and secondary level: ٹیلیوژن، ٹیلی وژن; ٹیلیفون، ٹیلی فون; بیکار، بے کار
► But given each pair, which word comes first? Tertiary level decision
![Page 23: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/23.jpg)
Word-Break (Normal Space)
► Ignored at primary level?
► American Heritage Dictionary (2nd Collegiate ed.): black art, black bear, blackberry, black box, blacken, Black Death, black gold
► Space ignored at primary level
![Page 24: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/24.jpg)
Word-Break (Normal Space) - II
► FLJ
1. بانگ
2. بانگِ درا
3. بانگ دینا
If sorting were done at word breaks, the order would be 1, 3, 2
So sorting ignores word breaks
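The argument on this slide can be checked with a short sketch that sorts the three FLJ entries both ways. The assumptions are labeled in the code: a hypothetical partial alphabet, zer treated as the only aerab present, and "sorting at word break" modelled as comparing entries word by word with aerab breaking ties inside each word.

```python
ZER = "\u0650"
# Assumed partial alphabet, in order, covering only the three entries below.
LETTERS = ["ا", "ب", "د", "ر", "گ", "ن", "ی"]

BANG = "بانگ"
BANG_E_DARA = "بانگ" + ZER + " درا"
BANG_DENA = "بانگ دینا"
ENTRIES = [BANG_DENA, BANG_E_DARA, BANG]


def word_key(word):
    """(letters, aerab) key for one word; aerab only breaks ties between equal letters."""
    primary, secondary = [], []
    for ch in word:
        if ch == ZER:
            secondary[-1] = 1  # zer attaches to the preceding letter
        else:
            primary.append(LETTERS.index(ch))
            secondary.append(0)
    return (primary, secondary)


def key_ignoring_spaces(entry):
    """Treat the whole entry as one string with the word breaks removed."""
    return word_key(entry.replace(" ", ""))


def key_word_by_word(entry):
    """Compare the first words completely before looking at the second words."""
    return [word_key(word) for word in entry.split()]


order_ignoring_spaces = sorted(ENTRIES, key=key_ignoring_spaces)
order_word_by_word = sorted(ENTRIES, key=key_word_by_word)

if __name__ == "__main__":
    print("spaces ignored:", " | ".join(order_ignoring_spaces))
    print("word by word:  ", " | ".join(order_word_by_word))
```

With spaces removed the entries come out as 1, 2, 3, the order printed in FLJ; compared word by word, the zer on the first word of entry 2 pushes it after entry 3, giving 1, 3, 2.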
![Page 25: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/25.jpg)
Conclusions: Urdu Collating Sequence
► Multi-level complex problem
► Pre-processing: contractions (ب + ھ → بھ)
► Primary level: characters
► Secondary level: vocalic aerab; consonantal aerab; interaction of vocalic and consonantal aerab; others (?)
► Tertiary level: ligature break; others (?)
![Page 26: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/26.jpg)
What Needs to be Done: Urdu
► If required, revisit and revise the Urdu character set
► Extensive work on sorting has been done at the linguistic level by NLA and UDB. Need to: standardize it; publicize it
► Need to develop this at the computational level by building a Collation Element Table to generate sort keys: standardize it; publicize it
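To make the idea of a Collation Element Table concrete, here is a minimal sketch of how such a table could generate sort keys. Every weight in it is invented for illustration: the partial alphabet, the aerab weights, the use of U+200C (zero width non-joiner) for the ligature break, and the choice to place the unbroken spelling first are all assumptions, not the NLA ordering. The three-level key layout loosely follows the approach of the Unicode Collation Algorithm.

```python
ZABAR, ZER, PESH = "\u064e", "\u0650", "\u064f"
ZWNJ = "\u200c"  # assumed encoding of the ligature break (half space)

# Hypothetical partial alphabet in collation order; "بھ" is a contraction.
LETTER_ORDER = ["ا", "ب", "بھ", "ٹ", "ژ", "ل", "ن", "و", "ی"]

# Each entry is (primary, secondary, tertiary); 0 means "ignored at this level".
CET = {}
for rank, element in enumerate(LETTER_ORDER, start=1):
    CET[element] = (rank, 1, 1)      # letters: common secondary and tertiary weight
for rank, mark in enumerate([ZABAR, ZER, PESH], start=2):
    CET[mark] = (0, rank, 0)         # aerab: secondary difference only
CET[ZWNJ] = (0, 0, 2)                # ligature break: tertiary difference only


def collation_elements(text):
    """Yield weight triples, preferring a two-character contraction when one exists."""
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CET:
            yield CET[pair]
            i += 2
        else:
            yield CET[text[i]]
            i += 1


def sort_key(text):
    """All primary weights, then secondary, then tertiary, each level ended by 0."""
    elements = list(collation_elements(text))
    key = []
    for level in range(3):
        key.extend(e[level] for e in elements if e[level] != 0)
        key.append(0)
    return tuple(key)


PLAIN, BAN, BIN = "بن", "ب" + ZABAR + "ن", "ب" + ZER + "ن"
BETA, BHABI = "بیٹا", "بھابی"
TV_JOINED, TV_BROKEN = "ٹیلیوژن", "ٹیلی" + ZWNJ + "وژن"

WORDS = [BIN, TV_BROKEN, BHABI, BAN, TV_JOINED, PLAIN, BETA]
ordered = sorted(WORDS, key=sort_key)

if __name__ == "__main__":
    print("، ".join(ordered))
```

Because each key is a plain tuple of integers, the same table could later be standardized and published as data, while the key-generation code stays unchanged.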
![Page 27: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/27.jpg)
What Needs to be Done: Other Languages of Pakistan
► Need to work towards standardization of: character set; collating sequence
► Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization
► Need to develop Collation Element Tables for these languages for sorting
![Page 28: Urdu Character Set and Collating Sequence](https://reader036.vdocuments.us/reader036/viewer/2022062315/5681595d550346895dc699ee/html5/thumbnails/28.jpg)
Thank you
Questions?