Multilingual Issues of Open Source ILS
Jason (Qing) Zou, Systems Librarian
Lakehead University
Grace (Guoying) Liu, Systems Librarian
University of Windsor
January 31, 2009
-
Outline
Writing Systems
Romanization
Unicode
Open Source ILS
Simplified Chinese Version of Evergreen
Conclusions
-
Boroditsky, L., 2002
-
Writing Systems
Represent the sounds of a language by written or printed symbols (WordNet)
Requirements:
– Defined base elements/symbols (scripts)
– Rules and conventions
– A language (generally a spoken language)
-
Types of Writing Systems
Logographic (Chinese characters: CJK)
Alphabetic (e.g. Latin, Cyrillic)
Abjad (e.g. Arabic, Hebrew)
Abugida (e.g. Devanagari in India, Canadian Aboriginal Syllabics)
-
Scripts
10 major scripts, used to write ~95% of all languages
Roman abcdeéèêœ …
Greek αβγδε …
Cyrillic авгдеж …
Hebrew אבגדה …
Arabic ا ب ت ث ج ح …
Indic (11) अआइईउऊ …
Thai กขฃคฅฆ …
Japanese
– Hiragana あいうえお…
– Katakana アイウエオ…
Korean 가각갂갃간…
Chinese 甲乙丙丁…
Arsenault, C., 2003
-
Multilingual Information Systems
Contains records in more than one language
The system interface is in more than one language
The system is able to display text in more than one script
The system allows the end user to build queries in more than one script
Arsenault, C., 2003
-
Multilingual Information Systems
Two models for multi-script records in MARC21:
Model A
– original scripts in 880 fields
– primary descriptive fields using Romanized form
Model B
– transcribe data directly into regularly tagged fields
-
000 02333cam a22004454a 450
245 00 |6 880-01 |a Zhong guo wen hua jing dian / |c [Zhu Xi deng zhu ; zhu bian Ren Jiyu ; zhi xing zhu bian Pan Yuan].
250 __ |6 880-02 |a Di 1 ban
260 __ |6 880-03 |a Hangzhou : |b Xi leng yin she chu ban she, |c 2007.
300 __ |a 14 v. ; |c 29 cm.
…
880 00 |6 245-01/$1 |a 中國文化經典 / |c [朱熹等著 ; 主編任繼愈 ; 執行主編潘淵].
880 __ |6 250-02/$1 |a 第1版
880 __ |6 260-03/$1 |a 杭州: |b 西冷印社出版社, |c 2007.
880 1_ |6 700-04/$1 |a 朱熹, |d 1130-1200.
880 1_ |6 700-05/$1 |a 任繼愈, |d 1916-
880 1_ |6 700-06/$1 |a 潘淵.
From Yale University Catalogue
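The |6 linkage between a Romanized field and its 880 counterpart can be followed mechanically. The sketch below is an illustration only: it works on plain strings copied from the record above rather than on real MARC data, and the name `pair_linked_fields` and the regular expression are assumptions, not part of any ILS or MARC library.

```python
import re

# (tag, field text) pairs copied from the Model A record; only |6 and |a are kept.
fields = [
    ("245", "|6 880-01 |a Zhong guo wen hua jing dian /"),
    ("250", "|6 880-02 |a Di 1 ban"),
    ("300", "|a 14 v. ; |c 29 cm."),
    ("880", "|6 245-01/$1 |a 中國文化經典 /"),
    ("880", "|6 250-02/$1 |a 第1版"),
]

# Subfield 6 holds "<linked tag>-<occurrence number>", optionally followed by a script code.
LINK = re.compile(r"\|6 (\d{3})-(\d{2})")

def pair_linked_fields(fields):
    """Return {(tag, occurrence): (romanized text, vernacular text)}."""
    romanized, vernacular = {}, {}
    for tag, text in fields:
        match = LINK.search(text)
        if not match:
            continue  # field has no linkage, e.g. 300
        linked_tag, occurrence = match.groups()
        if tag == "880":
            # An 880 names the regular field it belongs to.
            vernacular[(linked_tag, occurrence)] = text
        else:
            # A regular field points at 880; key it by its own tag.
            romanized[(tag, occurrence)] = text
    return {key: (romanized[key], vernacular[key])
            for key in romanized if key in vernacular}

pairs = pair_linked_fields(fields)
for key, (roman, vern) in pairs.items():
    print(key, "|", roman, "|", vern)
```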
-
1001 |6 01 |a Chen, Xiyong.
1001 |6 01 |a 陳錫勇.
24510 |6 02 |a Guodian Chu jian Laozi lun zheng / |c zuo zhe Chen Xiyong.
24510 |6 02 |a 郭店楚簡老子論證 / |c 作者陳錫勇.
250 |6 03 |a Chu ban.
250 |6 03 |a 初版.
260 |6 04 |a Taibei shi : |b Li ren shu ju, |c 2005.
260 |6 04 |a 台北市 : |b 里仁書局, |c 2005.
…
70002 |6 06 |a Laozi. |t Dao de jing.
70002 |6 06 |a 老子. |t 道德經.
From McGill University Catalogue
-
Multilingual Information Systems
Non-Roman data in North American OPACs
[Table: Romanization vs. vernacular data under cataloguing and retrieval, with the columns Stored?, Displayed? and Indexed?; the yes/no cell values are not recoverable from the transcript]
Arsenault, C., 2003
-
Romanization
Representation of a written word or spoken speech with the Roman alphabet
Methods
– Transliteration: written text (e.g. Russian)
– Transcription: spoken word (e.g. CJK)
-
Transliteration
Language and Library
-
Transcription
新年好 xin nian hao
温莎大学 wen sha da xue
刘国英 liu guo ying
明清小说比较研究 Ming Qing xiao shuo bi jiao yan jiu
李白和他的诗歌 Li Bai he ta de shi ge
-
Chinese Romanization
Only transcription is possible
Two Romanization systems for bibliographic control in North America:
– Wade-Giles (through October 2000)
– Pinyin (after October 2000)
唐宋全诗
– Wade-Giles: T‘ang2 Sung4 ch‘üan2 shih1
– Pinyin: Táng Sòng quán shī
-
Pinyin 拼音
Literally “spell the sound”
Based on Hanyu Pinyin (Chinese Phonetics), which was adopted in 1958 by mainland China
Used for many years in libraries in Europe and Australia
Oct. 1, 2000: LC and other libraries in the US adopted Pinyin
-
Issues of Pinyin
Chinese characters
– 47,043 in 1716 (康熙字典)
– ~60,000 in 1990 (漢語大字典)
– sharing only around 1,300 syllables in spoken Chinese (Arsenault, 2001)
High level of homophonous ambiguity
– liu: 刘 六 流 留 柳 … (over 30 possibilities)
-
Issues of Pinyin
Word division (syllable integration)
下午我去图书馆了. (I went to the library this afternoon)
Pinyin: Xia wu wo qu tu shu guan le.
or: Xiawu wo qu tushuguan le.
-
Issues of Pinyin
No consistent rules on syllable integration
中国话
– zhong guo hua
– zhong-guo hua
– zhongguo hua
– zhongguohua
More difficult to form queries
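One way a system might cope with inconsistent syllable grouping is to compare Romanized strings only after stripping spaces and hyphens. The sketch below illustrates that idea; it is not something the slides describe, and `normalize_pinyin` is a hypothetical helper name.

```python
import re

def normalize_pinyin(text):
    """Collapse spacing and hyphenation so syllable grouping no longer matters."""
    # Lowercase, then drop whitespace and hyphens between syllables.
    return re.sub(r"[\s\-]+", "", text.lower())

# The four groupings of 中国话 listed above.
variants = ["zhong guo hua", "zhong-guo hua", "zhongguo hua", "zhongguohua"]
keys = {normalize_pinyin(v) for v in variants}
print(keys)
```

The trade-off is that stripping boundaries creates new ambiguity: "xi an" (two syllables) and "xian" (one syllable) collapse to the same key.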
-
Issues of Pinyin
A standard based on Mandarin
In cataloguing, consistency is impossible to maintain
Infrequently used characters may be impossible to access through phonetic scripts
-
ASCII
-
Language and Unicode
What is Unicode?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
http://www.unicode.org/standard/WhatIsUnicode.html
-
Unicode
Support by large software companies
Support by organizations
Support by countries
-
UTF-8
Unicode Transformation Format
Code point range            UTF-8 byte sequence
U-00000000 – U-0000007F     0xxxxxxx
U-00000080 – U-000007FF     110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
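The bit patterns above translate directly into shifts and masks. The following is a minimal sketch of an encoder for a single code point, written only to illustrate the layout; `utf8_encode` is a hypothetical name, and the sketch does not reject surrogates or values above U+10FFFF the way a production encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point using the four byte patterns listed above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x200000:        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the ranges listed above")

# Compare against Python's built-in encoder for one character per range.
for ch in "Aé家\U0001F600":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```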
-
Unicode: examples
          家 (home)      書 (book)      书 (book)
UTF-8     E5 AE B6       E6 9B B8       E4 B9 A6
UTF-16    5B B6          66 F8          4E 66
UTF-32    00 00 5B B6    00 00 66 F8    00 00 4E 66
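The byte values in this example can be reproduced with Python's built-in codecs. In the sketch below, the big-endian codec variants are chosen so that no byte-order mark is prepended; `encodings` is a hypothetical helper name.

```python
def encodings(ch):
    """Return the hex bytes of one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": ch.encode("utf-8").hex(" ").upper(),
        "UTF-16": ch.encode("utf-16-be").hex(" ").upper(),  # big-endian, no BOM
        "UTF-32": ch.encode("utf-32-be").hex(" ").upper(),  # big-endian, no BOM
    }

for ch in "家書书":
    print(f"U+{ord(ch):04X}", encodings(ch))
```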
-
Commercial ILS
Major commercial ILS products
– Voyager
– Aleph
– Symphony
– Millennium
-
Open Source ILS
Major Open Source ILS Products
Koha
Evergreen
-
Open Source ILS
What is open source?
-
Open Source Software
Category               Open source                                 Proprietary
Programming language   Perl, PHP, Python                           Microsoft C++, Visual Basic, C#
Office                 OpenOffice                                  Microsoft Office
Browser                Firefox                                     Internet Explorer
Web server             Apache                                      Internet Information Server
Database               MySQL, PostgreSQL                           Oracle, DB2, SQL Server
Operating system       Linux variants (Red Hat, Debian, Ubuntu)    Windows
Library Technology Reports 2008, vol. 44, no. 8
-
Integrated Library Systems
Regular modules
– Circulation
– Acquisition
– Cataloguing
– System Administration
– OPAC
-
Language Issues in ILS
Store
Display
Index/Search
Sort
-
Display
Language Issues in ILS
-
Language Issues in ILS
Index/Search
Sort
-
Koha
Features
– Debian/Linux, Windows
– Perl
– MySQL/Zebra
– LibLime company
-
http://koha.wikispaces.com/
-
Koha
Debian/Linux
Perl
MySQL/Zebra
Large community
Fully developed ILS
Full standards support
Language support
-
Evergreen
Debian/Linux, Windows
OpenSRF
PostgreSQL
C, Perl
Equinox Software company
-
Evergreen in Canada
BC public consortium (Sitka)
UPEI
Project Conifer
-
Chinese Localization
Introduction
Goals
-
Chinese Localization
Features crucial to localization:
Supports Unicode
Indexing/Searching: PostgreSQL Tsearch2
-
Simplified Chinese Version
Interface localization
Indexing
Searching
Sorting
Others
-
Simplified Chinese Version
Interface
– Utilizes Pootle
– Uses gettext tools to convert Portable Object (PO) files to Document Type Definition (DTD) files
– Updates DTD files hourly
-
Indexing
Tsearch2 (PostgreSQL full-text search engine)
Configure Tsearch2 to handle Chinese records
Utilize Chinese word segmentation algorithms
-
Indexing
Tsearch2 (to_tsvector, to_tsquery)
# SELECT to_tsvector('Social history of China') ;
to_tsvector
------------------------------------
'china':4 'social':1 'histori':2
(1 row)
-
Indexing
Tsearch2 (to_tsvector, to_tsquery)
# SELECT to_tsvector('中国社会历史');
to_tsvector
------------------------------------
'中国社会历史':1
(1 row)
-
Indexing
Tsearch2 (to_tsvector, to_tsquery)
# SELECT olis_cn_index('中国社会历史') as title;
title
------------------------------------
'中':1 '史':6 '国':2 '会':4 '历':5 '社':3 '中国':7 '国社':8 '会历':10 '历史':11 '社会':9
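The positions in this output are consistent with indexing every single character followed by every pair of adjacent characters. The sketch below reproduces that numbering in Python. It is a guess at the behaviour based only on the output shown, not the actual `olis_cn_index` implementation, and `cjk_ngram_index` is a hypothetical name.

```python
def cjk_ngram_index(text):
    """Map each unigram and bigram of `text` to a 1-based position."""
    tokens = list(text)                                       # unigrams: 中, 国, ...
    tokens += [text[i:i + 2] for i in range(len(text) - 1)]   # bigrams: 中国, 国社, ...
    index = {}
    for position, token in enumerate(tokens, start=1):
        # Simplification: keep only the first position of a repeated token.
        index.setdefault(token, position)
    return index

index = cjk_ngram_index("中国社会历史")
print(index)
```

Bigrams such as 国社 are not words, but indexing them alongside real words like 中国 and 社会 avoids the need for a dictionary at indexing time.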
-
Searching
Searching is the reverse of indexing
Utilize Chinese word segmentation algorithms to split the search phrase into a query the system can understand
-
Searching
The search phrase “中国社会历史” is rewritten as “中 & 国 & 社 & 会 & 历 & 史” and then fed into the system
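The query rewriting described here amounts to joining the characters of the phrase with the AND operator. Below is a minimal sketch under that assumption; `build_query` is a hypothetical name, and the sketch does not escape characters that have special meaning in a tsquery.

```python
def build_query(phrase):
    """Join the non-space characters of a phrase with ' & ' (logical AND)."""
    chars = [ch for ch in phrase if not ch.isspace()]
    return " & ".join(chars)

query = build_query("中国社会历史")
print(query)
```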
-
Sorting
Romanize Chinese characters with tone information
Sort the corresponding Pinyin of Chinese records to obtain A-Z order
-
Sorting
# SELECT utf8_pinyin('中国社会历史') AS pinyin;
pinyin
-----------------------------
zhong4guo2she4hui4li4shi3
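Sorting by Romanized key can be sketched in a few lines. The lookup table below is a tiny hand-made stand-in with readings taken from the output above; a real system needs a full character-to-Pinyin dictionary and a policy for characters with several readings. `PINYIN` and `pinyin_key` are hypothetical names, not the `utf8_pinyin` implementation.

```python
# Hypothetical, hand-made table: one reading with tone number per character.
PINYIN = {
    "中": "zhong4", "国": "guo2", "社": "she4",
    "会": "hui4", "历": "li4", "史": "shi3",
}

def pinyin_key(title):
    """Build a sort key by replacing each character with its Pinyin reading."""
    # Characters missing from the table are left unchanged.
    return "".join(PINYIN.get(ch, ch) for ch in title)

titles = ["社会", "历史", "中国"]
ordered = sorted(titles, key=pinyin_key)
print([(t, pinyin_key(t)) for t in ordered])
```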
-
Import Issues
CMARC
CNMARC
880 fields
-
Conclusions
Unicode is not the answer for everything
More practical to tweak a system language by language
-
Thank you for your interest and attention!
Any Questions?
-
Contact Information
Jason Zou
Guoying [email protected]