taiwan’s ndap language archives project: from bronze...

38
Taiwan’s NDAP Language Archives Project: From bronze inscription texts to Austronesian field recording Cui-xia Weng*, Ru-yng Chang*, Elizabeth Zeitoun*, Chao-jung Chen*, Derming Juang*, Chu-ren Huang*, and Chin-chuan Cheng # *Academia Sinica and # City University of Hong Kong Abstract The Language Archives Project is part of Taiwan's National Digital Archives Program (NDAP). The project digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. The goal is two-fold: both to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives. Based on these two goals, the main challenges of this project are: to provide versatile yet uniform presentation of different text types, to account for language change, and to account for language variation. We take two archives of contrasting characteristics to illustrate how these challenges are met. The Bronze Inscription archives deal with an archaic 1

Upload: others

Post on 22-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Taiwan’s NDAP Language Archives Project:From bronze inscription texts to

Austronesian field recording

Cui-xia Weng*, Ru-yng Chang*, Elizabeth Zeitoun*, Chao-jung Chen*, Derming Juang*, Chu-ren Huang*, and Chin-chuan Cheng#

*Academia Sinica and #City University of Hong Kong

Abstract

The Language Archives Project is part of Taiwan's National Digital Archives Program (NDAP). The project digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. The goal is two-fold: both to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives. Based on these two goals, the main challenges of this project are: to provide versatile yet uniform presentation of different text types, to account for language change, and to account for language variation.

We take two archives of contrasting characteristics to illustrate how these challenges are met. The Bronze Inscription archives deal with an archaic language preserved in a written form that is significantly different from Modern Chinese writing. The Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We show how OLACMS lays the common ground for content documentation of these contrasting archives.

First, for the Bronze Inscription Archives, the fundamental issue is how to represent the archaic inscribed written form and to establish the direct correspondences with modern writing systems at the same time. We adopt the Intelligent Character System to deal with this issue. Basically, although glyphs vary greatly, the composition of Chinese characters from basic glyph remains regular. Hence an

1

Page 2: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

system based on composition of basic glyphs will not only help with diachronic Chinese archives but can also deal with cross-lingual variations (e.g. Korean and Japanese Kanji, new characters from Hong Kong, etc.).

Second, the Formosan languages are indigenous languages in Taiwan that are also thought to be close to the common ancestor of Austronesian languages. The first issue we face is that of establishing orthography, which is solved by the common use of IPA among field linguists. The second issue involves establishing segmentation and tagging standards. The third issue involves audio-representation of field recording. And the last issue involves mapping the lexicon to GIS (geographic information system) to represent language variations and contrasts.

1.0 Introduction

The Language Archives Project is part of Taiwan’s five-year Digital Archives Program (NDAP), which was launched in 2002. The NDAP Language Archives Project, carried out primarily at Academia Sinica, digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. In the face of these diverse data types, how to digitize and annotate data properly, and how to provide versatile yet uniform presentation to account for language change and language variation are two main challenges for this project. Two goals of this project are: both to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives.

In this paper, we will first briefly describe each sub-project in this project. We then discuss how OLACMS lays the common ground for content documentation of these contrasting archives. In ensuing more detailed discussion, two archives of contrasting characteristics will be focused on to illustrate how we meet the challenges mentioned above. The Bronze Inscription archives deal with an archaic language preserved in a written form that is significantly different from Modern Chinese writing. Hence we take

2

Page 3: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

examples from this archive to show how the missing characters problem is solved in our project. Lastly, the Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We offer our example of how to create a multimodal archive for endangered languages. The paper ends with a short conclusion.

2.0 Organization of the NDAP Language Archives Project

The Language Archives Project has two branch projects on Chinese and Formosan Language archives. The former is further divided into 5 sub-projects. These five sub-projects represent different language usage and historical period of Chinese. Formosan Language archives project aim to preserve the endangered Formosan Austronesian languages with corpora, lexicons and grammars of each language.

The Formosan languages are Austronsian languages. There is a great diversity and complexity among these indigenous languages spoken in Taiwan. Unfortunately, most of these languages are endangered. Hence, we aim not only to preserve their linguistic data and structures, we would also like to preserve some of their cultural heritage through the preservation of their languages. This is why audio story telling, as well as mapping to GIS is used. More detailed on our approaches and experience will be given in a later session.

Among the Chinese archives, “Early Mandarin Chinese Lexicon” is designed as part of the Lexical Knowledgebase (LKB) tracing the historical changes of the Chinese language. The LKB will contain a series of synchronic lexicon from Pre-Qin to Modern Mandarin. The archived materials include written records of lectures, documents of laws and decrees, and fiction and drama of the Ming-Qing period.

The “Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB)” project aims to build a lexicon of Yin, Zhou, and Chun Qiu bronze inscriptions (from 13th century through 3rd century BC), and

3

Page 4: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

the bamboo manuscripts of the Warring States (475BC-221BC). This will be one of the earliest lexicon in the LKB series of Chinese language evolution. For a long time, manual copying and rubbings reproduction are two ways to preserve archaic written languages. However, if a lexical database can be built to preserve these characteristic ideograms, it would make a great progress in archives of ancient Chinese culture. The first difficulty that has to be conquered while developing this kind of database is missing character problem in computers. This project has adopted the Intelligent Character System to solve this problem. We will describe this system in the next session.

The “Modern Chinese Corpus and Treebank” will complete a 10 million-word tagged and balanced corpus for modern Mandarin, as well as complete a grammatically annotated treebank. The emphasis will be on value-added applications such as information search, retrieval, automatic Q&A, and summarization.

The “New Age Corpus: Linguistic Representations and Archives of Multimedia Data” project documents the everyday usage of modern Chinese, such as oral communication, discussion topics, lexicons, gestures and facial expressions, in Taiwan in digital multimedia forms.

The “Southern-Min Archive: A Database of Historical Change in Language Distribution” project is a new addition in 2003. It aims to provide both a historical depth and sociological variation to the archives of Chinese languages in Taiwan.

All subsidiary projects of the Language Archives Project are listed below for easier reference. In addition, the three main components of the Linguistic Anchoring project are also included. The Linguistics Anchoring project is a NDAP technology research and development project. Its goal is to provide the infrastructure for language-based knowledge processing and management. The anchoring reference of this project will transfer the Language archive contents into inter-operable information. The website is at

4

Page 5: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

http://corpus.ling.sinica.edu.tw/project/LanguageArchive/lc_index.html.

Table 1: All subsidiary projects of the Language Archives Project, and the Linguistics Anchoring Project

Diagram 1, using the numeral and alphabetical designation of each subsidiary project given above, illustrates the functional structure of the Language Archives Project. The green column at the center represents the standards and tools supporting the digitizing and archiving of language data. Each of the peripheral circles extended from the column represents a language group or variety. The diagram shows how a sharable and reusable set of technologies can be used to support a wide range of language archives. This is exactly the design feature of OLAC, which we adopt and will discuss in the following section.

Diagram 1: Functional Structure of the Language Archives Project

Language Archives Project1. Chinese Languages Archives and Structure

1.1 Early-Modern Chinese Lexicon1.2 Lexicon of Pre-Qin Bronze Inscriptions and Bamboo

Scripts (LBB)1.3 Modern Chinese Corpus and Treebank1.4 New Age Corpus: Linguistic Representations and

Archives of Multimedia Data1.5 Southern-Min Archive: A Database of Historical

Change in Language Distribution2. Formosan Language Archives

Linguistics Anchoring ProjectA. Linguistic anchoringB. Technological supportC. International standards, such as OLAC, ISLE and ISO etc.

5

Page 6: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

3.0 The Application of OLAC to the NDAP Language Archives Project

The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Three primary standards serve to bridge the multiple gaps which now lie in between language resources and users: (1) OLACMS: the OLAC Metadata Set (Qualified DC, Dublin Core), (2) OLAC MHP: refinements to the OAI (Open Archives Initiative) protocol, and (3) OLAC Process: a procedure for identifying Best Common Practice Recommendations. On December 2002 there was an OLAC Workshop (IRCS Workshop on Open Language Archives) in Philadelphia which revised the OLAC standards and controlled vocabularies, reviewed OLAC archives and services, and considered proposals for new activities. The metadata format, OLAC extensions, defining a third-party extension and documenting an extension are described in the OLAC Metadata

WEB INTERFACE

(1+2+B)

ARCHIVETOOL

(1+A+B)

ARCHIVESTRUCTURE

(1+A)

Modern general language

(1.3)

Multimedia and colloquial language

(1.4)

Multilingual Communication

(1.5, A)

Southern-Min and Haaka languages

(1.5)

METADATA(A+B+C)

Aboriginal languages

(2)

Chinese traditional languages(1.1) (1.2)

6

Page 7: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

1.0 version.

The NDAP language archives plan to be OLAC compliant. Three of the resultant archives are already registered with the OLAC repository: Academia Sinica Balanced Corpus or Modern Chinese, Academia Sinica Formosan Language Archive, and Academia Sinica Tagged Corpus of Early Mandarin Chinese. These resources have also been registered at the OLAC-compliant Asian Language Resources Repository (hosted by Tokyo Institute of Technology, yet to be released.) Chang and Huang (2002) reported on the application of OLACMS to the Language Archives Project. They found OLACMS to provide a solid basis that will allow productive and in depth description of our archives with extensions and elaborations. The additional information that we need are Temporal and Geographic Location, as well as textual information such as style, mode, genre, and medium. The suggested additions and elaborations are discussed in section 3.1.-3.4. 3.1 Temporal and Geographic Location

Since China used a different calendar system until early 20th century, all temporal description of inherited Chinese archives do not conform to the current DC standard. The sub-type of Chinese calendar will then include time, dynasty name, state name, and emperor’s reign. We may also add other chronological methods, such as lunar or solar calendar. Take the Academia Sinica Ancient Chinese Corpus for example. Its coverage is Early Mandarin Chinese. The users will be able to refer to a historical calendar and find that the time equals to the dynasties of Yuan, Ming, and Qing. And will be able to convert the time to western calendar using the conversion table provided by Academia Sinica. It offers conversion table for the past 2000 years between Chinese and Western calendars.

When Coverage has a spatial refinement, a location can have different names because of the unit used in cataloguing, as well as because of temporal and linguistic variations. When describing spatial coverage, we need to know more than a place name. E.g.

7

Page 8: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Washington State is different from Washington D.C. and Taipei City is different from Taipei County. Hence we need to define the sub-types of spatial description that include Continent, Country, Administrative Division, Longitude, Latitude, Address, etc..

3.2 Mode and Genre

Each text in Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus) is marked up with five textual parameters: Mode, Genre, Style, Topic and Medium. These are important textual information that needs to be catalogued in metadata.

Table 2: The relation between Mode and Genre of Sinica Corpus (Chinese Knowledge Information Processing (CKIP) 1993)

Mode GenreWritten Reportage, Commentary, Advertisement,

Letter, Announcement, Fiction, Prose, Biography & Diary, Poem, Manual

Written-to-be-spoken Script, SpeechSpoken ConversationSpoken-to-be-written Analects, Speech, Meeting Minute

3.3 StyleThere are four styles that are differentiated in Sinica Corpus: narrative, argumentative, expository, and descriptive.

3.4 MediumSinica Corpus specifies the media of the language resources as: Newspaper, General Magazine, Academic Journal, Textbook, Reference Book, Thesis, General Book, Audio/Visual Medium, Conversation/Interview.

Table 3: Topic of Sinica Corpus (CKIP 1993)

Primary SubPhilosophy Thoughts | Psychology | Religion |Natural Science

Mathematics | Astronomy | Physics | Chemical | Mineral | Creature | Agriculture | Archeology |

8

Page 9: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Primary SubGeography | Environmental Protection | Earth Science | Engineering |

Social Sciences

Economy | Finance | Business & Management | Marketing | Politics | Political Party | Political Activities | National Policy | International Relations | Domestic Affairs | Military |Judicature | Education | Transportation | Culture | History | Race | Language | Mass Media | Public Welfare | Welfare | Personnel Matters | Statistical Survey | Crime | Calamity | Sociological Facts |

Arts Music | Dance | Sculpture | Painting | Photography | Drama | Artistry | Historical Relics | Architecture | General Arts |

General/Leisure

Travels | Sport | Foods | Medical Treatment | Hygiene | Clothes | Movie and popular arts | People | Information | Consume | Family |

Literature Literary Theory | Criticism | Other literary work | Indigenous Literature | Children’s Literature | Martial Arts Literature | Romance |

An example for the adoption follows: for a Sinica Corpus text with a Topic of Arts and a sub-topic of Music.

<topic lang=x-sil-CHN>Art/Music</topic>

4.0 The Intelligent Character System

For the Bronze Inscription Archives, the fundamental issue is how to represent the archaic inscribed written form and to establish the direct correspondences with modern writing systems at the same time. We adopt the Intelligent Character System to deal with this issue.

4.1 Structure of the System

The Intelligent Character System, which was developed by the Chinese Document Processing Lab of the Institute of Information Science at Academia Sinica, mainly contains four parts: components, glyphs, operators and production rules. Components

9

Page 10: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

are the basic unit of glyph. Take “楹(ying2)” for example. “木(mu4)” and “盈(ying2)” are two components that compose “楹(ying2)”, and further, “盈(ying2)” can be decomposed into two components: “夃(gu1)” and “皿(min3)”. Also, “夃(gu1)” can further be divided into “乃(nai3)” and “又(you4)”. In this system each component is stored as a file, and each glyph is stored as a folder. Figure 1 gives the glyph structure of “楹(ying2)”, and its component directory.

Figure 1: The glyph structure of “楹(ying2)”

There are three basic operators to express the structure of a glyph: horizontal, vertical and contained composition. Again, take “楹(ying2)” for example. The composition of “木(mu4)” and “盈(ying2)” is horizontal composition; “夃(gu1)” and “皿(min3)” is vertical one; and “又(you4)” is contained within “乃(nai3)”. Table 4 gives production rules of “楹(ying2)” to illustrate how glyphs are composed.

Table 4: Decomposition of “楹(ying2)”

Operator Symbols Production rules

Horizontal composition 楹 = 木 盈Vertical composition 盈 = 夃 皿Contained composition 夃 = 乃 又

10

Page 11: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Although glyphs vary greatly, the composition of glyphs from basic components is basically according to these three production rules. Hence an encoding scheme based on the composition of basic components will not only help with diachronic Chinese archives but can also deal with cross-lingual variations that also uses glyphs to compose their language characters (e.g. Korean and Japanese Kanji, new glyphs from Hong Kong, etc.).

4.2 User Interface

Intelligent Character System provides tools and Hanzi (Chinese character) glyph database to let users browse and transform missing characters through Java Applet on the webpage. In addition, users can edit and browse missing characters under the Microsoft Office environment. Besides, a user interface was provided to search and access missing characters. Diagram 2 gives a chart of system’s structure to show the processing between tools and the database.

Diagram 2: Database and tools for the Intelligent Character System

The basic idea of the intelligent Character System comes from the traditional study on Chinese “forms” of characters, especially the knowledge about glyphs. Through assistance of modern

Fonts of Small Seal Style and

missing characters

Hanzi Glyph Database

User Interface for Hanzi Glyph

Database

Java Applet

COM Add-in for missing

characters processing

Browser

Microsoft Office

11

Page 12: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

technology, this system not only solves missing character problem in the Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts project, but also improves the performance of the existing Hanzi processing system for possible application, such as data sharing, in the future.

5.0 The Formosan Language Digital Archive

The Formosan languages are indigenous languages in Taiwan that are also thought to be close to the common ancestor of Austronesian languages. According to linguistic studies, there are still 15 extant languages (Thao, Kavalan, Pazeh, Atayal, Saisiyat, Bunun, Tsou, Rukai, Paiwan, Puyuma, Amis, Seediq, Saaroa, Kanakanavu, Yami), but declining rapidly. So far, this project has built four out of six Rukai dialects corpora (including Mantauran, Tanan, Maga, and Tona), and can be browsed and searched via internet as well. Others are being added to the archive progressively. It is hoped that by the end of this project, there will be at least nine Formosan languages archived.

The Formosan language archive, which includes both Chinese and English browsing display, contains three main types of information databases: (1) corpora with annotated texts, (2) a language GIS (geographic information system), and (3) four bibliographical databases. These respective databases allow all kinds of research and are briefly introduced below.

5.1 Linguistic CorporaThe collection of the Formosan language corpora includes folktales, narratives, conversations, songs and elicited sentences. The last two categories are not yet available on the web. The structure of an annotated text comprises of the transcription of the original language, divided into paragraphs, sentences; glosses; and free translations. IPA symbols are used to transcribe collected text. This is based on two reasons: (1) there is no standardized writing system for Formosan languages, and (2) IPA is an international standard for transcribing sound recordings in other Archives projects.

12

Page 13: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Glosses, on the other hands, can be provided at the word level (stems) or at the morphemic level (roots and affixes). For Rukai corpus, morphemic analysis has been adopted for the annotations. The information tagging on each morpheme contains grammatical functions and lexical and syntactic categories. The tags of grammatical functions are according to Formosan Linguistics conventions. A tagset of abbreviations of grammatical functions used in the corpora is shown in table 5. The tags of lexical categories are following the standardization of CKIP (CKIP 1993) but with some reservation. Meanwhile, in addition to annotations, each sentence is heard on an audio output that was digitally recorded in the original file and then transformed into MP3 format. This audio-representation allows users to download recorded sentences, to view and analyze the sound spectrographs, and to process the sound data with sound editing software. Figure 2 gives an example to show how a text is displayed on a webpage interface.

Table 5: Abbreviations of grammatical functions (for Rukai as a pilot study)

Abbreviations Chinese EnglishActNmz 動態名物化 Action nominalizationAgtNmz 主事名物化 Agentive nominalizationCaus 使役 CausativeClsNmz 子句名物化 Clausal nominalizationCnc 讓步 ConcessiveCntrfct 違反事實 CounterfactualDyn 動態 DynamicE 排除式(=我們) ExclusiveFin 限定 FiniteGen 屬格 GenitiveImp 祈使 ImperativeImprs 無人稱 Impersonal pronounI 包含式(=咱們) InclusiveInstNom 工具名物化 Instrument nominalizationLocNom 處所名物化 Locative nominalizationNom 主格 Nominative

13

Page 14: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Neg 否定 NegationNegImp 否定命令 Negative ImperativeNFin 非限定 Non-FiniteObjNom 受事名物化 Objective nominalizationObl 斜格 ObliquePass 被動 PassiveP, plur 複數 PluralRef 反身 ReflexiveRec 相互 ReciprocalRed 重疊 ReduplicationS 單數 SingularStat 狀態 StativeStatNmz 狀態名物化 State nominalizationSubj 虛擬式 SubjunctiveSubjNmz 主語名物化 Subject nominalizationSup 最高級 SuperlativeTempNmz 時間名物化 Temporal nominalizationTop 主題 Topic1 我(們) 1st person2 你(們) 2nd person3 他(們) 3rd person“.” 帶著兩種功能之詞素 Portmanteau morpheme“:” (可區分之)詞綴 (divisible) affix“-” 接詞 affix or clitic

Figure 2: Sentence Display

14

Page 15: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

Besides, a set of metadata regarding general information of a text, such as text profile, fieldwork activity, and management statements, was also developed. This will facilitate data access and sharing with other similar resources in the future. Figure 3 shows a piece of metadata information of a text.

Figure 3: A metadata set of a text

15

Audio Output

English Translation

Transcription

Annotation

Page 16: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

5.2 The Formosan Geographical Information System

As for geographic information database, the language distribution search enables users to learn the geographical distribution of each language and dialect. Another search system, comparative word search, allows users to spot the distribution of cognates/non-cognates within the Formosan languages and identify spatial features. In the future, we hope to add these two functions in this database: (1) a system to observe the expansion or decrease of a particular linguistic community over the last hundred years; and (2) an audio recording mapping system.

5.3 Four Bibliographic Databases

The on-line reference search system is provided for user to access Formosan languages information on linguistic references, indigenous teaching references, indigenous literature, and music references. These pieces of information are regularly updated. And, it is hoped that a complete and abundant Formosan language

16

Page 17: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

bibliographic databases will be achieved to satisfy linguistic worker’s needs.

6.0 Conclusion

Unlike artifact and specimen, the non-physical characteristic of languages is the biggest challenge to language archives. Advances in technology make it possible for us to digitize ancient writing manuscripts, as well as oral or written records of an endanger language. However, preservation goes beyond digitization. In order to prevent the undesirable consequence of the digitized data becoming cold and lifeless digital antiques, we emphasize the reusability, sharability, and accessibility of the archives. This philosophy coincides with the vision and mission of OLAC. In describing the NDAP Language Archives project in Taiwan, we showed digitization of two archives as different as early Chinese documents and endangered Formosan languages can be done under the same project and with the same infrastructure. This is important testimony to the open archives initiative vision. We hope that this work can symbolize a small step to the direction where linguistic and cultural diversity can be accepted and shared by all.

References

Bird, Steven, Gary Simons, and Chu-Ren Huang. (2001). “The Open Language Archives Community and Asian Language Resources”. Paper presented at the 6th Natural Language Processing Pacific Rim Symposium Post-Conference Workshop, Tokyo, Japan. November 30, 2001.

CDP. (2002). Manual for Hanzi Formation Database, (in Chinese), Chinese Document Processing Lab. http://www.sinica.edu.tw/~cdp/zip/hanzi/hzmanual.zip.

Chen, Chao-jung and W. Lin. (2003). “Missing character problem and requirement: A experience from the bronze inscriptions of Shang and Zhou Dynasty”. Paper presented at the

17

Page 18: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

conference on the Guidelines for Digital Archives Technologies. Taipei, Taiwan. April 30~May 1, 2003.

CKIP. (1993). Analysis of Chinese Part of Speech (Zhung Wen Ci Lei Fen Xi). In Chinese. CKIP Technical Report no. 93-05. Taipei: Academia Sinica.

Lin, Hui-chuan. (1999) Let’s talk Mantauran, 1-6. Taipei: Crane Publishing.

Hsieh, Ching-Chun, C. Chang and K. Huang. (1990). “On the Formalization of Glyph in the Chinese Language”, ISO/IEC JTCI/SC18/WG8 and AFII Meeting, Kyoto, Feb, 1990.

Huang, Chu-Ren (2003). “Semantic web, WordNet and Ontology: A talk on knowledge management on future’s web (語意網、詞網與知識本體:淺談未來網路上的知識運籌)”. Information Management for Buddhist Libraries. 33: 6-21.

(2003). “Word Knowledge, World Knowledge, and Ontology: Towards a linguistic infrastructure for knowledge representation and knowledge engineering”. Presented at Language and Knowledge Representation: Mini-workshop on functional approaches to language. Hsinchu: National Jiao-Tung University. February 26, 2003

Ru-Yng Chang, Chu-Ren Huang. (2002). "OLACMS: Comparisons and Applications in Chinese and Formosan Languages". Paper presented at the 19th COLING 2002 Post-Conference Workshop --The 3rd Workshop on Asian Language Resources and International Standardization. Center of Academia Activities, Academia Sinica. Taipei. Taiwan. August 31, 2002.

Simons Gary. (2002). “SIL Three-letter Codes for Identifying Languages: Migrating from in-house standard to community standard”. Paper presented at the International Workshop on Resources and Tools in Field Linguistics (LREC 2002), Las Palmas, Canary Islands. May 26-27, 2002.

Yu, Ching-hua. (2002). “Discussion on the digitization of the Formosan Language Archive: Building up of the architecture

18

Page 19: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

of the archive”. Paper presented at the Firse workshop on the Digital Library Project, Taipei, July 25-26.

Zeitoun, Elizabeth, and Hui-chuan Lin. (2001). We should not forget the stories of the Mantauran, vol.2: Traditional folktales. MS.

. (2003). “We should not forget the stories of the Mantauran, vol.I: Memories of the past”. Language and Linguistics Monograph Series, No. A4. Taipei: Institute of Linguistics (Preparatory Office), Academia Sinica.

, Ching-hua Yu, and Cui-xia Weng. (2003). “The Formosan Language Archive: Development of a Multimedia Tool to Salvage the Languages and Oral Traditions of the Indigenous Tribes of Taiwan”. Oceanic Linguistics, 42.1.

Referential Websites

[1] The Language Archives Project, 2003. Institute of Linguistics, Academia Sinica. http://corpus.ling.sinica.edu.tw/project/LanguageArchive/la_index.html

[2] The National Digital Archives Program, 2003. Taiwan. http://www.ndap.org.tw/

[3] The Early-Modern Chinese Lexicon Project, 2003. http://www.sinica.edu.tw/Early_Mandarin/

[4] Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB), 2003. http://www.sinica.edu.tw/~lbb/

[5] The Modern Chinese Corpus and Treebank Project, 2003. http://www.sinica.edu.tw/SinicaCorpus/

[6] OLAC http:// www.language-archives.org [7] OAI http://www.openarchives.org[8] OLAC Metadata (version 1.0), http://www.language-

archives.org/OLAC/metadata.html [9] Western Calendar and Chinese Calendar Conversion Table of

Academia Sinica Computing Centre. http://www.sinica.edu.tw/~tdbproj/sinocal/luso.html

[10] The Chinese Document Processing Lab, 2003.

19

Page 20: Taiwan’s NDAP Language Archives Project: From bronze ...emeld.org/workshop/2003/ruyng-paper.doc  · Web viewSecond, the Formosan languages are indigenous languages in Taiwan that

http://www.sinica.edu.tw/~cdp/[11] The Formosan Language Digital Archives:

http://www.ling.sinica.edu.tw/Formosan/.[12] SIL http://www.sil.org/[13] Ethnologue, Language of the World,

http://www.ethnologue.com[14] Dublin Core http://dublincore.org/

20