design of a multimedia corpus of austronesian linguisticsciet/form/paper/4.doc · web view 3.2...
TRANSCRIPT
Design of a Multimedia Corpus of Austronesian Linguistics
Zhemin Lin, Li-May Sung, I-wen Su
Graduate Institute of Linguistics of the National Taiwan University
Abstract In this paper, the design of an integrated platform of multimedia on-
line corpora aiming to serve both linguists and the public is introduced along
with database schema and programming details. Compared with the Formosan
language archive of Academia Sinica, our design emphasizes more in terms of
normalization, accessibility and interoperability of the system. The design of
an automatically generated dictionary with cross-references and the capability
of searching the entire database in various ways are also described here.
1 Introduction
The development of natural language processing techniques and dynamic web pages
has generated wide interest in the construction of an integrated platform which en-
ables people to submit, to browse and to search among collected texts in corpora.
However, most online corpora are specially built for experts; they are sentence-based
and do not provide multimedia contents. The NTU corpus of Austronesian lan-
guages1 introduced in this paper is an attempt to construct a multi-lingual online cor-
pus with multimedia contents meeting the needs of both linguists and the public. In
the following sections, we will take a brief review of previous works and then focus
on the features of our current work.
1 http://corpus.linguistics.ntu.edu.tw
106 Zhemin Lin, Li-May Sung, I-wen Su
2 Formosan Language Archive of Academia Sinica
Zeitoun et al. (2003) discussed some of the problems in the conservation of Formosan
Austronesian languages. The continuous enhancement of their work with many
newly designed tools is further described in Zeitoun and Yu (2005). As discussed in
the two articles, fieldwork data are rarely shared in the linguistic community.
Collected materials are sometimes inaccessible even in the office where they are
stored, due to the change of storage media or data damage. One of the most serious
problems is that, although there are elicitated sentences and recordings, few of them
are rearranged and published. As a response to the problems, researchers in the
Academia Sinica have built a Formosan language archive, i.e., an online corpora with
texts, translations, word glosses and sounds from native speakers of 14 languages and
dialects.2
Despite their labour, there are however insufficiencies in their system, one of
them being the theoretical issue: the Sinica corpora are sentence-based, where pauses,
pause fillers, repetitions, intonation contours, IU boundaries and other discoursal
clues are either discarded or missing. A sentence-based corpus excludes important
linguistic information only present in discourse. Words in the system are written in
an ad hoc mixture style via International Phonetic Alphabet (IPA), in a transcription
style that prevents their respective native speakers from using the data directly.
Nearly every word is altered to some extent. Example (1) is a Saisiyat example
extracted from the Sinica archive.
(1)
(a) yao noka maʔiiæh ... hayðaʔ ʔæhæʔ maʔiiæh la m-waaiʔ, yao mina-
ŋaʔŋaʔ nak hini mina-ʃaaəŋ.
(b) ʔinʔalay hikor may nak hini yakin, ʃβət yakin ho.
(c) ʔok-ik ʃəβət, m-waaiʔ nak hini pa-paʃœʃ, yao h<œm>ʃœʃ atomalan.
(Extracted from 05.002a -- 05.002c of “5.我的故事” of the Sinica archive.)
There is so far no dictionary available with cross-referencing function in the
2 http://formosan.sinica.edu.tw/formosan/ch/select_corpus.htm
Design of a Multimedia Corpus of Austronesian Linguistics 107
Sinica corpus, even though cross-referencing for an online corpus is essential for
researchers deal with elicited or authentic data. Like KWIC (Keyword-in-context, cf.
Luhn (1960)), a user can trace a word back to the context where it occurs, and browse
its surrounding IUs. Zeitoun et al (2003) has planed a data schema that ran on
Microsoft Access. Their design, however, cannot take advantage of the SQL92 query
language. Moreover, they designed an XML dialect to improve the interoperability,
which does not encourage researchers to share their collected data in a convenient
way. The Sinica archive, though primitive in design, is the first attempt to provide
public access to the nearly extinct linguistic data, which is an effort highly respectable
by itself.
3 NTU Corpus of Austronesian Languages
The system designed in this paper is based on the NTU corpus of Austronesian
languages. The NTU corpus, first described in Huang, Su, and Sung (2003), is
composed of spoken texts in various languages. Currently NTU Saisiyat corpus
contains 22 texts, 3081 intonation units (IUs) and approximately 10635 words, whose
transcription follows the conventions of Du Bois (1993). There are one conversation,
eight narratives of indigenous legends, thirteen elicited narratives based on “Pear
Stories” (5 narratives based on a six-minute color mute film made by Wallace Chafe,
see Chafe (1980)) and “Frog Stories” (8 narratives from a sketch book by Mayer
(1980)). An example of an original data segment follows:
(2)
9. ... (1.7) m-wa:i' 'aehae' ka
AF-comeone NOM
10. ...(1.1) ma'iaeh ima h<oem>oehoe' ka siri'
person ASP <AF>pull ACC goat
11. ... may hiza
pass.by[AF] there
108 Zhemin Lin, Li-May Sung, I-wen Su
12. ...(1.9) ilahiza kabih
move.to.that.place side
“(The man pulling a goat) passed by this way and went that way.” (Pear 3:9-
12)
Spoken corpus, in contrast to written corpus, is composed of utterances shorter or
equal to sentences, which are transcribed according to certain criteria, such as turn-
taking, pause, and ruptures in intonation contours of monologue (Tao 1996:35). Fig.
1. shows a unified intonation contours in a praat3 window.
Fig. 1. A unified intonation contour
When a corpus is transcribed, tagged and analyzed, one needs to look for a means
to make it accessible to the public. An integrated platform to store, to represent and
to lower the technological boundaries for further use of the collected data is thus
necessary. With the insufficiencies of the Sinica archive in mind, normalization,
accessibility and interoperability are emphasized in the design of our system. The
following guidelines are thus proposed.
3 praat is a programmable phonetic analyser written by Paul Boersma and David Weenink, Institute of Phonetic Sciences, University of Amsterdam. It is licensed in GNU Public License, with the courtesy of their outrageous work and the free software. Cf. http://www.fon.hum.uva.nl/praat/
Design of a Multimedia Corpus of Austronesian Linguistics 109
(3) Guidelines of the integrated platform
(a) Easy to customize for most Austronesian languages
(b) Standardized procedures of transcription, annotation and process
(c) Automatic extraction of morphosyntactic information to reduce
repetition of human labor
(d) Web-based, unified input/output interface
(e) Searchable corpus that fit the needs of both linguists and the public
(f) Multimedia representation of collected texts
(g) Interoperable with other systems
(h) Cross-platform, operating system independent
Below is a description of the input, processing and output of our system design.
3.1 Standardization of text commitment and standards of committed texts
The standardization comprises the procedure of handling transcribed texts, the tran-
scription itself and morphosyntactic and discoursal codes used in the transcription.
The procedure to handle collected texts is designed with low coupling in order to re-
duce complexity. Therefore, the dependence in human manipulation in the system is
almost uni-directional, as can be seen in Figure 2. Whenever a spoken text is collec-
ted, some worker transcribes it. Once the transcription is complete, it is given to the
database maintainer for processing and storage. The web interface shows the corpus
in the database, so that people on the other end of Internet can browse and search the
corpus.
110 Zhemin Lin, Li-May Sung, I-wen Su
Fig. 2. Use cases of the system
The transcription follows Du Bois (1993), a de facto standard in the linguistic
community. Word glosses and annotations follow a standardized coding list inherited
from conventional mark-ups (cf. Appendix A) and the Leipzig glossing rules4. A
standard operation is also set for the database maintainer to handle fieldwork
collections as shown in Figure 3.
4 http://www.eva.mpg.de/lingua/files/morpheme.html
Design of a Multimedia Corpus of Austronesian Linguistics 111
Fig.3. Standard operation of text commitment
The corpus in our system is stored in Unicode (UTF-8 encoding) for potential
need of IPA, Japanese, and annotations in other languages. If some of the tribes
decide to adopt non-ASCII letters, such as “ ɖ ʈ ɼ ɫ ʔ ”, into their writing systems, the
programs can process them correctly with no need of modification. As Unicode
BOM (byte-order-mark, U+FEFF) is appended in the beginning of the file in
Microsoft Windows and is absent in Unix-based systems, the mark may cause a
potential problem in reading files edited in different operating systems. It is properly
dealt with in order to fulfil the criteria of platform-independence.
A set of metadata is defined in the head of committed files. An example of
the head of a committed text is given in (4) and the description of the fields is shown
in Table 1.
(4)
Topic: Pear story
112 Zhemin Lin, Li-May Sung, I-wen Su
Type: Narrative
Language: Kavalan
Dialect: Xinshe
Speaker: Imui, 潘金妹, F,1952
Time: 00:01:15
Total IUs: 31
Collected: 2003-05-30
Revised: 2003-11-11
Transcribed by: 葉俞廷, 王以勤Double checked: 鍾曉芳,沈嘉琪 ,葉俞廷
Table 1. Metadata of committed dataField name Description Format
Topic Topic of text String (e.g., Pear Story)
Type Style of text Narrative|Conversation|...
Language Language of text String, first letter in capital
Dialect Dialect or district String
Speaker Base data of the informant Native/Chinese name, Gender, Age
Time Length of recording hh:mm:ss
Total IUs Number of IUs in text Numeric
Design of a Multimedia Corpus of Austronesian Linguistics 113
Field name Description Format
Collected Date of recording yyyy-mm-dd
Revised Date of latest revision yyyy-mm-dd
Transcribed by Transcribers and annotaters Comma separated string
Double checked Inspectors of text Comma separated string
The text following the metadata is described below.
(5)
5. [IU #, with a period in the end]
.. qay- .. qay-byabas 'nay ,_ [words separated by spaces]
QAY-guava that [English gloss separated by spaces]
QAY-芭樂 那 [Chinese gloss separated by spaces]
6.
... razat 'nay nani.\
person that DM
人 那 DM
#e That person picked guavas. Then,
#c 那個人採芭樂。然後,#n Elicitaion notes
#n (More elicitation notes)
Lines beginning with a sharp (#) are processor instructions (PI). “#e” indicates a
114 Zhemin Lin, Li-May Sung, I-wen Su
line of English translation of a paragraph composed of the IUs from the last
translation to the current one. “#c” marks a Chinese translation, and “#n” is
elicitation notes. It is possible to have more than one note. The alignment of native
words and glosses is automatically done. Morpheme boundaries, morphological
information and word senses are extracted using the techniques introduced in Lin
(2005: Chapters 2 and 4.2).
As the transcription is supposed to more or less reflect actual pronunciation of an
informant, spelling may vary slightly from word to word. For the system not to be
confused by these variations, a feature vector is configured for each formosan
language. A vector describes how to reduce the variants into a simpler form. For
example, the pronunciation of a and ae is quite similar in Saisiyat, and glottal-stops
are sometimes omitted. 'aehae' “one” is usually spelled 'ahae or aehae. Below are
feature vectors of Saisiyat and Kavalan.5
Saisiyat: ae → a, oe → o, S → s, ' → ∅Kavalan: th → l, d → l, ' → ∅
A string substitution is executed before any operation in the database in order to
prevent possible duplicated entries; otherwise full-text search may fail to work.
3.2 Database design
Database design affects the efficiency in search and storage. For simplifying
programming logic and high-speed query, we proposed a schema that differs from the
Sinica archive. Every relational database engine that follows the SQL92 standard can
be used in the implementation of the schema. SQLite6, among relational database
systems, is recommended for the following reasons:
5 Kavalan is an Austronesian language spoken in Hualien County, east Taiwan.6 http://www.sqlite.org
Design of a Multimedia Corpus of Austronesian Linguistics 115
1. It is light-weight, fast and platform independent.
2. A database is stored in a single file, thus is easy to maintain.
3. It supports UTF-8 encoding.
4. It is a free software.
One formosan language is placed in one database and is thus stored in a single
file. The schema of every language should be the same, therefore cross-linguistic
search can be executed in a single page. It is often argued that a database has to be
normalized to the third level.7 To be realistic, our system is designed for the sake of
efficiency. The relational diagram of tables in the database is shown in Figure 4. A
full list of database schema is given in Appendix B.
7 There is a good tutorial about database normalization at http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
116 Zhemin Lin, Li-May Sung, I-wen Su
Fig.4. Relational diagram of tables in the database
The text is mainly stored in Table “iu”. In contrast to the word-based design in
the Sinica archive, every intonation unit is stored in one row. For example,
article : pear3
nat : ...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw
sim : . ima homangaw kasnaitol ray kahoy babaw
eng : . Asp set_a_ladder-AF move_up-AF Loc tree above
For a full-text search, a simple query of “%keyword%” to every field listed
above returns the correct results. The simplified spelling is stored for searching
among spelling variants. Words in the database are separated by a single space, so
Design of a Multimedia Corpus of Austronesian Linguistics 117
that they are easily processed in programs by a single function (explode () in PHP and
split () in Python). Places where no gloss is available are occupied by a period (“.”);
thus, words and glosses are always aligned across the fields.
Another specialized data structure is designed in Table “lemma”. In order to
properly search an affix, the stem is marked for every word in the dictionary. The
morpheme before the stem is a prefix and the one after it a suffix. For example,
Saisiyat kapapama'an 'bicycle' is stored as ka-#papama'#-an in the table. If one looks
for a prefix ka- or a suffix -an, one can always obtain the right answer by taking the
elements before the first sharp (#) or after the second sharp. Since infixation is simple
in the two languages, it is currently analyzed on the fly by external programs.8
3.3 Back-end programs and the POS-tagger
Database maintainer commits a pre-processed transcription into the database through
a batch of back-end programs. Commitment is preferably done in the command-line,
so that mismatches in alignment or failure of automated morphological analysis may
be corrected immediately and interactively. A prototype is implemented to prove the
workability of the system. Here is a list of programs.
features.py
defines language-specific feature-vectors and provides connection DSN.
simplify.py
is the common library for reducing spelling variants.
canon.py
checks input validity, including metadata and text format. It writes the data
into the database when the check passes.
extractmorph.py
defines morphological and discoursal codes and extracts them from the texts.
8 After the corpus complies with the Leipzig glossing rules, infixation will be marked by < and >.
118 Zhemin Lin, Li-May Sung, I-wen Su
makedict.py
extracts information from imported texts and updates the dictionary.
mp3splt.py/mpgsplt.py
splits .mp3 / .mpg files according to the time-file (see below).
tidy.py
utility to convert Chinese punctuation into ASCII and remove unnecessary
Microsoft Word mark-ups.
The coupling of the modules is fairly low. “features.py” and “simplify.py”
provide the necessary functions for all programs.
As texts have been put into the database, they are tagged by a TBL tagger (cf. Lin
2005: Chapter 2), and the dictionary is updated at the same time. When a user looks
up a word, the part-of-speech information can be obtained along with its frequency in
the corpus. Any time the database maintainer finds an error in the tagged corpus, it
can be corrected on-line as an immediate feedback to the tagger. The tagger can later
be retrained by a single click.
3.4 Unified output interface
For the corpus to be accessible to the public, a unified user-friendly interface is built.
The system follows HTML 4.01 (loose) proposed by the World-Wide Web
Consortium9 and is designed to be browsed with a browser, because this is one of the
major means to access data from the Internet. For a dynamic and interactive
representation, the Document Object Model (DOM)10 and JavaScript 1.2 are
preferably used. Popular browsers, such as Internet Explorer 5.0, Mozilla 1.7, Firefox
0.9 and Opera 4, are compliant to these standards. It is important to support major
browsers for the purpose of accessibility. Figure 5 is a screen dump of the web site
under construction.
9 http://www.w3.org/TR/REC-html40/10 http://www.w3.org/DOM/
Design of a Multimedia Corpus of Austronesian Linguistics 119
Fig.5. Screen dump of the web site (under construction)
The interface is composed of the following parts: a window with the informant's
photo where movie clips are played, a list of metadata in the upper-left corner, several
switches to adjust browsing effects and a frame in the bottom of the screen to dump
the selected article in a format following linguistic convention. A dictionary is
popped-up anytime when a user clicks on an unknown word (see Figure 6). The
pages are being revised for a better visual effect.
Ethnological notes and examples are preferably given in the dictionary with
cross-reference. The design for an interface for searching is simple, yet complicated
and special linguistic needs are still possible. For example, by typing tabatathan a
user can find the occurrences of the Kavalan word ta-batad-an; typing 'ahae or aehae
120 Zhemin Lin, Li-May Sung, I-wen Su
results in 'aehae' for Saisiyat, and so on. Interfaces to user-defined functions (UDF)
are also kept for further improvement.
Design of a Multimedia Corpus of Austronesian Linguistics 121
Fig. 6. Pop-up dictionary with cross-references
As the bandwidth is quite limited, it is suggested that multimedia data are stored
and transferred in the formats of 16Kbps 11kHz MPEG-1 layer 3 for audio data and
MPEG-1 for video data.
3.5 Interoperability
It is important to share the corpus with the linguistic community. The Extensible
Mark-up Language (XML)11 is a simple and flexible language used to exchange data
between different systems. It is now a de facto standard on the web. For researchers
of natural language processing to easily profit from our collected data, the corpus
should be able to be exported in XML. Morphological information, gloss and part-of-
speech of every word may be output in a uniform manner. An exported format is
given below.
11 http://www.w3.org/XML/
122 Zhemin Lin, Li-May Sung, I-wen Su
<?xml version="1.0" encoding="utf-8" ?>
<article id="pear_imui">
<topic>Pear Story</topic>
<language>Kavalan</language>
<dialect>Xinshe</dialect>
<speaker>
<natname>imui</natname>
<chnname>潘金妹</chnname>
<gender>F</gender>
<age-of-record>51</age-of-record>
</speaker>
<duration>00:01:15</duration>
<total-iu>31</total-iu>
<collected>2003-05-30</collected>
<revised>2003-11-11</revised>
<transcriber>葉俞廷</transcriber>
<transcriber>王以勤</transcriber>
<doublecheck>鍾曉芳</doublecheck>
<doublecheck>沈嘉琪</doublecheck>
<doublecheck>葉俞廷</doublecheck>
<text>
<iu id="iu_1">
<word>
<nat>tangi</nat>
<sim>tangi</sim>
<eng>today</eng>
<chn>今天</chn>
<pos>RB</pos>
</word>
Design of a Multimedia Corpus of Austronesian Linguistics 123
<word>
...
</word>
</iu>
<iu id="iu_2"> ... </iu>
...
<para von="1" bis="4">
<eng>I just saw a person there ...</eng>
<chn>我剛剛看到 ...</chn>
<notes>Some elicitation notes</notes>
</para>
...
</text>
</article>
4 Conclusive Remarks
The online version of NTU corpus of Austronesian languages is still under
construction and more texts are to be added. The adaptation of the Leipzig glossing
rules will be adopted in the near future. As normalization, accessibility and
interoperability are emphasized for the system, it should be useful and helpful for
linguists, teachers and even native speakers of Austronesian languages. It is hoped
that our work could contribute to the language communities. The system is
extendible for the processing of other languages once the proper feature vector is set.
The implementation is still on its experimental stage. As Saisiyat and most Formosan
languages are on the verge of being endangered, more people are urged to participate,
to use and to promote the enlargement of the corpora.
124 Zhemin Lin, Li-May Sung, I-wen Su
Appendix A. Coding Lists
Table 2. Morphological coding list
English code Chinese code Description
1SG 1SG 1st person singular
2SG 2SG 2nd person singular
3SG 3SG 3rd person singular
1IPL.NOM 1IPL.主格 1st person plural, Inclusive, Nominative
1EPL.NOM 1EPL.主格 1st person plural, Exclusive, Nominative
1PL 1PL 1st person plural
2PL 2PL 2nd person plural
3PL 3PL 3rd person plural
ACC 受格 Accusative
AF 主焦 Agent Focus
ASP 動貌 Aspect
AUX 助動詞 Auxiliary
BC BC Back Channel / Reactive Token
BF 予焦 Benefactive Focus
CAU 使役 Causative
CLF 量詞 Classifier
CLF.HUM 人量詞 Human Classifier
CLF.NHUM 非人量詞 Non-human Classifier
COM ? Comitative
COMP 補語詞 Complementizer
COND 條件詞 Conditional Marker
DAT 予格 Dative
Design of a Multimedia Corpus of Austronesian Linguistics 125
DEF 定指 Definite
DET 限定詞 Determiner
DIST 遠距 Distal
DM DM Discourse Marker
EXCL 排除 Exclusive
EXIST 存在 Existential
EXPER 經驗 Experiential
FIL FIL Pause Filler
FS FS False Start
FUT 未來 Future
GEN 屬格 Genitive
IF 工焦 Instrumental Focus
IMP 祈使 Imperative
INCL 包含 Inclusive
INDF 不定指 Indefinite
INS 工具格 Instrument
INT 感嘆 Interjection
INVIS 不可見 Invisible
IRR 非實現 Irrealis
LF 處焦 Locative Focus
LNK 連詞 Linker
LOC 處格 Locative
NCM Ncm Non-common Name Marker
NEG 否定 Negative
NEU 中性格 Neutral
NMZ 名物化 Nominalizer/Nominalization
NOM 主格 Nominative
NRFUT 即將 Near Future
126 Zhemin Lin, Li-May Sung, I-wen Su
OBL 斜格 Oblique
PF 受焦 Patient Focus
PFV 完成 Perfective
PN 人名/地名 proper name/place name
POSS 所有格 Possessive
PROG 進行 Progressive
PROX 近距 Proximal/Proximate
Q 疑問 Question Marker
QUOT QUOT Quotative
REC 交互 Reciprocal
RED 重疊 Reduplication
REL 關係詞 Relativizer
REFL 反身 Reflexive
RF 指焦 Referential Focus
TOP 主題 Topic
VIS 可見 Visible
VOC 呼格 Vocative
X X Uncertain Hearing
TU TU
mazmun many.HUM many (humans)
mwaza many.NHUM many (animals)
this 這個that 那個
Table 3. Discourse coding list (adopted from Du Bois (1993))
Meaning Marker
Units
Intonation Unit ((newline))
Design of a Multimedia Corpus of Austronesian Linguistics 127
Meaning Marker
Truncated IU --
Word ((space))
Truncated word -
Speaker identity / turn start :
Speech Overlap [ ]
Transitional Continuity
Final .
Continuing ,
Appeal ?
Terminal Pitch Direction
Fall \
Rise /
Level _
Accent and Lengthening
Primary accent ^
Secondary accent `
High booster !
Low booster ;
Lengthening = =
Tone
Fall \
128 Zhemin Lin, Li-May Sung, I-wen Su
Meaning Marker
Rise /
Fall-Rise \/
Rise-fall /\
Level _
Pause
Long ...(N)
Medium ...
Short ..
Latching 0
Vocal Noises
Vocal noises (CAPITAL LETTERS)
Inhalation (H)
Exhalation (Hx)
Glottal stop %
Laughter @
Quality
Quality <Y Y>
Laugh quality <@ @>
Quotation quality <Q Q>
Phonetics
Phonetic / phonemic transcription (/ /)
Design of a Multimedia Corpus of Austronesian Linguistics 129
Meaning Marker
Transcriber's Perspective
Researcher's comment (( ))
Uncertain hearing <X X>
Indecipherable syllable X
Specialized Notations
Duration (N)
IU boundary &
Accent unit boundary |
Embedded IU <| |>
Restart {Capital Initial}
False start < >
Code switching <L2 L2>
Nontranscription line $
Reserved Symbols
Phonetic / orthographic symbols '
Morphosyntactic coding + * # { }
User-definable symbols " ~
Appendix B. Database Schema
Table meta: metadata of text
130 Zhemin Lin, Li-May Sung, I-wen Su
Field name Format Description Example
article varchar(80) Filename pear_imui
topic varchar(80) Text name Pear Story
texttype varchar(40) Text style narrative
language varchar(40) Text language Kavalan
dialect varchar(40) Dialect or district Xinshe
spknat varchar(80) Native name of Informant imui
spkhan varchar(80) Chinese name of Informant 潘金妹spkgdr char(1) Gender of Informant (M|F) F
spkage integer Age of Informant in time of recording 51
duration time Length of the recording 00:01:15
totaliu integer Number of intonation units 31
collected date Date of record 05/5/30
revised date Date of last revision 03/11/11
transcr blob Comma separated names of
transcribers
A, B, C
dblchk blob Names of people who double check the
text
D, E, F
Table iu: storage of a single intonation unitField name Format Description
article varchar(80) Text name (foreign key of meta.article)
no integer IU #
Design of a Multimedia Corpus of Austronesian Linguistics 131
Field name Format Description
nat blob Native words, space separated
sim blob Simplified native words
eng blob English gloss, space separated
chn blob Chinese gloss, space separated
Table para: translation of a block of intonation unitsField name Format Description
article varchar(80) Text name (foreign key of meta.article)
von integer IU# where the block begins
bis integer IU# where the block ends
eng blob English translation
chn blob Chinese translation
note blob Elicitation notes (multi-line, separated by two semi-
colons)
Table dict: the dictionaryField name Format Description
word varchar(80) Word of index (simplified word)
lemma blob Word forms, comma separated
eng blob English gloss with morphological marks
chn blob Chinese gloss with morphological marks
note blob Notes (multi-line, separated by two semi-colons)
ex blob Example of how the word is used
132 Zhemin Lin, Li-May Sung, I-wen Su
Table lemma: analyse of a lemmaField name Format Description
word varchar(80) Lemma of a word (foreign key to an element of
dict.lemma)
lemma varchar(80) Prefix-#stem#-suffix
enmorph varchar(255) Morphological marks in English, comma separated
zhmorph varchar(255) Morphological marks in Chinese, comma separated
enstem varchar(255) Sense of the stem in English, comma separated
zhstem varchar(255) Sense of the stem in Chinese, comma separated
Table affix: dictionary of affixesField name Format Description
affix varchar(20) Affix (prefix-, -infix-, -suffix)
englos varchar(255) Morphological analyse in English
zhglos varchar(255) Morphological analyse in Chinese
note blob Notes
Table xref: cross-reference of a wordField name Format Description
word varchar(80) Word (simplified)
xref blob Article:IU#.number_of_word
Design of a Multimedia Corpus of Austronesian Linguistics 133
References
Chafe, Wallace L. ed. 1980. The pear stories: Cognitive, cultural, and linguistic
aspects of narrative production. Norwood, NJ: Ablex Publishing Corp.
Du Bois, J. W. 1993. Talking data: Transcription and coding in discourse research,
chapter Outline of discourse transcription, 45-89. NJ: Hillsdale: Lawrence
Erlbaum Associates.
Huang, Shuan-Fan, Lily I-wen Su, and Li-May Sung. 2003. Syntax and cognition in
SaiSiyat. NSC 93-2411-H-022-094.
Lin, Zhemin. 2005. Automatic processing of languages with small-scaled corpus:
Part-of-speech tagging and partial parsing saisiyat and applications. Master's
thesis, National Taiwan University.
Luhn, H. P. 1960. Keyword-in-context index for technical literature (kwic index).
American Documentation 11:288-295.
Mayer, Mercer. 1980. Frog, where are you?. NY: Dial Books.
Tao, Hongyin. 1996. Units in mandarin conversation: Prosody, discourse and
grammar. Amsterdam: John Benjamins.
Zeitoun, Elizabeth, Ching hua Yu, and Cui xia Weng. 2003. The formosan language
archive: Development of a multimedia tool to salvage the languages and oral
traditions of the indigenous tribes of taiwan. Oceanic Linguistics 42(1):218-
232.
Zeitoun, Elizabeth, and Ching-Hua Yu. 2005. The formosan language archive:
Linguistic analysis and language processing. Computational Linguistics and
Chinese Language Processing 10(2):167-200