The development of corpus linguistics in Chinese context
Richard Xiao
-
An overview
• Taking a historical approach to the development of Chinese corpus linguistics (CCL), where appropriate in contrast to English corpus linguistics (ECL)
• Highlighting the key points and unique challenges in the development
• Identifying possible fruitful avenues of development where ECL and CCL can inform and learn from each other
• Three areas of research deeply influenced by corpora
– Lexicography
– Descriptive grammars
– Interlanguage analysis
-
Corpus revolution in lexicography
• Earlier corpus-informed lexicographic studies
– Thorndike’s (1921) The Teacher’s Word Book
– The 1st edition of American Heritage Dictionary (1969)
• Real starting point of corpus-based lexicography: Sinclair’s COBUILD in 1980
– Providing data, ideas and analyses for Collins, to help them develop a new corpus-based dictionary (the Collins COBUILD dictionary, 1987)
• Frequency, collocation, authentic illustrative examples, and contextual and genre variation are all forms of data which now commonly appear in corpus-based dictionaries
– Longman Dictionary of Contemporary English (LDOCE, 3rd edition)
– Oxford Advanced Learner’s Dictionary (OALD, 5th edition)
– Cambridge International Dictionary of English
– Macmillan English Dictionary
-
Corpora in Chinese lexicography
• The first study of Chinese character frequency in a modern sense dates back to the 1920s
– Li Jinxi (1922): ‘Statistical study of basic vocabulary in Chinese’
• Chen Heqin (1922, 1928) The Applied Glossary of Modern Chinese (《语体文应用字汇》)
– A paper-based corpus of diverse sources amounting to well over 0.5 M Chinese characters
– Taking Chen and 9 assistants nearly 3 years
– A list of 4,261 most frequently and widely used Chinese characters
– Later revised and republished as a booklet by the Commercial Press in 1928
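A list of "most frequently and widely used" characters rests on two counts per character: how often it occurs overall, and in how many different sources it occurs. The sketch below is only an illustration of that idea on a computer corpus, not a reconstruction of Chen's manual procedure. The function name `character_profile`, the toy `SOURCES` texts, and the ranking rule are all assumptions made for the example.

```python
from collections import Counter


def character_profile(sources):
    """Return (frequency, spread) counters for Chinese characters.

    frequency: total occurrences of each character across all sources.
    spread: number of distinct sources in which each character occurs.
    """
    frequency = Counter()
    spread = Counter()
    for text in sources.values():
        # Keep only characters in the basic CJK Unified Ideographs block.
        chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
        frequency.update(chars)
        spread.update(set(chars))  # each source counts at most once
    return frequency, spread


# Toy data, invented for illustration only.
SOURCES = {
    "newspaper": "我们的学校很大",
    "novel": "我的朋友在学校",
    "textbook": "他们的书",
}

frequency, spread = character_profile(SOURCES)

# Hypothetical ranking rule: widest spread first, then highest frequency.
ranked = sorted(frequency, key=lambda ch: (-spread[ch], -frequency[ch], ch))
```

Ranking by spread before frequency is one possible design choice; it keeps a character that is very frequent in a single source from outranking one that is used across many sources.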
-
Corpora in Chinese lexicography
• Chen’s (1922, 1928) frequency list of Chinese characters was influenced by Thorndike’s (1921) English word list
• But the contribution of Chen’s list to Chinese is more significant than the contribution of Thorndike’s (1921) list to English, because the former has not only contributed to primary education and the promotion of literacy in China but has also helped to shape present-day Chinese
– Phonetic language vs. script language
– Timeliness of Chen’s character list
-
Corpora in Chinese lexicography
• Chen’s list is the forerunner of today’s word frequency lists and frequency dictionaries of Chinese derived from computer corpora
• Since the founding of P. R. China, the central government and local authorities have also published a range of lists of Chinese words and characters
– Register of Common Characters (MoE 1950): 1,017 characters
– List of Common Characters (MoE 1952): 2,000 characters
– List of Common Characters in Putonghua Common Speech (Shandong Provincial Commission of Education 1958): 3,000 characters
– Three Thousand Common Words in Putonghua Common Speech (Committee of Language Reform 1962)
– A List of Four Thousand Words for Foreign Students (BLCU 1964)
– List of Common Used Characters (Beijing Municipality Commission of Education 1965): 3,100 characters
-
Corpora in Chinese lexicography
• With the rapid development of corpus linguistics in general and Chinese language processing in particular, the long-standing tradition of studying word and character frequency in Chinese linguistics has been continued into the new millennium
– Liu (1973): Frequency Dictionary of Chinese Words
– Project Code 748 (1976): A Comprehensive Frequency Table of Character Usage in Modern Chinese
– Beihang University (1985): A Frequency Table of Character Usage in Modern Chinese
– BLCU (1986): A Frequency Dictionary of Modern Chinese
– National Language Committee (1988): Commonly Used Characters in Modern Chinese
– Hong Kong Polytechnic University (1991-1997): A Chinese Word Bank from Mainland China, Taiwan, and Hong Kong
– HSK Committee (1992, 2001): The HSK Lexical Syllabus
– Xiao, Rayson and McEnery (2009): A Frequency Dictionary of Mandarin Chinese
-
Roles of corpora in compiling New Word Dictionary for Chinese as a Foreign Language (Cui 2011: 85)
-
Corpora in Chinese lexicography
• The study of neologism is an important area of lexicography which can benefit greatly from the corpus approach
– Corpora can provide the necessary sources of data as well as the method for reasonably identifying new words or new meaning / usage of existing words
• A Dictionary of New Words in Modern Chinese (Kang 2003)
– 20,000 new words that have gained currency and remained relatively stable in 1978-2000
– Based on a huge corpus composed of over 25 years’ archive data of some major newspapers and magazines
• The Global Dictionary on Chinese Neologism (Tsou & You 2010)
– 1,600 Chinese neologisms that have entered the Chinese language since 2000
– Based on the 400-M character LIVAC corpus specifically designed to monitor language development in Chinese speech communities
-
Corpora in Chinese lexicography
• Parallel corpora in bilingual lexicography
• Definition usually in the target language
• Only partially equivalent to the headword
• An abstract generalisation of the typical meanings of the word, hard to cover all of its meanings fully
• Bilingual examples cited from parallel corpora can complement missing meanings
-
Corpora in Chinese lexicography
• Specialised bilingual dictionaries: the definition and translation of the domain-specific special usage of ordinary words in particular domains
– E.g. in the business domain, the concept of 表 ‘table, form’ is conventionally expressed as ‘statement’ instead of ‘table, form’: ‘financial statement’ (财务报表), and ‘statement of income and expenses’ (财务收益与费用表)
• Such issues readily addressed with the help of parallel and comparable corpora of the languages involved
• Specialised corpora are recognised as ideal linguistic and knowledge resources in lexicography
– Corpus-based specialised dictionaries can ensure a systematic coverage of useful headwords of practical value, accurate definitions, and appropriate authentic examples
-
ECL vs. CCL in lexicography
• The use of corpora in Chinese lexicography dates back as far as it does in English lexicography
– Importance of defining basic Chinese characters and words from a huge lexicon
• Challenges not encountered in ECL
– ‘Word segmentation’ or ‘tokenisation’, usually requiring complex computer processing
– Lack of strict correspondence between word classes and syntactic functions
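Why segmentation needs complex processing can be seen from a deliberately simple baseline. The sketch below uses greedy forward maximum matching, a textbook dictionary-lookup method chosen here purely for illustration; it is not the method of any tool or project mentioned in these slides. The function name `forward_max_match` and the toy `LEXICON` are assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text left to right, always taking the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, invented for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源", "的"}

# Intended reading: 研究 / 生命 / 的 / 起源 ('study the origin of life').
segmented = forward_max_match("研究生命的起源", LEXICON)
```

With this lexicon the greedy matcher picks 研究生 ‘postgraduate’ first and strands 命, giving 研究生 / 命 / 的 / 起源 rather than the intended reading. Written English, with spaces between words, does not pose this particular ambiguity.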
-
ECL vs. CCL in lexicography
• Common issues to be solved in corpus-based lexicography for both English and Chinese
• Balance and representativeness of corpora
– Usually very large (BoE, LIVAC, Gigaword English, Gigaword Chinese)
– Little or no spoken data
– Too much dependence on newspaper or newswire text
– A time lag of the corpus data behind the actual language change
• Accuracy of corpus annotation
– Semantic annotation or WSD, including word senses belonging to the same word class
• A range of learner dictionaries have been published for both English and Chinese as the two languages with the largest numbers of speakers in the world
-
ECL vs. CCL in lexicography
• Corpus-based English lexicography appears to have been confined largely to monolingual English corpora
– A consequence of English monolingualism?
• In contrast, CCL has helped to create both monolingual Chinese dictionaries and a range of bilingual dictionaries of Chinese with a foreign language like English, French, German and Japanese
• The CCL experience with bilingual lexicography suggests that this presents a fruitful avenue of development for ECL
-
ECL vs. CCL in lexicography
• The advantages of using corpora in dictionary making, whether monolingual Chinese dictionaries or bilingual dictionaries, are self-evident
• Corpora are a double-edged sword: if used inappropriately, they simply mean labour lost; and even worse, they can lead to falsehood under a scientific and objective disguise (cf. Li 2008: 203)
– Sinclair (2004b: 2): “A corpus is not a simple object, and it is just as easy to derive nonsensical conclusions from the evidence as insightful ones.”
-
Corpus-based descriptive grammars
• The development in ECL has redefined what a grammar is, leading not only to better English grammars, but also a deeper awareness of the very real differences between the grammar of speech and writing
• 3 generations of English reference grammars of Quirk’s SEU tradition
– Grammar of Contemporary English (Quirk et al. 1972)
– Comprehensive Grammar of the English Language (Quirk et al. 1985)
– Longman Grammar of Spoken and Written English (Biber et al. 1999)
• Carter and McCarthy’s spoken English grammar (CANCODE)
– Cambridge Grammar of English (Carter & McCarthy 2006)
-
Corpora in descriptive Chinese grammars
• The development of CCL in Chinese descriptive grammars is still lagging far behind
• Research in corpus-based descriptive grammars in Chinese is rather sporadic and fragmentary
• Largely focused on specific linguistic features of interest to individual researchers
-
Corpora in descriptive Chinese grammars
• Huang & Ahrens (2003) study the relationship between nouns and nominal classifiers in Mandarin Chinese based on the data from the Academia Sinica Balanced Corpus of Modern Mandarin Chinese
• Zhao (2010) studies the grammatical meanings and usage contexts of the function word 来着 laizhe in the 610 valid instances of the word in the PKU corpus of modern Chinese
• Zhang (2010) is concerned with a syntactic and pragmatic analysis of a commonly used degree complement structure in Chinese, ‘X得很’, on the basis of the PKU corpus
-
Corpora in descriptive Chinese grammars
• Siewierska, Xu & Xiao (2010)
– A corpus-based study of splittable compounds (离合词) in interaction with morphology, syntax, and pragmatics, with the aim to produce a systematic and realistic account of splittable compounds as attested in two million words of authentic spoken and written Chinese data (LCMC, LLSCC)
• Wang & Wang (2009)
– Approaching splittable compounds from a pedagogical perspective by discussing the implication of their research findings based on the PKU Chinese corpus for teaching Chinese as a foreign language
-
Corpora in descriptive Chinese grammars
• Xiao, McEnery & Qian (2006)
– A systematic account of passive constructions in Chinese in contrast with English, covering a range of characteristics of passives in the two languages
• Xiao & McEnery (2008)
– Exploring negation in Chinese on the basis of spoken and written Chinese corpora
• Xiao & McEnery (2004)
– First book-length corpus-based comprehensive account of aspect in Mandarin Chinese
-
Corpora in descriptive Chinese grammars
• Apart from the corpus studies of specific linguistic features in Chinese reviewed so far, there is hardly any descriptive grammar of Chinese based on or informed by corpora
• Xiao & McEnery (2010) provides the first book-length corpus-based contrastive study of major grammatical categories that contribute to aspectual meaning in Chinese and English
– E.g. aspect markers, completive and durative temporal adverbials, quantifiers, passives, and negation all contributing to aspectual meaning by interacting with situation aspect or viewpoint aspect in one way or another
• This book is a research monograph more than a descriptive grammar of Chinese for general reference or pedagogical use
-
Corpora in descriptive Chinese grammars
• Knowledge bases or electronic dictionaries of Chinese grammar developed at PKU for use in automatic Chinese information processing
– Specifications for Basic Processing of Contemporary Chinese Corpus at Peking University (Yu & Duan 2002)
– Grammatical Knowledge Base of Contemporary Chinese (Yu 2003); Grammatical Knowledge Base of Chinese High Frequency Words (Zhu et al. 2004)
– Chinese Function Word Usage Knowledge Base (Liu et al. 2005)
– Modern Chinese New Words Information Electronic Dictionary (Kang 2002)
• Providing useful grammatical information about Chinese in NLP, but not descriptive grammars in the sense that a reference grammar is expected to be
-
ECL vs. CCL in descriptive grammars
• A huge gap in research in descriptive grammars between English and Chinese
– Three generations of systematically and substantially corpus-based English reference grammar
• Quirk et al 1972, Quirk et al 1985, Biber et al 1999
– Corpus-based Chinese descriptive grammars rather sporadic and fragmentary
• The first comprehensive and data-driven descriptive grammar yet to be published: Cambridge Chinese Reference Grammar
-
ECL vs. CCL in descriptive grammars
• The sharp contrast may be due to the fact that the development of corpus linguistics started with the English language, causing corpus-based research of Chinese grammar to lag behind in some areas
• But a more important reason why corpus-based Chinese descriptive grammar lags so far behind is the separation between Chinese corpus research and linguistic research in the Chinese context
-
ECL vs. CCL in descriptive grammars
• In China members of the CLSC are almost exclusively foreign language teachers at universities
– More interested in ECL than CCL
• Those working with Chinese corpora are usually computer specialists and computational linguists
– More interested in NLP technologies than linguistic theorisation
• Cooperation and collaboration between these two groups of scholars, and therefore between their corpus building and analysis expertise and linguistic knowledge, would substantially facilitate the healthy development of CCL in the right direction
– An important area where CCL can learn from ECL
-
Learner English corpora
• One of the most exciting developments in corpus linguistics that can be used to inform teaching directly or indirectly
• The ICLE launched by Granger in 1990
– To date containing over 4.5 million words arranged in 16 subcorpora, each for a distinct L1 background
– LOCNESS: a comparable corpus of L1 English materials
• Louvain International Database of Spoken English Interlanguage (LINDSEI)
– Currently containing only 100,000 words by 50 speakers with a French L1 background, but being expanded by a number of international teams
– LOCNEC: a comparable control corpus for exploring LINDSEI
-
Learner English corpora
• Other learner English corpora
– The Uppsala Student English corpus (Swedish L1)
– The JEFLL corpus (Japanese L1)
– The Standard Speaking Test corpus (Japanese L1)
– The HKUST Corpus of Learner English (Cantonese L1)
– The Polish Learner English Corpus (Polish L1)
– The Chinese Learner English Corpus (Chinese L1)
– Chinese Learners’ Spoken English Corpus (Chinese L1)
– Spoken and Written English Corpus of Chinese Learners (Chinese L1)
– Parallel Corpus of Chinese EFL Learners (Chinese L1)
– Corpus of English Majors (Chinese L1)
– Longman Learner’s Corpus (various L1’s)
– Cambridge Learner Corpus (various L1’s)
-
Learner English corpora
• The scope and systematic nature of work with learner corpora today allows for a much more wide-ranging and systematic exploration of learner data than the earlier studies from the 1970s
• Nowadays learner English corpus research appears to have been undertaken extensively around the world, but ironically not notably in native English-speaking countries
-
Chinese interlanguage research
• Interlanguage analysis of learner errors appears to be the focus of corpus-based Chinese teaching and learning
• Rapid development of TCFL since the mid-1990s has led to an increasingly pressing demand for Chinese interlanguage corpora to aid Chinese teaching and learning (cf. Ren 2010)
– Providing more direct and readily available help in teaching and learning (e.g. in terms of real-time error analysis, computer-aided teaching, pertinent exercises for individual learners, learning evaluation)
– Playing an increasingly important role in syllabus design, materials development, lexicography etc.
-
Chinese interlanguage research
• The Chinese Interlanguage Corpus
– The earliest corpus of learner Chinese, created at BLCU in 1993-1995
– 1,371 compositions by 740 students, 1.04m characters
– Encoded with 23 metadata features
– Annotated with POS and learner errors at character, word and sentence level
-
Chinese interlanguage research
• The HSK Dynamic Composition Corpus
– Another BLCU corpus, of over 4.24m characters
– 11,569 HSK compositions by learners of Chinese as a second or foreign language in 1992-2005
– Annotated with rich metadata and learner errors at character, word and sentence level
– Continuous data suitable for longitudinal study
– Available online: http://202.112.195.192:8060/hsk/login.asp
-
Chinese interlanguage research
• The Mandarin Interlanguage Corpus (MIC)
– Created at the University of Hong Kong
– A total of 19 participants from two groups of year 2 students taking a two-year Certificate Course in Chinese Language
– A range of L1 backgrounds: English, Korean, Japanese, German, French, Tamil, Indonesian, Spanish, Dutch, and Thai
– Written data in the form of 88 short compositions ranging from 150 to 700 characters, depending on the genre type
– Spoken data from their 1-2 minute short presentations (60 hours) during in-class conversation and the examination
– Coming with a user-friendly online interface that allows a number of search options including searching by source, word class, learner’s L1, topic of the task
-
Chinese interlanguage research
• Modern Interlanguage Chinese Corpus
– Comprising composition and sentence-making tasks: 10,135 sentences
– Collected twice per semester from years 2-4 Chinese studies students at 6 Korean universities in 2004-2006
– http://jit.jj.ac.kr:8080/corpus/index.jsp
• NTNU created a Chinese interlanguage corpus in 2004-2005
– 41,053 sentences by 210 learners of Chinese (mostly L1 English)
– http://chinese.mtc.ntnu.edu.tw/moodle/mod/forum/discuss.php?d=210
-
Chinese interlanguage research
• Learner Chinese corpora reported but not publicly available
– Ji’nan University: 3 million characters
– Nanjing Normal University: 900,000 characters
– Zhongshan University: 750,000 characters
• Corpora planned or under construction
– NTNU: learner interlanguage corpus (LIC) of Chinese written and spoken texts
– Ludong University: country-specific (L1 Korean) Chinese interlanguage corpus with a target size of over 3m characters
– Shanghai Jiaotong University: Chinese composition corpus
-
Chinese interlanguage research
• Examples of corpus-based studies analysing specific grammatical and lexical features of learners’ Chinese interlanguage
– Japanese learners’ acquisition of Chinese directional complements (Yang 2004)
– Learner errors with the negators 不 bu and 没有 meiyou (Yuan 2005a, 2005b)
– Learners’ acquisition of the Chinese 比 bi comparative structure (Wang 2005)
– Learner errors with Chinese idioms chengyu (Shi 2008)
– L1 Korean learners’ errors with the preposition 給 gei (Hua 2009)
-
Chinese interlanguage research
• A focus of TCFL research, but a number of issues with current Chinese interlanguage research
– Very few existing Chinese interlanguage corpora in comparison with learner English corpora
– Rather small corpus size compared with general corpora
– Seriously biased towards Asian learners (Korean, Japanese, Southeast Asian)
– Mostly compositions under test conditions
– Lack of spoken data
– Consistency in manual error tagging
– Public availability
– Interlanguage research focusing on or confined to error analysis, overlooking interlanguage usage patterns
-
ECL and CCL in interlanguage research
• While learner English corpus research has largely been undertaken in parts of the world other than major native English-speaking countries, research of Chinese interlanguage is highly concentrated in China
• While learner English corpus research has covered the interlanguages of learners from an extensive range of L1 backgrounds including languages both similar to and distinctly different from English, Chinese interlanguage research is highly unbalanced as regards learners’ L1 backgrounds, which are essentially limited to those of learners from East and Southeast Asian countries, with learners whose first language is English or a European language markedly under-represented at present
-
ECL and CCL in interlanguage research
• English is the most popular and most widely learned second or foreign language in the world
• Unsurprising that learner English has been studied more extensively than learner Chinese
• When CCL seeks to address the numerous issues with Chinese interlanguage corpus research as noted earlier, it clearly has a lot to learn from the ECL experience with learner English research
-
ECL and CCL in interlanguage research
• A proposal to create the International Corpus of Learner Chinese as a joint research project between a number of universities in and outside China (Cui & Zhang 2011)
• Target size of 50m characters, with an annotated written component of 20m characters and a raw text written component of 25m characters
• Also 5m characters of spoken data, with an annotated component of 2m characters and a raw text component of 3m characters
• Data to be collected from non-native Chinese learners including both Chinese majors and non-Chinese majors at beginner, intermediate, and advanced levels
-
ECL and CCL in interlanguage research
• Genres to cover narrative, argumentative, and expository types
• Task types to include homework, exam script, HSK test etc.
• Encoding of rich metadata about learners and about the text sample
• Error tagging at various levels
• Basic annotation including word tokenisation, POS tagging, sentence constituents, sentence type
• Resulting corpus to be mounted at a dedicated website to allow registered users to search online, in addition to a CD edition to be published for use offline on standalone PCs
-
ECL and CCL in interlanguage research
• Also a need to build a comparable native Chinese ‘control corpus’ to facilitate comparisons of learner Chinese with native Chinese
• How can ECL contribute to Chinese interlanguage research?
– Chinese is often taught locally in major European and American countries such as the UK and the US, where ECL has also developed most rapidly
– Corpus linguists in these areas can contribute to Chinese interlanguage research by creating corpora of learner Chinese produced by their local native students
– To complement the existing interlanguage Chinese corpora created in China, facilitating contrastive analysis of interlanguages by learners from Asia and those from Europe and America
-
Conclusions
• I hope the survey of the development of corpus linguistics in Chinese context in the three areas reviewed will contribute to the further development of CCL
• CCL and ECL clearly can inform and learn from each other, e.g. ECL experience of descriptive grammars for CCL, and CCL experience of multilingual lexicography for ECL