
The development of corpus linguistics in Chinese context

Richard Xiao

An overview

•  Taking a historical approach to the development of Chinese corpus linguistics (CCL), where appropriate in contrast to English corpus linguistics (ECL)
•  Highlighting the key points and unique challenges in the development
•  Identifying possible fruitful avenues of development where ECL and CCL can inform and learn from each other
•  Three areas of research deeply influenced by corpora
   –  Lexicography
   –  Descriptive grammars
   –  Interlanguage analysis


Corpus revolution in lexicography
•  Earlier corpus-informed lexicographic studies
   –  Thorndike’s (1921) The Teacher’s Word Book
   –  The 1st edition of American Heritage Dictionary (1969)
•  Real starting point of corpus-based lexicography: Sinclair’s COBUILD in 1980
   –  Providing data, ideas and analyses for Collins, to help them develop a new corpus-based dictionary (the Collins COBUILD dictionary, 1987)
•  Frequency, collocation, authentic illustrative examples, and contextual and genre variation are all forms of data which now commonly appear in corpus-based dictionaries
   –  Longman Dictionary of Contemporary English (LDOCE, 3rd edition)
   –  Oxford Advanced Learner’s Dictionary (OALD, 5th edition)
   –  Cambridge International Dictionary of English
   –  Macmillan English Dictionary


Corpora in Chinese lexicography
•  The first study of Chinese character frequency in a modern sense dates back to the 1920s
   –  Li Jinxi (1922): ‘Statistical study of basic vocabulary in Chinese’
•  Chen Heqin (1922, 1928) The Applied Glossary of Modern Chinese (《语体文应用字汇》)
   –  A paper-based corpus of diverse sources amounting to well over 0.5 million Chinese characters
   –  Taking Chen and 9 assistants nearly 3 years
   –  A list of the 4,261 most frequently and widely used Chinese characters
   –  Later revised and republished as a booklet by the Commercial Press in 1928
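
The kind of character frequency list that Chen compiled by hand can be sketched computationally today. The following is a minimal sketch, not Chen's procedure: it assumes the corpus is available as plain Unicode strings, counts only characters in the basic CJK Unified Ideographs block, and ranks them by total frequency and then by the number of texts they occur in (a rough stand-in for "widely used"). The function names and the toy texts are hypothetical.

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def character_frequency_list(texts, top_n=10):
    """Rank Han characters by total frequency, then by number of texts they occur in."""
    freq = Counter()    # total occurrences across the corpus
    spread = Counter()  # number of texts each character appears in
    for text in texts:
        chars = [ch for ch in text if is_han(ch)]
        freq.update(chars)
        spread.update(set(chars))
    # Ties are broken by code point so that the ranking is deterministic.
    ranked = sorted(freq, key=lambda ch: (-freq[ch], -spread[ch], ch))
    return [(ch, freq[ch], spread[ch]) for ch in ranked[:top_n]]


# Hypothetical toy corpus of three short texts.
sample_texts = ["我们学习中文", "我们喜欢学习", "中文很有意思"]
for ch, count, n_texts in character_frequency_list(sample_texts, top_n=5):
    print(ch, count, n_texts)
```

Recording the spread across texts alongside raw frequency reflects the slide's wording "most frequently and widely used"; a real list would need a much larger and more balanced corpus than this toy input.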


Corpora in Chinese lexicography
•  Chen’s (1922, 1928) frequency list of Chinese characters was influenced by Thorndike’s (1921) English word list
•  But the contribution of Chen’s list to Chinese is more significant than the contribution of Thorndike’s (1921) list to English, because the former has not only contributed to primary education and the promotion of literacy in China but has also helped to shape present-day Chinese
   –  Phonetic language vs. script language
   –  Timeliness of Chen’s character list


Corpora in Chinese lexicography
•  Chen’s list is the forerunner of today’s word frequency lists and frequency dictionaries of Chinese derived from computer corpora
•  Since the founding of P. R. China, the central government and local authorities have also published a range of lists of Chinese words and characters
   –  Register of Common Characters (MoE 1950): 1,017 characters
   –  List of Common Characters (MoE 1952): 2,000 characters
   –  List of Common Characters in Putonghua Common Speech (Shandong Provincial Commission of Education 1958): 3,000 characters
   –  Three Thousand Common Words in Putonghua Common Speech (Committee of Language Reform 1962)
   –  A List of Four Thousand Words for Foreign Students (BLCU 1964)
   –  List of Commonly Used Characters (Beijing Municipality Commission of Education 1965): 3,100 characters


Corpora in Chinese lexicography
•  With the rapid development of corpus linguistics in general and Chinese language processing in particular, the long-standing tradition of studying word and character frequency in Chinese linguistics has been continued into the new millennium
   –  Liu (1973): Frequency Dictionary of Chinese Words
   –  Project Code 748 (1976): A Comprehensive Frequency Table of Character Usage in Modern Chinese
   –  Beihang University (1985): A Frequency Table of Character Usage in Modern Chinese
   –  BLCU (1986): A Frequency Dictionary of Modern Chinese
   –  National Language Committee (1988): Commonly Used Characters in Modern Chinese
   –  Hong Kong Polytechnic University (1991-1997): A Chinese Word Bank from Mainland China, Taiwan, and Hong Kong
   –  HSK Committee (1992, 2001): The HSK Lexical Syllabus
   –  Xiao, Rayson and McEnery (2009): A Frequency Dictionary of Mandarin Chinese


Roles of corpora in compiling New Word Dictionary for Chinese as a Foreign Language (Cui 2011: 85)


Corpora in Chinese lexicography
•  The study of neologisms is an important area of lexicography which can benefit greatly from the corpus approach
   –  Corpora can provide the necessary sources of data as well as the method for reasonably identifying new words or new meanings and usages of existing words
•  A Dictionary of New Words in Modern Chinese (Kang 2003)
   –  20,000 new words that gained currency and remained relatively stable in 1978-2000
   –  Based on a huge corpus composed of over 25 years’ archive data of some major newspapers and magazines
•  The Global Dictionary on Chinese Neologism (Tsou & You 2010)
   –  1,600 Chinese neologisms that have entered the Chinese language since 2000
   –  Based on the 400-million-character LIVAC corpus, specifically designed to monitor language development in Chinese speech communities


Corpora in Chinese lexicography
•  Parallel corpora in bilingual lexicography
•  Definition usually in the target language
•  Only partially equivalent to the headword
•  An abstract generalisation of the typical meanings of the word, hard to cover all of its meanings fully
•  Bilingual examples cited from parallel corpora can complement missing meanings


Corpora in Chinese lexicography
•  Specialised bilingual dictionaries: defining and translating the domain-specific usage of ordinary words
   –  E.g. in the business domain, the concept of 表 ‘table, form’ is conventionally expressed as ‘statement’ instead of ‘table, form’: ‘financial statement’ (财务报表), and ‘statement of income and expenses’ (财务收益与费用表)
•  Such issues are readily addressed with the help of parallel and comparable corpora of the languages involved
•  Specialised corpora are recognised as ideal linguistic and knowledge resources in lexicography
   –  Corpus-based specialised dictionaries can ensure a systematic coverage of useful headwords of practical value, accurate definitions, and appropriate authentic examples


ECL vs. CCL in lexicography
•  The use of corpora in Chinese lexicography dates back as early as it does in English lexicography
   –  Importance of defining basic Chinese characters and words from a huge lexicon
•  Challenges not encountered in ECL
   –  ‘Word segmentation’ or ‘tokenisation’, usually requiring complex computer processing
   –  Lack of strict correspondence between word classes and syntactic functions
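
Because written Chinese does not mark word boundaries with spaces, a corpus has to be segmented before word frequencies can be counted. The sketch below illustrates one simple baseline, forward maximum matching against a word list; it is not the method of any system mentioned in these slides, and the function name, the `max_len` parameter, and the toy lexicon are assumptions made for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Segment text by greedily taking the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real one would hold tens of thousands of entries.
toy_lexicon = {"研究", "研究生", "生命", "起源", "的"}
print(forward_maximum_match("研究生命的起源", toy_lexicon))
```

On this input the greedy match yields 研究生 / 命 / 的 / 起源 rather than the intended 研究 / 生命 / 的 / 起源 ("study the origin of life"), which shows why dictionary lookup alone is not enough and why segmentation usually calls for more complex processing.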


ECL vs. CCL in lexicography
•  Common issues to be solved in corpus-based lexicography for both English and Chinese
•  Balance and representativeness of corpora
   –  Usually very large (BoE, LIVAC, Gigaword English, Gigaword Chinese)
   –  Little or no spoken data
   –  Too much dependence on newspaper or newswire text
   –  A time lag of the corpus data behind the actual language change
•  Accuracy of corpus annotation
   –  Semantic annotation or word sense disambiguation (WSD), including word senses belonging to the same word class
•  A range of learner dictionaries have been published for both English and Chinese as the two most widely spoken languages in the world


ECL vs. CCL in lexicography

•  Corpus-based English lexicography appears to have been confined largely to monolingual English corpora
   –  A consequence of English monolingualism?
•  In contrast, CCL has helped to create both monolingual Chinese dictionaries and a range of bilingual dictionaries of Chinese with a foreign language such as English, French, German and Japanese
•  The CCL experience with bilingual lexicography suggests that this presents a fruitful avenue of development for ECL


ECL vs. CCL in lexicography
•  The advantages of using corpora in dictionary making, whether monolingual Chinese dictionaries or bilingual dictionaries, are self-evident
•  Corpora are a double-edged sword: if used inappropriately, they simply mean labour lost; worse still, they can lead to falsehood under a scientific and objective disguise (cf. Li 2008: 203)
   –  Sinclair (2004b: 2): “A corpus is not a simple object, and it is just as easy to derive nonsensical conclusions from the evidence as insightful ones.”


Corpus-based descriptive grammars
•  The development in ECL has redefined what a grammar is, leading not only to better English grammars, but also to a deeper awareness of the very real differences between the grammar of speech and writing
•  Three generations of English reference grammars in Quirk’s SEU tradition
   –  A Grammar of Contemporary English (Quirk et al. 1972)
   –  A Comprehensive Grammar of the English Language (Quirk et al. 1985)
   –  Longman Grammar of Spoken and Written English (Biber et al. 1999)
•  Carter and McCarthy’s spoken English grammar (CANCODE)
   –  Cambridge Grammar of English (Carter & McCarthy 2006)


Corpora in descriptive Chinese grammars
•  The development of CCL in Chinese descriptive grammars is still lagging far behind
•  Research in corpus-based descriptive grammars in Chinese is rather sporadic and fragmentary
•  Largely focused on specific linguistic features of interest to individual researchers



•  Huang & Ahrens (2003) study the relationship between nouns and nominal classifiers in Mandarin Chinese based on the data from the Academia Sinica Balanced Corpus of Modern Mandarin Chinese
•  Zhao (2010) studies the grammatical meanings and usage contexts of the function word 来着 laizhe in the 610 valid instances of the word in the PKU corpus of modern Chinese
•  Zhang (2010) is concerned with a syntactic and pragmatic analysis of a commonly used degree complement structure in Chinese, ‘X得很’, on the basis of the PKU corpus



•  Siewierska, Xu & Xiao (2010)
   –  A corpus-based study of splittable compounds (离合词) in interaction with morphology, syntax, and pragmatics, with the aim of producing a systematic and realistic account of splittable compounds as attested in two million words of authentic spoken and written Chinese data (LCMC, LLSCC)
•  Wang & Wang (2009)
   –  Approaching splittable compounds from a pedagogical perspective by discussing the implications of their research findings, based on the PKU Chinese corpus, for teaching Chinese as a foreign language



•  Xiao, McEnery & Qian (2006)
   –  A systematic account of passive constructions in Chinese in contrast with English, covering a range of characteristics of passives in the two languages
•  Xiao & McEnery (2008)
   –  Exploring negation in Chinese on the basis of spoken and written Chinese corpora
•  Xiao & McEnery (2004)
   –  First book-length corpus-based comprehensive account of aspect in Mandarin Chinese



•  Apart from the corpus studies of specific linguistic features in Chinese reviewed so far, there is hardly any descriptive grammar of Chinese based on or informed by corpora
•  Xiao & McEnery (2010) provides the first book-length corpus-based contrastive study of major grammatical categories that contribute to aspectual meaning in Chinese and English
   –  E.g. aspect markers, completive and durative temporal adverbials, quantifiers, passives, and negation, all contributing to aspectual meaning by interacting with situation aspect or viewpoint aspect in one way or another
•  This book is a research monograph more than a descriptive grammar of Chinese for general reference or pedagogical use



•  Knowledge bases or electronic dictionaries of Chinese grammar developed at PKU for use in automatic Chinese information processing
   –  Specifications for Basic Processing of Contemporary Chinese Corpus at Peking University (Yu & Duan 2002)
   –  Grammatical Knowledge Base of Contemporary Chinese (Yu 2003); Grammatical Knowledge Base of Chinese High Frequency Words (Zhu et al. 2004)
   –  Chinese Function Word Usage Knowledge Base (Liu et al. 2005)
   –  Modern Chinese New Words Information Electronic Dictionary (Kang 2002)
•  Providing useful grammatical information about Chinese for NLP, but not descriptive grammars in the sense expected of a reference grammar


ECL vs. CCL in descriptive grammars
•  A huge gap in research in descriptive grammars between English and Chinese
   –  Three generations of systematically and substantially corpus-based English reference grammars
      •  Quirk et al. 1972, Quirk et al. 1985, Biber et al. 1999
   –  Corpus-based Chinese descriptive grammars rather sporadic and fragmentary
      •  The first comprehensive and data-driven descriptive grammar is yet to be published: Cambridge Chinese Reference Grammar


ECL vs. CCL in descriptive grammars
•  The sharp contrast may be due to the fact that the development of corpus linguistics started with the English language, causing corpus-based research of Chinese grammar to lag behind in some areas
•  But a more important reason why corpus-based Chinese descriptive grammar lags so far behind is the separation between Chinese corpus research and linguistic research in the Chinese context


ECL vs. CCL in descriptive grammars
•  In China, members of the CLSC are almost exclusively foreign language teachers at universities
   –  More interested in ECL than CCL
•  Those working with Chinese corpora are usually computer specialists and computational linguists
   –  More interested in NLP technologies than linguistic theorisation
•  Cooperation and collaboration between these two groups of scholars, and therefore between their corpus building and analysis expertise and linguistic knowledge, would substantially facilitate the healthy development of CCL in the right direction
   –  An important area where CCL can learn from ECL


Learner English corpora
•  One of the most exciting developments in corpus linguistics that can be used to inform teaching directly or indirectly
•  The International Corpus of Learner English (ICLE), launched by Granger in 1990
   –  To date containing over 4.5 million words arranged in 16 subcorpora, each for a distinct L1 background
   –  LOCNESS: a comparable corpus of L1 English materials
•  Louvain International Database of Spoken English Interlanguage (LINDSEI)
   –  Currently containing only 100,000 words by 50 speakers with a French L1 background, but being expanded by a number of international teams
   –  LOCNEC: a comparable control corpus for exploring LINDSEI


Learner English corpora
•  Other learner English corpora
   –  The Uppsala Student English corpus (Swedish L1)
   –  The JEFLL corpus (Japanese L1)
   –  The Standard Speaking Test corpus (Japanese L1)
   –  The HKUST Corpus of Learner English (Cantonese L1)
   –  The Polish Learner English Corpus (Polish L1)
   –  The Chinese Learner English Corpus (Chinese L1)
   –  Chinese Learners’ Spoken English Corpus (Chinese L1)
   –  Spoken and Written English Corpus of Chinese Learners (Chinese L1)
   –  Parallel Corpus of Chinese EFL Learners (Chinese L1)
   –  Corpus of English Majors (Chinese L1)
   –  Longman Learner’s Corpus (various L1s)
   –  Cambridge Learner Corpus (various L1s)


Learner English corpora

•  The scope and systematic nature of work with learner corpora today allows for a much more wide-ranging and systematic exploration of learner data than the earlier studies from the 1970s
•  Nowadays learner English corpus research is undertaken extensively around the world, but ironically not notably in native English-speaking countries


Chinese interlanguage research
•  Interlanguage analysis of learner errors appears to be the focus of corpus-based Chinese teaching and learning
•  Rapid development of teaching Chinese as a foreign language (TCFL) since the mid 1990s has led to an increasingly pressing demand for Chinese interlanguage corpora to aid Chinese teaching and learning (cf. Ren 2010)
   –  Providing more direct and readily available help in teaching and learning (e.g. in terms of real-time error analysis, computer-aided teaching, pertinent exercises for individual learners, learning evaluation)
   –  Playing an increasingly important role in syllabus design, materials development, lexicography etc.


Chinese interlanguage research

•  The Chinese Interlanguage Corpus
   –  The earliest corpus of learner Chinese, created at BLCU in 1993-1995
   –  1,371 compositions by 740 students, 1.04m characters
   –  Encoded with 23 metadata features
   –  Annotated with POS and learner errors at character, word and sentence level


Chinese interlanguage research

•  The HSK Dynamic Composition Corpus
   –  Another BLCU corpus, of over 4.24m characters
   –  11,569 HSK compositions by learners of Chinese as a second or foreign language in 1992-2005
   –  Annotated with rich metadata and learner errors at character, word and sentence level
   –  Continuous data suitable for longitudinal study
   –  Available online: http://202.112.195.192:8060/hsk/login.asp


Chinese interlanguage research
•  The Mandarin Interlanguage Corpus (MIC)
   –  Created at the University of Hong Kong
   –  A total of 19 participants from two groups of year 2 students taking a two-year Certificate Course in Chinese Language
   –  A range of L1 backgrounds: English, Korean, Japanese, German, French, Tamil, Indonesian, Spanish, Dutch, and Thai
   –  Written data in the form of 88 short compositions ranging from 150 to 700 characters, depending on the genre type
   –  Spoken data from their short 1-2 minute presentations (60 hours) during in-class conversation and the examination
   –  Coming with a user-friendly online interface that allows a number of search options, including searching by source, word class, learner’s L1, and topic of the task


Chinese interlanguage research
•  Modern Interlanguage Chinese Corpus
   –  Comprising composition and sentence-making tasks, 10,135 sentences in total
   –  Collected twice per semester from Chinese studies students in years 2-4 at 6 Korean universities in 2004-2006
   –  http://jit.jj.ac.kr:8080/corpus/index.jsp
•  NTNU created a Chinese interlanguage corpus in 2004-2005
   –  41,053 sentences by 210 learners of Chinese (mostly L1 English)
   –  http://chinese.mtc.ntnu.edu.tw/moodle/mod/forum/discuss.php?d=210


Chinese interlanguage research
•  Learner Chinese corpora reported but not publicly available
   –  Ji’nan University: 3 million characters
   –  Nanjing Normal University: 900,000 characters
   –  Zhongshan University: 750,000 characters
•  Corpora planned or under construction
   –  NTNU: learner interlanguage corpus (LIC) of Chinese written and spoken texts
   –  Ludong University: country-specific (L1 Korean) Chinese interlanguage corpus with a target size of over 3m characters
   –  Shanghai Jiaotong University: Chinese composition corpus


Chinese interlanguage research
•  Examples of corpus-based studies analysing specific grammatical and lexical features of learners’ Chinese interlanguage
   –  Japanese learners’ acquisition of Chinese directional complements (Yang 2004)
   –  Learner errors with the negators 不 bu and 没有 meiyou (Yuan 2005a, 2005b)
   –  Learners’ acquisition of the Chinese 比 bi comparative structure (Wang 2005)
   –  Learner errors with Chinese idioms (chengyu) (Shi 2008)
   –  L1 Korean learners’ errors with the preposition 給 gei (Hua 2009)


Chinese interlanguage research
•  A focus of TCFL research, but a number of issues with current Chinese interlanguage research
   –  Very few existing Chinese interlanguage corpora in comparison with learner English corpora
   –  Rather small corpus size compared with general corpora
   –  Seriously biased towards Asian learners (Korean, Japanese, Southeast Asian)
   –  Mostly compositions under test conditions
   –  Lack of spoken data
   –  Consistency in manual error tagging
   –  Public availability
   –  Interlanguage research focusing on or confined to error analysis, overlooking interlanguage usage patterns


ECL and CCL in interlanguage research

•  While learner English corpus research has largely been undertaken in parts of the world other than major native English-speaking countries, research on Chinese interlanguage is highly concentrated in China
•  While learner English corpus research has covered the interlanguages of learners from an extensive range of L1 backgrounds, including languages both similar to and distinctly different from English, Chinese interlanguage research is highly unbalanced as regards learners’ L1s: it is essentially limited to learners from East and Southeast Asian countries, with learners whose first language is English or another European language markedly under-represented at present



•  English is the most popular and most widely learned second or foreign language in the world

•  Unsurprising that learner English has been studied more extensively than learner Chinese

•  When CCL seeks to address the numerous issues with Chinese interlanguage corpus research as noted earlier, it clearly has a lot to learn from the ECL experience with learner English research



•  A proposal to create the International Corpus of Learner Chinese as a joint research project between a number of universities in and outside China (Cui & Zhang 2011)
•  Target size of 50m characters, with an annotated written component of 20m characters and a raw text written component of 25m characters
•  Also 5m characters of spoken data, with an annotated component of 2m characters and a raw text component of 3m characters
•  Data to be collected from non-native Chinese learners including both Chinese majors and non-Chinese majors at beginner, intermediate, and advanced levels
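
The proposed size breakdown can be checked with simple arithmetic. This is a minimal sketch: the dictionary layout and variable names are illustrative assumptions, while the figures are those given in the proposal above, in millions of characters.

```python
# Proposed components of the corpus, in millions of characters.
proposed = {
    "written": {"annotated": 20, "raw": 25},
    "spoken": {"annotated": 2, "raw": 3},
}

# Subtotal per mode, overall total, and the share of the corpus that is annotated.
subtotals = {mode: sum(parts.values()) for mode, parts in proposed.items()}
total = sum(subtotals.values())
annotated_share = sum(parts["annotated"] for parts in proposed.values()) / total

print(subtotals, total, round(annotated_share, 2))
```

The written and spoken components come to 45m and 5m characters respectively, which matches the 50m target, and 22m of the 50m characters (44%) would be annotated.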


ECL and CCL in interlanguage research
•  Genres to cover narrative, argumentative, and expository types
•  Task types to include homework, exam script, HSK test etc.
•  Encoding of rich metadata about learners and about the text sample
•  Error tagging at various levels
•  Basic annotation including word tokenisation, POS tagging, sentence constituents, sentence type
•  Resulting corpus to be mounted at a dedicated website to allow registered users to search online, in addition to a CD edition to be published for use offline on standalone PCs


ECL and CCL in interlanguage research
•  Also a need to build a comparable native Chinese ‘control corpus’ to facilitate comparisons of learner Chinese with native Chinese
•  How can ECL contribute to Chinese interlanguage research?
   –  Chinese is often taught locally in major European and American countries such as the UK and the US, where ECL has also developed most rapidly
   –  Corpus linguists in these areas can contribute to Chinese interlanguage research by creating corpora of learner Chinese produced by their local native students
   –  To complement the existing interlanguage Chinese corpora created in China, facilitating contrastive analysis of interlanguages by learners from Asia and those from Europe and America


Conclusions

•  I hope the survey of the development of corpus linguistics in the Chinese context in the three areas reviewed will contribute to the further development of CCL
•  CCL and ECL clearly can inform and learn from each other, e.g. ECL experience of descriptive grammars for CCL, and CCL experience of multilingual lexicography for ECL

