PowerConc: An R-gram Based Corpus Analysis Tool Jiajin Xu & Yunlong Jia Beijing Foreign Studies University

Post on 29-Dec-2015

PowerConc: An R-gram Based Corpus Analysis Tool

Jiajin Xu & Yunlong Jia
Beijing Foreign Studies University

PowerConc

• National Research Centre for Foreign Language Education, Beijing Foreign Studies University
• A general-purpose tool for corpus analysis
• Developed in Delphi
• Can deal with any ANSI-encoded texts
  – E.g. on a Simplified Chinese OS
  – Works well with Simplified/Traditional Chinese texts, (un)tokenised or raw/POS-tagged, as well as raw/POS-tagged English texts

PowerConc

• Size: 1.5 MB; the compressed package is less than 1 MB
• Installation: none required
• OS: currently works only on Windows

Design principles for PowerConc

Ideally

• Most powerful: can do anything that a concordancer can do, and things that a concordancer cannot do
• Involves the least effort in learning to use it
• Doing MORE with less
• Reductionism in software design

Fewer buttons and/or tabs

• Frequency count
• Search
• List

Freq. Count

• Concordance
• N-gram list
• Collocation & Colligation
• Key n-gram list

More possibilities in tool development

• Corpus-informed/related ‘grammars’
  – Pattern grammar (local grammar)
  – Collostruction
  – Lexical grammar (natural grammar, real grammar)
  – Lexical priming (textual colligation)
  – Longman grammar: Biber et al. grammar register variation
• Tool development lags behind

From phraseology to R-gram

• Many of these ‘grammars’ can be seen as some sort of phraseology
• We coined a technical term, ‘R-gram’.
  – An operational parallel to phraseology
  – The unit of language can be words, lemmata, phrases, POS, POS sequences, and combinations of all these.
  – Can be linguistic structures with uncertain words or categories (e.g. be passive/get passive).

• a * of: collocational framework
• It be ADJ that: evaluative construction
• Noun-noun compounds
• Bi-nominal constructions
• Passive constructions: be/get ADV. V-EN
• All these could be matched with Regular Expressions.
• But Regex is too difficult for lay users.
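
To give a sense of what writing such regular expressions by hand involves, here is a minimal Python sketch that matches two of the patterns above in text tagged in a word_TAG format. The sample text, the CLAWS-style tags (AT1, NN1, IO and so on) and all variable names are assumptions made for illustration; they are not taken from PowerConc.

```python
import re

# Hypothetical sample in word_TAG format with CLAWS-style tags (an assumption).
tagged = (
    "a_AT1 number_NN1 of_IO corpus_NN1 tools_NN2 "
    "support_VV0 a_AT1 lot_NN1 of_IO languages_NN2"
)

# Collocational framework "a * of": fixed word, any one token, fixed word.
a_x_of = re.compile(r"\ba_\S+\s\S+_\S+\sof_\S+", re.IGNORECASE)

# Noun-noun compound: two adjacent tokens whose tags both start with N.
noun_noun = re.compile(r"\S+_N\S*\s\S+_N\S*")

framework_hits = [m.group(0) for m in a_x_of.finditer(tagged)]
compound_hits = [m.group(0) for m in noun_noun.finditer(tagged)]

print(framework_hits)  # ['a_AT1 number_NN1 of_IO', 'a_AT1 lot_NN1 of_IO']
print(compound_hits)   # ['corpus_NN1 tools_NN2']
```

Even these two short patterns require the user to know the tag format, the tagset and regex escaping, which is the difficulty the last bullet refers to.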

Easy search with enhanced hits

• Smart Input
• Three meta-characters in Smart Input syntax, the simplest grammar ever.
• @be returns all inflectional forms of ‘be’
• #n returns all nouns
• * refers to any single word
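
One way to picture how the three meta-characters could map onto regular expressions is the following Python sketch. It is not PowerConc's implementation (the tool is written in Delphi and its internals are not described here). The lemma table, the mapping from category names to tag prefixes, the word_TAG text format and the function name `smart_to_regex` are all hypothetical.

```python
import re

# Hypothetical lemma table; a real tool would need a full lemma list.
LEMMA_FORMS = {"be": ["be", "am", "is", "are", "was", "were", "been", "being"]}

# Hypothetical mapping from category names to tag prefixes
# (CLAWS-style tags assumed: noun tags start with N, adjective tags with J).
TAG_PREFIX = {"n": "N", "adj": "J"}


def smart_to_regex(query: str) -> str:
    """Translate a Smart Input style query into a regex over word_TAG text."""
    parts = []
    for item in query.split():
        if item == "*":
            # * stands for any single word
            parts.append(r"\S+_\S+")
        elif item.startswith("@"):
            # @lemma expands to every listed inflectional form
            forms = "|".join(re.escape(f) for f in LEMMA_FORMS[item[1:]])
            parts.append(rf"(?:{forms})_\S+")
        elif item.startswith("#"):
            # #category matches any word whose tag starts with the prefix
            prefix = TAG_PREFIX[item[1:]]
            parts.append(rf"\S+_{prefix}\S*")
        else:
            # a literal word form carrying any tag
            parts.append(re.escape(item) + r"_\S+")
    # Anchor the pattern at token boundaries on both sides.
    return r"(?<!\S)" + r"\s".join(parts) + r"(?!\S)"


tagged = "It_PPH1 is_VBZ clear_JJ that_CST corpus_NN1 tools_NN2 help_VV0"
pattern = re.compile(smart_to_regex("it @be #adj that"), re.IGNORECASE)
hits = [m.group(0) for m in pattern.finditer(tagged)]
print(hits)  # ['It_PPH1 is_VBZ clear_JJ that_CST']
```

The point of the sketch is the division of labour: the user types four short items, and the expansion into tag-aware regex is done by the tool.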

• a * of => a * of
• It be ADJ that => It @be #adj that
• Noun noun compound => #n #n
• Bi-nominal => #n and #n
• Passive => \S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N
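
The passive pattern can be tried out directly. The Python sketch below applies the regular expression from the slide, copied unchanged, to a short sample. The sample sentences and their CLAWS-style tags (VBDZ, XX, RR, VVN) are assumptions; the pattern appears to target a tagset in which forms of "be" carry tags beginning with VB and past participles carry tags beginning with V and ending in N.

```python
import re

# The passive pattern from the slide, copied unchanged.
PASSIVE = re.compile(r"\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N")

# Hypothetical sample; the CLAWS-style tags are an assumption.
tagged = (
    "The_AT report_NN1 was_VBDZ not_XX carefully_RR written_VVN ._. "
    "She_PPHS1 is_VBZ happy_JJ ._."
)

# group(0) is the whole match; the pattern's capturing group only holds
# the last intervening token, so findall() would not return full matches.
passives = [m.group(0) for m in PASSIVE.finditer(tagged)]
print(passives)  # ['was_VBDZ not_XX carefully_RR written_VVN']
```

The second sentence is not matched because no past participle follows the form of "be".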

Limitation

• Speed
• A concordancer that does not use indexing
• Cannot process texts larger than a few million words anyway.
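
The speed limitation follows from the lack of indexing. As a generic illustration, the Python sketch below contrasts a linear scan, which re-reads the whole corpus for every query, with an inverted index that is built once and then answers each lookup directly. The toy corpus and all names are hypothetical, and the code says nothing about PowerConc's internals beyond the slide's statement that no index is used.

```python
from collections import defaultdict

# Toy corpus (hypothetical); a real corpus would hold millions of tokens.
tokens = "the cat sat on the mat because the cat was tired".split()


def scan(word):
    """Linear scan: every query walks through the entire corpus."""
    return [i for i, tok in enumerate(tokens) if tok == word]


# Inverted index: one pass over the corpus maps each word to its positions.
index = defaultdict(list)
for i, tok in enumerate(tokens):
    index[tok].append(i)

scan_hits = scan("cat")
index_hits = index.get("cat", [])
print(scan_hits, index_hits)  # [1, 8] [1, 8]
```

Both approaches return the same positions; the difference is that the scan's cost grows with corpus size on every query, while the index pays that cost once at build time.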

Download PowerConc

• www.fleric.org.cn/powerconc/
• http://www.bfsu-corpus.org/channels/tools

Thank you!