Corpus methods in linguistics and NLP
Lecture 1: Introduction

UNIVERSITY OF

GOTHENBURG

Richard Johansson

November 3, 2015

overview

course-related matters

overview of corpora

types of corpora

collecting corpora

adding structure to the text

quick overview of search tools

teaching: technical and nontechnical tracks

- MLT students and doctoral students in NLP follow the technical track
- the others follow the nontechnical track

lectures

- 8 lectures in total (possibly 9)
- the first four are relevant to all:
  - introduction and overview (this lecture)
  - annotation: adding linguistic information to corpora
  - treebanks: syntactically annotated corpora
  - quantitative methods & statistics
- lectures mainly for an NLP audience:
  - distributional semantics: discovering word meaning
  - large-scale data processing
  - (possibly one on corpora in information retrieval)
- lectures mainly for a linguistic audience:
  - historical corpora: why more difficult than modern?
  - (possibly: introduction to clustering and topic modeling)
  - quantitative investigations in syntax

literature

- we will use a few parts of the online book edited by Wynne (2004): Developing Linguistic Corpora: a Guide to Good Practice
  http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm
- in addition, the web page contains pointers to a number of articles

assignments

technical nontechnical

designing annotation • •
searching Swedish corpora •
searching in treebanks • •
finding collocations •
distributional semantics •
using Spark (VG)
project (PhD) •

- the first three assignments are already out
- we will discuss them the next few times we meet

project

- pick a corpus-related topic related to your research
- formulate a research question
- carry out an investigation (quantitative or qualitative)
- write a report (December–January)
- present it at a seminar (December)

what is a corpus?

- a corpus (pl. corpora; Swedish en korpus–korpusar) is a collection of authentic text
  - selected,
  - annotated (linguistically analyzed),
  - and computerized (stored in an electronic form),
  - for a specific purpose (but can of course be reused for other purposes)

why use corpora in natural language processing?

- in development of NLP systems:
  - "training" of statistical/machine learning systems, e.g. estimating HMM probabilities for a tagger
  - ... but also development of knowledge-based systems, e.g. Grammatical Framework
- evaluation:
  - what is the accuracy of our tagger?
- "discovery":
  - collocations: discovering patterns
  - clustering of words, documents, ...
  - distributional vectors
  - topic models
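To make "training" concrete: a minimal sketch of estimating HMM transition and emission probabilities as relative frequencies in a tagged corpus. The toy corpus, the tag names, the start symbol `<s>` and the function name `estimate_hmm` are made up for illustration; a real tagger would also need smoothing for unseen events.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency (maximum-likelihood) estimates for an HMM tagger."""
    transitions = Counter()   # counts of (previous tag, tag)
    emissions = Counter()     # counts of (tag, word)
    prev_counts = Counter()   # how often each tag occurs as the previous tag
    tag_counts = Counter()    # how often each tag occurs
    for sentence in tagged_sentences:
        prev = "<s>"          # hypothetical start-of-sentence symbol
        for word, tag in sentence:
            transitions[prev, tag] += 1
            prev_counts[prev] += 1
            emissions[tag, word.lower()] += 1
            tag_counts[tag] += 1
            prev = tag
    p_trans = {(p, t): n / prev_counts[p] for (p, t), n in transitions.items()}
    p_emit = {(t, w): n / tag_counts[t] for (t, w), n in emissions.items()}
    return p_trans, p_emit

# Toy tagged corpus, invented for this example.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
p_trans, p_emit = estimate_hmm(toy_corpus)
print(p_trans["DET", "NOUN"])   # estimate of P(NOUN | DET)
print(p_emit["NOUN", "dog"])    # estimate of P(dog | NOUN)
```

The estimates are only as good as the corpus they are counted from, which is why the sampling issues later in the lecture matter for NLP as well.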

but how about linguistics (and the humanities in general)?

- linguistics is much more empirical nowadays than before
- corpus methods allow the linguist to work more efficiently and objectively
- ... but also to pose new research questions

criticism of the use of corpora in linguistics

- Chomsky (recently): "Corpus linguistics doesn't mean anything. It's like saying suppose a physicist decides [...] that instead of relying on experiments, what they're going to do is take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights."
- but it's not clear what would be the linguistic counterpart of a controlled experiment in the "hard sciences"
  - introspection? can hardly be called scientific!
  - interviews? should be OK, but costly, and limited
  - eye tracking, etc? also OK, but even more costly and limited

typical uses of corpora (1)

- quantitative investigations, most importantly frequencies:
  - syntax: are short noun phrase objects more often fronted?
  - sociolinguistics: is the passive voice more frequent among academics?
  - lexicography: what are the senses of the English word line?
  - dialectology: how common is the word bamba as a function of the distance from Gothenburg?
  - press history: how often was Napoleon mentioned in Swedish newspapers from the early 19th century?
  - cultural history: what crimes were people convicted of in 17th-century Stockholm?

- finding attestations:
  - is that word used nowadays?
  - is it possible to extract more than one wh-pronoun in English?

- cultural heritage:
  - documenting a moribund language
  - giving the people access to older stages of the language, e.g. old law texts, runestones

but why not just use Google?

- probably the largest "corpus" around
- but Google searches are not reproducible
  - Kilgarriff (2007): Googleology is bad science
    https://www.kilgarriff.co.uk/Publications/2007-K-CL-Googleology.pdf
- we know nothing about the selection of the data
- but still nice if we just want to find an attestation: Can you say that?
- ... and a rough indication of relative frequencies
  - is believe in more common than believe on?

short and biased history of (computerized) corpora

- Index Thomisticus (1946–2005)
  - http://www.corpusthomisticum.org
- Brown Corpus (1967)
- Press-65 (1970)
- early treebanking: Talbanken / MAMBA (1976–78)
- spoken: London–Lund (1980)
- parallel: HANSARD (∼1990)
- Penn Treebank (1995)
- "Web as a Corpus" (∼2003) (SweWAC 2010)

types of corpora: a few dimensions

- modality: written, spoken, sign language, multimodal, ...?
- genre / domain: fiction, news, debates, ...?
- speaker/writer: "normal", child, aphasic, learner, ...?
- language(s): one, two, many?
  - and if more than one: what relation between languages? parallel, comparable?
- size
- annotation: just words, syntax, dialogue, ...?

some examples of corpora: typical mixed "general language"

- Stockholm–Umeå Corpus (general written Swedish):
  - 500 texts, ∼2000 words each
  - 9 main genres, with subgenres:
    - K – imaginative prose
      - KK – general fiction
      - KL – science fiction and mystery
      - KN – light reading
      - KR – humour
- Brown corpus (Francis and Kučera, 1964)
- British National Corpus

some examples of corpora: limited amount

- written Finnish Romani (Borin, 2000):
  - ∼110,000 words
  - a significant part of the total written production in this language
  - http://k2xx.spraakdata.gu.se/personal/lars/pblctns/lrec2-romani.pdf
- Gothic: the Silver Bible and a few fragments

some examples of corpora: parallel

- Europarl: http://www.statmt.org/europarl
  - proceedings of the European Parliament 1996–2006
  - the proceedings are translated into all the member languages
  - 20 languages, ∼10–50 million words / language
  - all languages linked sentence by sentence to English

  EN: Technical requirements for inland waterway vessels (vote)
  SV: Tekniska föreskrifter för fartyg i inlandssjöfart (omröstning)

- a similar corpus: the Canadian Hansard (English–French)
- the Bible (or parts of it) in 100 languages: http://homepages.inf.ed.ac.uk/s0787820/bible/
- UN charters ...

some examples of corpora: comparable

- Wikipedia (http://www.wikipedia.org, ...)
- linked article–article (several hundred languages)
- the texts are related but (often) not direct translations
- also contains useful semi-structured information

Sweden, officially the Kingdom of Sweden, is a Scandinavian country in Northern Europe. It borders Norway and Finland, and is connected to Denmark by a bridge-tunnel across the Öresund. At 450,295 square kilometres (173,860 sq mi), Sweden is the third-largest country in the European Union by area, with a total population of over 9.8 million. Sweden consequently has a low population density of 21 inhabitants per square kilometre (54/sq mi), with the highest concentration in the southern half of the country. ...

La Svèsia (449 964 km2; 9 082 995) la xe un Stado del nord de l'Eoropa inte la Penisola Scandinava. La so cavedal la xe Stocolma che La gà 765 044 abitanti. Altre cità importante le xe Göteborg, Uppsala, Malmö e Lund. La confina co la Norveja a nord-ovest e co la Finlandia a est; la xe coligada co la Danimarca traverso del Ponte del Øresund. La xe bagnà dal Mar Baltego. La ga na densità demografega pitosto bassa e despensada iregolarmente. El so teritorio el xe sior de legne, fero e aqua. I Svedesi i gode de un bon livel de vita ...

some examples of corpora: parent–child dialogues

- CHILDES (http://childes.psy.cmu.edu)

*NOR: did you wash your hands Jinny ?
*JIN: yeah .
*NOR: you did already ?
*JIN: already .
*NOR: you forgot the candy off from around your mouth huh ?
<n is wiping J's face>
*JIN: yeah .
*NOR: <laughs> okay <% N goes back to sit down> .
*JIN: do I look dirty ?

selection of a corpus

- top-down:
  - why? what is the purpose of the corpus?
  - what? what type of language are we interested in?
- bottom-up:
  - from where? what material can we access?
  - how? what is the method for gathering the material?

corpus as a sample

- a corpus is a sample in the statistical sense
- the situation is similar to carrying out an opinion poll
  - we select a representative and sufficiently large sample from a well-defined population
  - then we can carry out investigations with some degree of statistical certainty

what's the population? what's "representative"?

- the sample is representative if what's true about the sample is also true in general
- but: is the New York Times representative of "English in general"?
  - probably more meaningful to speak about whether it's representative of a genre
- in practice, it's hard to determine whether a corpus is representative
  - Clear (1992) Corpus sampling
  - Biber (1993) Representativeness in corpus design
    http://otipl.philol.msu.ru/media/biber930.pdf
- corpus collectors sometimes also try to make the corpus "balanced", which is equally hard to define
- it is probably more important to document the composition

the effect of sampling on NLP systems

- the way the corpus has been sampled also has implications in NLP
  - at least in data-driven NLP systems that observe a corpus
- well-known example: the WSJ part of the Penn Treebank, often used for developing syntactic parsers in English
  - vocabulary effects
  - the relative frequencies of e.g. PoS tags differ between genres
  - constructions: for instance, questions are rare in the WSJ

example: the written part of the BNC

http://www.natcorp.ox.ac.uk

DOMAIN                     %
Imaginative            21.91
Arts                    8.08
Belief and thought      3.40
Commerce/finance        7.93
Leisure                11.13
Natural/pure science    4.18
Applied science         8.21
Social science         14.80
World affairs          18.39
Unclassified            1.93

TIME                       %
1960-74                 2.26
1975-93                89.23
Unclassified            8.49

MEDIUM                     %
Book                   58.58
Periodical             31.08
Misc. published         4.38
Misc. unpublished       4.00
To-be-spoken            1.52
Unclassified            0.40

example: the spoken part of the BNC

REGION                     %
South                  45.61
Midlands               23.33
North                  25.43
Unclassified            5.61

CONTEXT-GOVERNED           %
Educational            20.56
Business               21.47
Institutional          21.86
Leisure                23.71
Unclassified           12.38

INTERACTION                %
Monologue              18.64
Dialogue               74.87
Unclassified            6.48

availability of text

- sometimes, we don't have the luxury of selecting a "representative" sample: we just have to take what we can get
  - for instance Finnish Romani, Runic Swedish, ...
- copyright issues:
  - published on the web ≠ freely available!
  - there is a risk that the work you do will be wasted
    - ESPC, the English–Swedish Parallel Corpus
    - Twitter corpora
- also, technical issues:
  - fiction is harder to access than web-published text (news, blogs, ...)
  - so at the Department of Swedish we have a huge amount of web data but much smaller amounts of fiction, academic text, etc.

example: the Koala corpus of contemporary Swedish

- in the recent Koala corpus, we selected five subcorpora where we were sure that the texts were legally kosher
  - blogs: the authors agreed to release their texts
  - fiction: we selected text under a Creative Commons license
  - European parliament proceedings: they are public
  - Wikipedia: everything under CC
  - government press releases

making the text usable for automatic tools

- to be able to carry out quantitative research about text, we need to add more information to the texts
  - information about the text: metadata
  - information inside the text: annotation
- but first, what do we mean by the "text"?

storing language in our computers

- plain text file: the file contains only the "letters"
  - this is what we tend to use when processing corpora
- rich text: not only the "letters", but also formatting information such as font, size, color
  - for instance Word, PDF, HTML, ...
  - will typically be converted into plain text before inclusion in a corpus
- other media: require complex conversion methods
  - scanned page: requires OCR
  - recording: requires speech recognition

Pierre Vinken, 61 years old, will join the board as nonexecutive director Nov. 29.

the anatomy of a plain text file

- in a text file, the "letters" (the character symbols) are stored in a sequence
- the character symbols are defined by an international standard called Unicode
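A small illustration of "a sequence of character symbols": in Python, a string is a sequence of Unicode characters, and each character has a code point and an official name. The example word is chosen arbitrarily.

```python
import unicodedata

text = "Umeå"
# Each character in the string is one Unicode code point.
code_points = [ord(ch) for ch in text]
for ch in text:
    # Print the character, its code point in U+XXXX notation, and its Unicode name.
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
```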

character encoding

- when the text is stored in a file (or transmitted over the Internet), the character symbols are converted into bytes (numbers)
- there are several encoding systems
  - nowadays, UTF-8 dominates: it can handle all of Unicode
  - a few years back, many language-specific encodings ...
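A short sketch of the conversion from characters to bytes, comparing UTF-8 with Latin-1 (ISO 8859-1), used here as one example of an older encoding that covers Swedish. The example word is arbitrary.

```python
word = "Göteborg"

utf8_bytes = word.encode("utf-8")      # ö becomes two bytes in UTF-8
latin1_bytes = word.encode("latin-1")  # ö becomes one byte in Latin-1

print(len(word), len(utf8_bytes), len(latin1_bytes))
print(list(utf8_bytes))
print(list(latin1_bytes))

# Decoding with the same encoding gives the original string back.
roundtrip_ok = utf8_bytes.decode("utf-8") == word
```

The same text thus corresponds to different byte sequences depending on the encoding, so a program reading a file has to know which encoding was used.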

when we accidentally use the wrong encoding. . .
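A minimal sketch of how this happens: text written as UTF-8 but read as Latin-1 turns every non-ASCII letter into two wrong characters. The example string is arbitrary, and Latin-1 stands in for whatever wrong encoding a program might assume.

```python
original = "Färden från Lund"

# Written as UTF-8, but (wrongly) read as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# In this particular case the damage is reversible by undoing the wrong step.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```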

are there letters for my language?

- http://www.unicode.org/charts
- most known writing systems (living and dead) are now standardized in Unicode
  - for instance, several kinds of runes (see http://www.unicode.org/charts/PDF/U16A0.pdf) but not some lesser-known types (e.g. Dalrunor)
- there are also a number of semi-standardized writing systems

some languages require more complex rendering

- Arabic is drawn from right to left, and the shape of the letter depends on its position in the word:
  kaf, teh, 'alef, beh →
- Indic scripts (e.g. Devanagari), and Hangul (Korean) combine vowel symbols and consonant symbols in irregular ways:
- but even in the Latin script, we use ligatures such as ﬁ and ﬂ

processing rich text formats

- rich text formats such as Word and PDF are difficult to handle for corpus processing tools, so we typically want to convert them into plain text
- examples:
  - pdftotext: PDF → plain text
  - wvtext: Word → plain text
- the output from these tools often needs a bit of polishing

using text from the web

- "boilerplate removal": http://code.google.com/p/boilerpipe

from image to text: Optical Character Recognition

⇒

1.

Lund – Halmstad.

Färden från Lund gjordes genom Engelholm till Margretetorp. Uppkomne på höjden af Hallandsås nedsände vi afskedstagande blickar öfver en bördig sträckning af Skåne. Engeltoftas välbyggda sätesgård med dithörande utbrytningar och skimrande kyrkor i förening med täcka lundar pryda den vackra slätten. ...

when OCR goes wrong. . .

- from Dalpilen (1923): "Talet utmynnade i ett rungande teve för jubilaren, ännu så kraftfull och ungdomlig."
  (the OCR has read leve 'cheer' as teve 'TV')

when OCR goes wrong (again). . .

- let's look at the popularity of funk music using the Google Ngram Viewer: http://books.google.com/ngrams
  http://languagelog.ldc.upenn.edu/nll/?p=2848

metadata: information about the document as a whole

- language
- creation or publication time
- the author:
  - native language
  - location
  - gender
  - age
- genre
- modality: written? spoken?
- topic classification (e.g. library classification system)

- to be able to do anything interesting with the text, we need to go beyond the letters and add some structure
- typically, we start by segmenting (splitting) the text into manageable pieces: sentences and words
- then, we add linguistic annotation: morphology, syntax, discourse, ...

sentence splitting and word tokenization

- the first step in text processing is typically to split the text into sentences and words (tokens)
- how do you think this could be done automatically?
- typical method for sentence splitting: look for end-of-sentence punctuation followed by a capital letter
- typical method for word segmentation (tokenization): look for letters between spaces or punctuation
- we also need to define what we mean by a "word"!
  - is cannot one word or two?
  - how about don't or Mary's?
  - how about clitics in Romance languages, e.g. Italian farcela?
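The two "typical methods" above can be sketched with regular expressions. This is a deliberately naive illustration, not a recommended tool: the function names and the exact patterns are choices made for this example.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ., ! or ? and precedes a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-ZÅÄÖ])", text.strip())

def tokenize(sentence):
    # A token is a run of letters/digits, or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Pierre Vinken will join the board. He is 61 years old! Is that so?"
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)

# The definition of a "word" matters: this tokenizer splits at the apostrophe.
print(tokenize("Mary's dog"))
```

With these patterns, Mary's comes out as three tokens, which may or may not be what a given investigation needs.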

tokenization and sentence splitting: tricky cases

- automatic tokenization and sentence splitting can be done quite reliably for "normal" text
- but there are some corner cases ...
  "... an account with the U.S. Treasury to buy Savings Bonds online ..."
  "... then I went back to the U.S. My dad and I moved ..."
- hyphenated words: should we remove the hyphen?
- tokenization in languages that don't use spaces is nontrivial:
  (example "borrowed" from Liang Huang)
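The U.S. corner case can be reproduced with a naive splitter that looks for punctuation followed by a capital letter. The test text is made up, modelled on the two fragments above; the function name is hypothetical.

```python
import re

def naive_split(text):
    # Punctuation followed by whitespace and a capital letter ends a sentence.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

text = ("I opened an account with the U.S. Treasury. "
        "Then I went back to the U.S. My dad and I moved.")
pieces = naive_split(text)
for piece in pieces:
    print(repr(piece))
```

The text has three sentences, but the splitter returns four pieces: it breaks wrongly after the first "U.S." and correctly after the second. A list of known abbreviations does not settle it, since the same abbreviation is sentence-final in one case and not in the other.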

adding linguistic annotation

- if we want to carry out more complex linguistic investigations, we need linguistic annotation
  - part-of-speech tags: "this is a noun"
  - morphological analysis: "it is in the singular"
  - syntactic analysis: "it is the subject of the sentence"
  - ...
- the linguistic annotation can be added manually (high cost, high quality) or automatically (low cost, low quality)
- more about annotation in the next two lectures!

corpus statistics

- frequencies:
  - which word is most frequent?
  - which words are most typical of this corpus, compared to the "general" language?
  - development over time (neologisms, new senses)
  - words, phrases, syntactic constructions, ...
- co-occurrences:
  - which words tend to occur close to the word jazz?
  - which verbs take the noun cake as an object?
  - how do we normally translate the word case from English into Swedish?
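A minimal sketch of the two simplest statistics, word frequencies and co-occurrence within a window, on a made-up token list. The function name `neighbours` and the window size are assumptions for this example; questions about objects of verbs or translations need annotation or a parallel corpus and are not covered by it.

```python
from collections import Counter

# Toy "corpus", invented for this example and already tokenized.
tokens = "she plays jazz piano and he plays jazz guitar in a jazz club".split()

# Frequencies: which word is most frequent?
freq = Counter(tokens)
print(freq.most_common(2))

def neighbours(tokens, target, window=2):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# Co-occurrences: which words tend to occur close to the word jazz?
near_jazz = neighbours(tokens, "jazz")
print(near_jazz.most_common(3))
```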

concordancers

- a concordancer creates a concordance: a list of the contexts where a search term appears
- today, many concordancers are web-based
  - for instance, Korp http://spraakbanken.gu.se/korp
- the pioneer: Hugues de St. Cher (d. 1263) and 500 monks
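What a concordancer computes can be sketched in a few lines: for each hit, keep a few tokens of left and right context (often called keyword in context). This is a toy illustration on a made-up sentence, not how Korp or any other tool is implemented; the function name and the `width` parameter are assumptions.

```python
def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "The band played jazz all night and the jazz was loud".split()
hits = concordance(tokens, "jazz")
for left, keyword, right in hits:
    # Right-align the left context so that the keywords line up.
    print(f"{left:>20} [{keyword}] {right}")
```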

stand-alone concordance / search tools

- AntConc: http://www.laurenceanthony.net/software.html
- Wordsmith (Windows only): http://www.lexically.net/wordsmith/
- KH coder: http://khc.sourceforge.net/en/

purpose of concordances

- overview of the usage of a word: its senses, etc.
- "typical" usage, e.g. for examples in dictionaries
- common co-occurrences
- finding idioms and collocations

structural searches

- for instance TIGERSearch (lecture 3 and assignment)
- ANNIS (http://corpus-tools.org/annis/) is a web-based alternative

structural searches (2)

- another example with Korp:

time, location, . . .

- occurrences of the word kommunism ('communism') in Swedish newspapers 1820–1920

next lecture: annotation

- adding linguistic information to corpora: annotation
- describing the linguistic model
- managing the annotation project
- measuring annotation reliability
- quick survey of tools for manual and automatic annotation
