Corpus methods in linguistics and NLP
Lecture 1: Introduction
UNIVERSITY OF
GOTHENBURG
Richard Johansson
November 3, 2015
overview
course-related matters
overview of corpora
types of corpora
collecting corpora
adding structure to the text
quick overview of search tools
teaching: technical and nontechnical tracks
- MLT students and doctoral students in NLP follow the technical track
- the others follow the nontechnical track
lectures
- 8 lectures in total (possibly 9)
- the first four are relevant to all:
  - introduction and overview (this lecture)
  - annotation: adding linguistic information to corpora
  - treebanks: syntactically annotated corpora
  - quantitative methods & statistics
- lectures mainly for an NLP audience:
  - distributional semantics: discovering word meaning
  - large-scale data processing
  - (possibly one on corpora in information retrieval)
- lectures mainly for a linguistic audience:
  - historical corpora: why more difficult than modern?
  - (possibly: introduction to clustering and topic modeling)
  - quantitative investigations in syntax
literature
- we will use a few parts of the online book edited by Wynne (2004): Developing Linguistic Corpora: a Guide to Good Practice
- in addition, the web page contains pointers to a number of articles
assignments
technical nontechnical
designing annotation • •
searching Swedish corpora •
searching in treebanks • •
finding collocations •
distributional semantics •
using Spark (VG)
project (PhD) •
- the first three assignments are already out
- we will discuss them the next few times we meet
project
- pick a corpus-related topic related to your research
- formulate a research question
- carry out an investigation (quantitative or qualitative)
- write a report (December–January)
- present it at a seminar (December)
what is a corpus?
- a corpus (pl. corpora; Swedish en korpus, korpusar) is a collection of authentic text
  - selected,
  - annotated (linguistically analyzed),
  - and computerized (stored in an electronic form),
  - for a specific purpose (but can of course be reused for other purposes)
why use corpora in natural language processing?
- in development of NLP systems:
  - "training" of statistical/machine learning systems, e.g. estimate HMM probabilities for a tagger
  - ... but also development of knowledge-based systems, e.g. Grammatical Framework
- evaluation:
  - what is the accuracy of our tagger?
- "discovery":
  - collocations: discovering patterns
  - clustering of words, documents, ...
  - distributional vectors
  - topic models
but how about linguistics (and the humanities in general)?
- linguistics is much more empirical nowadays than before
- corpus methods allow the linguist to work more efficiently and objectively
- ... but also to pose new research questions
criticism of the use of corpora in linguistics
- Chomsky (recently): "Corpus linguistics doesn't mean anything. It's like saying suppose a physicist decides [...] that instead of relying on experiments, what they're going to do is take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights."
- but it's not clear what would be the linguistic counterpart of a controlled experiment in the "hard sciences"
  - introspection? can hardly be called scientific!
  - interviews? should be OK, but costly, and limited
  - eye tracking, etc? also OK, but even more costly and limited
typical uses of corpora (1)
- quantitative investigations, most importantly frequencies:
  - syntax: are short noun phrase objects more often fronted?
  - sociolinguistics: is the passive voice more frequent among academics?
  - lexicography: what are the senses of the English word line?
  - dialectology: how common is the word bamba as a function of the distance from Gothenburg?
  - press history: how often was Napoleon mentioned in Swedish newspapers from the early 19th century?
  - cultural history: what crimes were people convicted of in 17th-century Stockholm?
typical uses of corpora (2)
- finding attestations:
  - is that word used nowadays?
  - is it possible to extract more than one wh-pronoun in English?
typical uses of corpora (3)
- cultural heritage:
  - documenting a moribund language
  - giving the people access to older stages of the language, e.g. old law texts, runestones
but why not just use Google?
- probably the largest "corpus" around
- but Google searches are not reproducible
  - Kilgarriff (2007): Googleology is bad science
- we know nothing about the selection of the data
- but still nice if we just want to find an attestation: Can you say that?
- ... and a rough indication of relative frequencies
  - is believe in more common than believe on?
short and biased history of (computerized) corpora
- Index Thomisticus (1946–2005)
  - http://www.corpusthomisticum.org
- Brown Corpus (1967)
- Press-65 (1970)
- early treebanking: Talbanken / MAMBA (1976–78)
- spoken: London–Lund (1980)
- parallel: HANSARD (∼1990)
- Penn Treebank (1995)
- "Web as a Corpus" (∼2003) (SweWAC 2010)
types of corpora: a few dimensions
- modality: written, spoken, sign language, multimodal, ...?
- genre / domain: fiction, news, debates, ...?
- speaker/writer: "normal", child, aphasic, learner, ...?
- language(s): one, two, many?
  - and if more than one: what relation between languages? parallel, comparable?
- size
- annotation: just words, syntax, dialogue, ...?
some examples of corpora: typical mixed "general language"
- Stockholm–Umeå Corpus (general written Swedish):
  - 500 texts, ∼2000 words each
  - 9 main genres, with subgenres:
    - K: imaginative prose
      - KK: general fiction
      - KL: science fiction and mystery
      - KN: light reading
      - KR: humour
- Brown corpus (Francis and Kučera, 1964)
- British National Corpus
some examples of corpora: limited amount
- written Finnish Romani (Borin, 2000):
  - ∼110,000 words
  - a significant part of the total written production in this language
- Gothic: the Silver Bible and a few fragments
some examples of corpora: parallel
- Europarl: http://www.statmt.org/europarl
  - proceedings of the European Parliament 1996–2006
  - the proceedings are translated into all the member languages
  - 20 languages, ∼10–50 million words / language
  - all languages linked sentence by sentence to English
EN: Technical requirements for inland waterway vessels (vote)
SV: Tekniska föreskrifter för fartyg i inlandssjöfart (omröstning)
- a similar corpus: the Canadian Hansard (English–French)
- the Bible (or parts of it) in 100 languages: http://homepages.inf.ed.ac.uk/s0787820/bible/
- UN charters...
some examples of corpora: comparable
- Wikipedia (http://www.wikipedia.org, ...)
- linked article to article (several hundred languages)
- the texts are related but not (often) direct translations
- also contains useful semi-structured information
Sweden, officially the Kingdom of Sweden, is a Scandinavian country in Northern Europe. It borders Norway and Finland, and is connected to Denmark by a bridge-tunnel across the Öresund. At 450,295 square kilometres (173,860 sq mi), Sweden is the third-largest country in the European Union by area, with a total population of over 9.8 million. Sweden consequently has a low population density of 21 inhabitants per square kilometre (54/sq mi), with the highest concentration in the southern half of the country. ...
La Svèsia (449 964 km2; 9 082 995) la xe un Stado del nord de l'Eoropa inte la Penisola Scandinava. La so cavedal la xe Stocolma che La gà 765 044 abitanti. Altre cità importante le xe Göteborg, Uppsala, Malmö e Lund. La confina co la Norveja a nord-ovest e co la Finlandia a est; la xe coligada co la Danimarca traverso del Ponte del Øresund. La xe bagnà dal Mar Baltego. La ga na densità demografega pitosto bassa e despensada iregolarmente. El so teritorio el xe sior de legne, fero e aqua. I Svedesi i gode de un bon livel de vita ...
some examples of corpora: parent–child dialogues
- CHILDES (http://childes.psy.cmu.edu)
*NOR: did you wash your hands Jinny ?
*JIN: yeah .
*NOR: you did already ?
*JIN: already .
*NOR: you forgot the candy off from around your mouth huh ?
<n is wiping J's face>
*JIN: yeah .
*NOR: <laughs> okay <% N goes back to sit down> .
*JIN: do I look dirty ?
selection of a corpus
- top-down:
  - why? what is the purpose of the corpus?
  - what? what type of language are we interested in?
- bottom-up:
  - from where? what material can we access?
  - how? what is the method for gathering the material?
corpus as a sample
- a corpus is a sample in the statistical sense
- the situation is similar to carrying out an opinion poll
  - we select a representative, sufficiently large sample from a well-defined population
  - then we can carry out investigations with some degree of statistical certainty
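The opinion-poll analogy can be made concrete with a small calculation. The sketch below is an illustration added here, not something from the slides: it assumes a simple random sample and uses the normal approximation to the binomial to put an approximate 95% interval around an observed proportion. The function name and the counts are made up.

```python
import math

def proportion_interval(hits, sample_size, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical counts: 120 passive clauses among 1,000 sampled clauses.
low, high = proportion_interval(120, 1000)
print(f"95% interval: {low:.3f} to {high:.3f}")

# A larger sample with the same proportion gives a narrower interval.
low_big, high_big = proportion_interval(1200, 10000)
print(f"95% interval: {low_big:.3f} to {high_big:.3f}")
```

Words and clauses in running text are not independent draws, so intervals like these are likely to be too optimistic for real corpora; the point is only how sample size relates to certainty.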
what's the population? what's "representative"?
- the sample is representative if what's true about the sample is also true in general
- but: is the New York Times representative of "English in general"?
  - probably more meaningful to speak about whether it's representative of a genre
- in practice, it's hard to determine whether a corpus is representative
  - Clear (1992) Corpus sampling
  - Biber (1993) Representativeness in corpus design
- corpus collectors sometimes also try to make the corpus "balanced", which is equally hard to define
- it is probably more important to document the composition
the effect of sampling on NLP systems
- the way the corpus has been sampled also has implications in NLP
- at least in data-driven NLP systems that observe a corpus
- well-known example: the WSJ part of the Penn Treebank, often used for developing syntactic parsers in English
  - vocabulary effects
  - the relative frequencies of e.g. PoS tags differ between genres
  - constructions: for instance, questions are rare in the WSJ
example: the written part of the BNC
http://www.natcorp.ox.ac.uk
DOMAIN                    %
Imaginative           21.91
Arts                   8.08
Belief and thought     3.40
Commerce/finance       7.93
Leisure               11.13
Natural/pure science   4.18
Applied science        8.21
Social science        14.80
World affairs         18.39
Unclassified           1.93

TIME                      %
1960-74                2.26
1975-93               89.23
Unclassified           8.49

MEDIUM                    %
Book                  58.58
Periodical            31.08
Misc. published        4.38
Misc. unpublished      4.00
To-be-spoken           1.52
Unclassified           0.40
example: the spoken part of the BNC
REGION                    %
South                 45.61
Midlands              23.33
North                 25.43
Unclassified           5.61

CONTEXT-GOVERNED          %
Educational           20.56
Business              21.47
Institutional         21.86
Leisure               23.71
Unclassified          12.38

INTERACTION               %
Monologue             18.64
Dialogue              74.87
Unclassified           6.48
availability of text
- sometimes, we don't have the luxury of selecting a "representative" sample: we just have to take what we can get
  - for instance Finnish Romani, Runic Swedish, ...
- copyright issues:
  - published on the web ≠ freely available!
  - there is a risk that the work you do will be wasted
    - ESPC, the English–Swedish Parallel Corpus
    - Twitter corpora
- also, technical issues:
  - fiction is harder to access than web-published text (news, blogs, ...)
  - so at the Department of Swedish we have a huge amount of web data but much smaller amounts of fiction, academic text, etc.
example: the Koala corpus of contemporary Swedish
- in the recent Koala corpus, we selected five subcorpora where we were sure that the texts were legally kosher
  - blogs: the authors agreed to release their texts
  - fiction: we selected text under a Creative Commons license
  - European parliament proceedings: they are public
  - Wikipedia: everything under CC
  - government press releases
making the text usable for automatic tools
- to be able to carry out quantitative research about text, we need to add more information to the texts
  - information about the text: metadata
  - information inside the text: annotation
- but first, what do we mean by the "text"?
storing language in our computers
- plain text file: the file contains only the "letters"
  - this is what we tend to use when processing corpora
- rich text: not only the "letters", but also formatting information such as font, size, color
  - for instance Word, PDF, HTML, ...
  - will typically be converted into plain text before inclusion in a corpus
- other media: require complex conversion methods
  - scanned page: requires OCR
  - recording: requires speech recognition
Pierre Vinken, 61
years old, will
join the board as
nonexecutive director
Nov. 29.
the anatomy of a plain text file
- in a text file, the "letters" (the character symbols) are stored in a sequence
- the character symbols are defined by an international standard called Unicode
character encoding
- when the text is stored in a file (or transmitted over the Internet), the character symbols are converted into bytes (numbers)
- there are several encoding systems
  - nowadays, UTF-8 dominates: it can handle all of Unicode
  - a few years back, many language-specific encodings...
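A minimal sketch of the step from character symbols to bytes, using only built-in Python string methods. The example word is chosen here for illustration; Latin-1 stands in for the older language-specific encodings.

```python
text = "Umeå"

# Each character symbol has a Unicode code point (a number).
code_points = [ord(ch) for ch in text]

# The same characters become different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(code_points)         # one number per character
print(list(utf8_bytes))    # 'å' takes two bytes in UTF-8
print(list(latin1_bytes))  # 'å' takes one byte in Latin-1
```

The code points are the same whichever encoding is used; only the byte representation in the file differs.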
when we accidentally use the wrong encoding. . .
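A small illustration of what goes wrong, assuming the text was stored as UTF-8 and then read as Latin-1 (one common combination; the example word is chosen here):

```python
original = "Göteborg"

# Store the text as UTF-8 bytes, as most files do nowadays.
stored = original.encode("utf-8")

# Reading those bytes with the wrong encoding silently produces garbage.
misread = stored.decode("latin-1")

# If we know which wrong encoding was used, the damage can be undone.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)
print(repaired)
```

No error is raised in the misreading step, which is why such damage often goes unnoticed until someone looks at the text.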
are there letters for my language?
- http://www.unicode.org/charts
- most known writing systems (living and dead) are now standardized in Unicode
  - for instance, several kinds of runes (see http://www.unicode.org/charts/PDF/U16A0.pdf) but not some lesser-known types (e.g. Dalrunor)
- there are also a number of semi-standardized writing systems
some languages require more complex rendering
- Arabic is drawn from right to left, and the shape of the letter depends on its position in the word:
  kaf, teh, 'alef, beh → كتاب
- Indic scripts (e.g. Devanagari), and Hangul (Korean) combine vowel symbols and consonant symbols in irregular ways:
- but even in the Latin script, we use ligatures such as ﬁ and ﬂ
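Latin-script ligatures sometimes end up in corpus text as single characters, for instance after PDF extraction, and then a search for the plain letters misses them. A minimal sketch of one way to undo this, using Unicode NFKC normalization from the standard library (the example string is made up):

```python
import unicodedata

# Written with the single-character ligatures U+FB01 and U+FB02.
raw = "\ufb01rst \ufb02oor"

# NFKC normalization replaces compatibility characters such as ligatures
# with their ordinary letter sequences.
normalized = unicodedata.normalize("NFKC", raw)

print(len(raw), len(normalized))
print(normalized)
```

NFKC also rewrites other compatibility characters (superscripts, full-width forms and so on), so whether it is appropriate depends on the corpus.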
processing rich text formats
- rich text formats such as Word and PDF are difficult to handle for corpus processing tools, so we typically want to convert them into plain text
- examples:
  - pdftotext: PDF → plain text
  - wvtext: Word → plain text
- the output from these tools often needs a bit of polishing
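What "polishing" means depends on the tool and the document. As one hedged example, the sketch below assumes the converter output keeps the page's line breaks and end-of-line hyphenation, and repairs those two things; the function name and the sample text are hypothetical.

```python
import re

def polish(raw):
    """Rejoin words hyphenated at line ends and unwrap lines within paragraphs."""
    # "bor-\nders" -> "borders" (may wrongly join genuine hyphenated compounds)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # a single line break inside a paragraph becomes a space;
    # blank lines (paragraph breaks) are kept
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "It bor-\nders Norway and Finland,\nand is connected to Denmark.\n\nNext paragraph."
print(polish(sample))
```

Real converter output usually needs more than this (page headers, footnotes, columns), and the hyphen rule is a heuristic that should be checked against the data.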
using text from the web
- "boilerplate removal": http://code.google.com/p/boilerpipe
from image to text: Optical Character Recognition
⇒
1.
Lund – Halmstad.
Färden från Lund gjordes genom Engelholm till Margretetorp. Uppkomne på höjden af Hallandsås nedsände vi afskedstagande blickar öfver en bördig sträckning af Skåne. Engeltoftas välbyggda sätesgård med dithörande utbrytningar och skimrande kyrkor i förening med täcka lundar pryda den vackra slätten. ...
when OCR goes wrong. . .
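- From Dalpilen (1923): "Talet utmynnade i ett rungande teve för jubilaren, ännu så kraftfull och ungdomlig." (roughly: "The speech ended in a resounding teve for the jubilarian, still so vigorous and youthful"; teve means 'TV' and is presumably an OCR misreading of leve, 'cheer')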
when OCR goes wrong (again). . .
- let's look at the popularity of funk music using Google ngram viewer: http://books.google.com/ngrams
  http://languagelog.ldc.upenn.edu/nll/?p=2848
metadata: information about the document as a whole
- language
- creation or publication time
- the author:
  - native language
  - location
  - gender
  - age
- genre
- modality: written? spoken?
- topic classification (e.g. library classification system)
adding structure to the text
- to be able to do anything interesting with the text, we need to go beyond the letters and add some structure
- typically, we start by segmenting (splitting) the text into manageable pieces: sentences and words
- then, we add linguistic annotation: morphology, syntax, discourse, ...
sentence splitting and word tokenization
- the first step in text processing is typically that we split it into sentences and words (tokens)
- how do you think this could be done automatically?
- typical method for sentence splitting: look for end-of-sentence punctuation followed by a capital letter
- typical method for word segmentation (tokenization): look for letters between spaces or punctuation
- we also need to define what we mean by a "word"!
  - is cannot one word or two?
  - how about don't or Mary's?
  - how about clitics in Romance languages, e.g. Italian farcela?
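The two "typical methods" can be sketched with regular expressions. This is a rough illustration under simple assumptions, not a production tokenizer: the function names are made up, and keeping don't and Mary's as single tokens is just one possible answer to the "what is a word" question.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-ZÅÄÖ])", text.strip())

def tokenize(sentence):
    """Runs of word characters (with internal apostrophes or hyphens), or single punctuation marks."""
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", sentence)

text = "We don't know yet. Is cannot one word? Mary's corpus is small."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Other conventions split don't into two tokens and separate the possessive 's; whichever choice is made should be documented, since it affects every frequency count computed afterwards.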
tokenization and sentence splitting: tricky cases
- automatic tokenization and sentence splitting can be done in a quite reliable way for "normal" text
- but there are some corner cases...
  "... an account with the U.S. Treasury to buy Savings Bonds online ..."
  "... then I went back to the U.S. My dad and I moved ..."
- hyphenated words: should we remove the hyphen?
- tokenization in languages that don't use spaces is nontrivial:
  example "borrowed" from Liang Huang
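To see why the two U.S. examples are hard, here is a sketch that applies the simple rule (sentence-final punctuation, whitespace, capital letter) to both. The function name is hypothetical; the two strings are the fragments quoted on the slide.

```python
import re

def naive_split(text):
    # Sentence-final punctuation, then whitespace, then a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

no_boundary = "an account with the U.S. Treasury to buy Savings Bonds online"
boundary = "then I went back to the U.S. My dad and I moved"

print(naive_split(no_boundary))  # split after "U.S.", which is wrong here
print(naive_split(boundary))     # split after "U.S.", which is right here
```

Both fragments look identical to the rule, so a list of known abbreviations does not settle the matter either: telling the cases apart needs some knowledge of what typically follows.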
adding linguistic annotation
- if we want to carry out more complex linguistic investigation, we need linguistic annotation
  - part-of-speech tags: "this is a noun"
  - morphological analysis: "it is in the singular"
  - syntactic analysis: "it is the subject of the sentence"
  - ...
- the linguistic annotation can be added manually (high cost, high quality) or automatically (low cost, low quality)
- more about annotation in the next two lectures!
corpus statistics
- frequencies:
  - which word is most frequent?
  - which words are most typical of this corpus, compared to the "general" language?
  - development over time (neologisms, new senses)
  - words, phrases, syntactic constructions, ...
- co-occurrences:
  - which words tend to occur close to the word jazz?
  - which verbs take the noun cake as an object?
  - how do we normally translate the word case from English into Swedish?
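The first questions in each group can be answered with plain counting. A minimal sketch on a made-up toy corpus; the function name and the window size of two tokens are assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus, made up for illustration.
tokens = ("she plays jazz piano and he plays jazz guitar "
          "but they rarely listen to jazz radio").split()

# Frequencies: which word is most frequent?
frequencies = Counter(tokens)

def neighbours(tokens, target, window=2):
    """Count words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(context)
    return counts

print(frequencies.most_common(3))
print(neighbours(tokens, "jazz").most_common(3))
```

Raw neighbour counts favour words that are frequent everywhere; questions about verb and object pairs or translations need syntactic annotation or aligned parallel text rather than a window.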
concordancers
- a concordancer creates a concordance: a list of the contexts where a search term appears
- today, many concordancers are web-based
  - for instance, Korp http://spraakbanken.gu.se/korp
- the pioneer: Hugues de St. Cher (d. 1263) and 500 monks
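The core of a concordancer fits in a few lines. The sketch below is a toy keyword-in-context listing over a list of tokens, not how Korp or any other tool is implemented; the function name, the context width and the example text are made up.

```python
def concordance(tokens, term, width=3):
    """Return keyword-in-context lines: `width` tokens on each side of `term`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # right-align the left context so the search term lines up
            lines.append(f"{left:>30}  [{token}]  {right}")
    return lines

# Made-up example text.
tokens = "The corpus is small but the corpus is carefully annotated".split()
for line in concordance(tokens, "corpus"):
    print(line)
```

Tools for large corpora build an index first instead of scanning every token for each query.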
stand-alone concordance / search tools
- AntConc: http://www.laurenceanthony.net/software.html
- Wordsmith (Windows only): http://www.lexically.net/wordsmith/
- KH coder: http://khc.sourceforge.net/en/
purpose of concordances
- overview of the usage of a word: its senses, etc.
- "typical" usage, e.g. for examples in dictionaries
- common co-occurrences
- finding idioms and collocations
structural searches
- for instance TIGERSearch (lecture 3 and assignment)
- ANNIS (http://corpus-tools.org/annis/) is a web-based alternative
structural searches (2)
- another example with Korp:
time, location, . . .
- occurrences of the word kommunism ('communism') in Swedish newspapers 1820–1920
next lecture: annotation
- adding linguistic information to corpora: annotation
- describing the linguistic model
- managing the annotation project
- measuring annotation reliability
- quick survey of tools for manual and automatic annotation