Post on 28-Apr-2018
Brian Walker University of Huddersfield
Training Schedule – day#1
09:00 Lectures: Introduction to corpus linguistics: history, terminology and methodology
11:00 Break
11:15 Lecture and practical sessions: What can you do with a corpus? Introducing AntConc.
12:45 LUNCH
14:00 Practical sessions: AntConc
15:30 Break
15:45 Lecture: Building a corpus – theory and practice
16:45 Round-up / Q&A.
17:00 FINISH
Training Schedule – day#2
09:00 Lecture: Mark-up and annotation.
10:00 Practical session: adding mark-up and annotation
11:00 Break
11:15 Practical session: exploiting mark-up and annotation
12:30 LUNCH
13:30 Practical session: further exploitation of mark-up and annotation.
15:30 Break
15:45 Lecture: Using corpus tools to explore a comic strip
16:45 Round-up / Q&A.
17:00 FINISH
Introduction
We will look at ...
• Basic concepts and terminology.
• A little bit of history.
• What you can do with a corpus.
What is Corpus Linguistics?
Corpus linguistics uses naturally occurring language data for linguistic analysis.
• Large samples of language data
• Machine readable
• Computer software
What is corpus linguistics?
• Large samples of language data = corpus.
• Latin corpus: ‘body’ (plural corpora)
• Put simply: a corpus is a ‘body’ of text
• Uses a corpus.
• Uses computers for analysis (not always the case).
• Empirical – analysing actual patterns of language use.
• Depends on quantitative and qualitative analytical techniques.
Biber, Conrad & Reppen (1998: 4)
Early Corpus Linguistics
• Field linguistics
• Boas’s studies of Native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results being derived from data
[Portraits: Franz Boas, Leonard Bloomfield, Charles Hockett, Zellig Harris]
While until about 1880 investigators confined themselves to the collection of vocabularies and brief grammatical notes, it has become more and more evident that large masses of texts are needed in order to elucidate the structure of languages. (Boas 1917: 1)
Principles of Chomskyan linguistics
• Homogeneous underlying system of language
• Describe the language of the ideal speaker/hearer
• Focus on linguistic competence as opposed to linguistic performance
Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. (Chomsky, quoted in Andor 2004: 97)
Problems with intuition
Issue of acceptability
• I was 19 when I started university
• I were 19 when I started university
Impossibility of studying certain aspects of language without recourse to corpus data
• Historical linguistics
• Language change/variation
• Language acquisition
…this [intuition] is a very strange notion of data. Normally one expects a scientist to develop theories to describe and explain some phenomena which already exist, independently of the scientist. One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory. (Stubbs 1996: 29)
Methodology vs. theory
Two views:
Methodologist
CL is a methodology for studying large amounts of language data using computer software
Neo-Firthian
CL is a sub-discipline of linguistics, concerned with explaining relationships between meaning and structure in language
What is a corpus?
• Machine-readable form
• Very large
• Representative sample
• (Standard reference)
• Often annotated
McEnery and Wilson (2001)
Machine readable form
• Nowadays, corpus = machine readable
• Corpora tend to sit on a computer
• Not always the case
Very large
• Corpora are usually very large: tens of thousands, hundreds of thousands, or millions of words.
• Usually a finite size
• Size decided at design stage – when size reached, data collection stops.
• Exception – monitor corpus
– E.g. COBUILD Corpus (Birmingham, UK)
– Dictionary compiling
A representative sample
• Corpora are so big that they can be a ‘representative sample’ of a language or a language variety
• Also depends a lot on design of corpus
– (more later)
(Standard reference)
• A corpus might be a standard reference or a ‘benchmark’ for a particular variety of language against which other texts or corpora can be compared
Annotation
• Just the words on their own = ‘raw text’.
• Annotation = extra information about what is in the corpus.
• Can help with the analysis of the data.
Annotation
Information about the text:
• Where it came from
• Who produced it
• Genre
• Etc.
Annotation
Adding information to the body of the text:
• e.g. gender of speaker;
• e.g. different parts of a text (headline, main story);
• e.g. the grammatical function of each word.
Annotation
• Annotation can be a manual process (takes ages)
• But some linguistic annotation can be done automatically
– e.g. word meaning (semantic)
– e.g. grammatical class of each word in the corpus (noun, verb, etc.)
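To make the value of grammatical annotation concrete, here is a minimal Python sketch. It assumes a `word_TAG` notation and an invented tag set (DET, NOUN, VERB and so on); neither is taken from a particular tagger or corpus, and the sentence is made up for illustration.

```python
from collections import Counter


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the last underscore, so words containing '_' survive.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


# Hypothetical tagged sentence; the tag names are invented for this example.
tagged = "The_DET actors_NOUN act_VERB in_PREP the_DET final_ADJ act_NOUN"
pairs = parse_tagged(tagged)

# With tags, the two uses of 'act' can be counted separately.
act_tags = Counter(tag for word, tag in pairs if word.lower() == "act")
print(act_tags)
```

In raw text both occurrences of "act" would be counted as the same word; the annotation lets an analysis separate the verb from the noun.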
A corpus
• …a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration.
(McEnery & Wilson 2001: 32)
A brief history of corpora
The Survey of English Usage
• Instigated 1959 by Randolph Quirk at University College London
• One million words of written and spoken British English, made up of 200 text samples of 5000 words each
• Electronic version of the spoken data produced in collaboration with Lund University: the London-Lund Corpus
• Manually annotated for prosodic and paralinguistic features
• Grammatical structures for each text sample recorded on file cards
• Searching the corpus meant a trip to the Survey offices to search through filing cabinets of data!
Building the Brown corpus
• The Brown Corpus
– Built by Nelson Francis and Henry Kučera at Brown University, USA
– One million words of written American English (1961), made up of 500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
– The most frequent word in a corpus is approximately twice as frequent as the second most frequent, and three times as frequent as the third most frequent, etc.
– Frequency is inversely proportional to rank
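As a rough illustration of the rank-frequency relationship described above, the following Python sketch builds a ranked word list and compares each observed frequency with the value Zipf's law would predict (the top frequency divided by the rank). The sample text, the tokenisation rule and the function names are illustrative assumptions, not part of the Brown Corpus work.

```python
import re
from collections import Counter


def ranked_frequencies(text):
    """Return (word, frequency) pairs sorted from most to least frequent."""
    # Crude tokenisation (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()


def zipf_expected(top_frequency, rank):
    """Frequency predicted by Zipf's law for a given rank (1 = most frequent)."""
    return top_frequency / rank


# Toy text only; a meaningful test needs a large corpus.
sample = "the cat and the dog and the bird saw the cat"
ranked = ranked_frequencies(sample)
top = ranked[0][1]
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, round(zipf_expected(top, rank), 2))
```

On a text this small the fit is only suggestive; the pattern becomes visible once the word list comes from hundreds of thousands of words or more.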
Extending the Brown family
• 1970-78: LOB
– Built by Geoffrey Leech and colleagues at Lancaster University
– One million words of written British English (1961), made up of 500 text samples of 2000 words each
• FROWN: Written American English from 1991
• FLOB: Written British English from 1991
• BE06: Written British English from early years of 21st century
• LOBalike: Written British English from 2011
Further developments: the British National Corpus (BNC)
• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British English
• Written data
– 90 million words
– Includes extracts from newspapers, academic books, popular fiction, letters and university essays
• Spoken data
– 10 million words
– Includes demographic data and context-governed data
– The demographic part: transcripts of about 900 everyday unscripted spoken conversations
– The context-governed part: spoken language collected in public contexts, e.g. radio phone-ins, government meetings, classroom interactions
Making sense of meaning
• COBUILD project initiated at Birmingham in 1980 - resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and Robert Daley analysed a small corpus of spoken and written English to investigate the relationship between words and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation
What can you do with a corpus?
Frequency analysis
• A simple statistical measure that can:
– offer an insight into how often particular words are used in a data set;
– be indicative of the overriding concerns expressed in a text;
– be used to investigate lexical change across time or differences between texts.
Frequency analysis – example:
[Figure: The changing frequency of choice in party political manifestos, 1945-2010. Horizontal axis: manifesto years 1945, 1950, 1951, 1955, 1959, 1964, 1966, 1970, 1974 (Feb), 1974 (Oct), 1979, 1983, 1987, 1992, 1997, 2001, 2005, 2010. Vertical axis: 0 to 0.3.]
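Comparing a word's frequency across texts of different lengths normally relies on relative rather than raw frequencies. The sketch below is a minimal illustration in Python, assuming a simple regular-expression tokeniser and two invented snippets; it does not use the manifesto data, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(word, text, per=100):
    """Occurrences of `word` per `per` tokens of `text` (per=100 gives a percentage)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] / len(tokens) * per


# Hypothetical snippets standing in for two texts from different years.
texts = {
    "text A": "We offer more choice for parents and more choice for patients",
    "text B": "We will build new homes and create new jobs",
}
for label, text in texts.items():
    print(label, round(relative_frequency("choice", text), 2))
```

Normalising by text length is what makes the figures comparable: a long text will usually contain more raw occurrences of any word than a short one.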
Concordances
• A concordance is a list of all the occurrences of our target word, each shown in its immediate context.
• Concordances are helpful because they can allow us to see patterns of language use.
• Corpus linguistic software is used to generate concordance lines.
Concordances
to shake, honey blonde hair cascading over slim shoulders. The girl just has to laugh. She's talk
hen I have a bath,' she laughed. Anthea - tall, slim and breathtakingly pretty - is nearly ten years
wigs, as he calls her, would want to drag this slim six-footer off the street. He had the same
s about six foot tall and very attractive. She is slim with blonde hair and looks like a catwalk mod
Amanda, of Chadwell Heath, Essex, had hoped to slim before the first wedding. Then she discovered
t again, Joy walked over to the table with two slim girls seated there. They were arguing about
the left read' These seats are for those of very slim build only'. Joy was now standing, reading the
d even though their chance of happiness was slim and Wickham was disliked, the marriage still
with nine centuries. Six feet (1.8 metres) tall, slim and athletic, his right-handed batting was less
to his two companions.' Aye,' replied another, slim and small as a child but with a face centuries
hree young children. She was dark-haired and slim , 32 years of age and pleasant. Her husband,
I could see no special contact between them. Slim in her dungarees, with her long, curly, chest
ecute impressively unimpressive water-tricks: slim , brown sprites. We return to the village by a
and fat. The other was a Sikh, very small and slim . They looked like a comic turn. There was a
give up totally, but settled for three or four slim cigars a day instead of ten to fifteen cigarette
built-in disappointments, ounce by ounce the slim frame turned to flab, and in the end Baxter
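Concordance lines like those above are normally produced by dedicated software, but the underlying idea is simple. The following Python sketch is a minimal key-word-in-context routine; the function name, the fixed character window and the sample sentences (loosely adapted from the lines above) are assumptions made for illustration.

```python
import re


def concordance(text, target, width=30):
    """Return key-word-in-context lines for every occurrence of `target`."""
    lines = []
    # \b restricts matches to whole words; matching ignores case.
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the target words line up in a column.
        lines.append(f"{left:>{width}} {match.group()} {right}")
    return lines


sample = ("She is slim with blonde hair. Their chance of happiness was slim. "
          "He settled for three or four slim cigars a day.")
for line in concordance(sample, "slim", width=25):
    print(line)
```

Cutting the context at a fixed number of characters is why concordance lines often begin and end mid-word.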
Collocations
• Collocation = relationship between words that tend to occur together
– Words that tend to occur near word X are the collocates of word X
– Based on frequencies
– Statistical measures
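The two ingredients listed above, co-occurrence frequencies and a statistical measure, can be sketched in a few lines of Python. The code below counts words within a fixed span of a node word and then computes a simplified mutual-information score (observed co-occurrences against those expected by chance). The span, the simplified formula and the toy sentence are assumptions; corpus tools offer several measures and usually correct for window size.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


def mutual_information(tokens, node, collocate, span=4):
    """log2 of observed co-occurrences over those expected by chance.

    Simplified: the expected value ignores the window size, so scores are
    only comparable between collocates found with the same span.
    """
    freq = Counter(tokens)
    observed = collocates(tokens, node, span)[collocate]
    if observed == 0:
        return float("-inf")
    expected = freq[node] * freq[collocate] / len(tokens)
    return math.log2(observed / expected)


# Invented sentence, far too small for real conclusions.
sample = ("juvenile crime rose while juvenile offenders went to court "
          "and the young hopefuls went to school")
tokens = tokenize(sample)
print(collocates(tokens, "juvenile", span=2).most_common(3))
print(round(mutual_information(tokens, "juvenile", "crime", span=2), 2))
```

A positive score means the pair co-occurs more often than chance would predict; with very low frequencies such scores are unreliable, which is one reason large corpora matter.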
Collocations
• Important in corpus linguistics.
• The company a word keeps can give that word implicit associations or assumptions.
• ‘You shall know a word by the company it keeps.’ (Firth 1957: 11).
Collocations
• Juvenile = young, youthful, a young person
– Collocates: delinquency, delinquent, delinquents, offenders, diabetes, crime, court
• Juvenile has negative associations
• Semantic prosody
Collocations
• Near-synonyms often differ in terms of their collocations
Collocations
• Young
– Collocates: mums-to-be, bloods, nubile, hopefuls, impressionable, up-and-coming
• Negative associations?
Collocations
Bart You're up to something, aren't ya?
Homer No! I'm just going out to commit certain deeds.
Collocations
s. <p/> A39 57 A39 58 <h_><p_>The write way to commit murder<p/> A39 59 <p_><quote_>"Advice and inform
of God is manifested. <tf_>Kill, D03 78 rob and commit adultery<tf/> are all deeds forbidden in the D03
0 of a religious sect who orders his followers to commit suicide.<p/> D11 131 <p_>"God, permitting the mir
bility of episcopal ordination"<quote/> would not commit the D17 47 Methodist Church to the view that th
7 article. Take care though: don't let your words commit an editor to E10 98 using a specific picture, w
theory and deconstruction is such as to G67 189 commit the reasoner to defending certain values.<p/> G67
ithin the Service about offenders who continue to commit H09 191 crime while on bail<p/> H09 192 <p_>Whil
4m ($45m).<p/> H27 148 <p_>However, it would only commit itself to a forecast of H27 149 maintained sales
democracy from collapse, but this was to J41 142 commit <quote_>"a common fallacy in social thought which
45 163 the effort levels that they are willing to commit. Let contracts J45 164 with regard to effort be
2 <p_>Her cheeks flushed crimson and he strove to commit to memory P08 53 the lovely colour as the blood
222 never took the slot, although he did briefly commit to an ROTC A10 223 program before putting his na
1894.<p/> A26 13 <p_>Commissioners hesitated to commit themselves after one of the A26 14 monument's c
ote/> <quote_>"Cold Feet A32 243 - Why Men Won't Commit"<quote/> and <quote_>"Letting Go and Moving A32
B13 92 addressed men who use drugs or those who commit adultery, and who B13 93 get AIDS and other ven
ceeds rational basis. Since urban blacks B17 61 commit more crime proportionately (although not numerica
us consequences. Mr. C12 185 Deng was hounded to commit suicide in 1966 and his criticism is now C12 186
nd Jodie squabble C13 199 because he's afraid to commit to marriage.<p/> C13 200 <p_>Social issues, too,
ue <quote_>"is to do something about it, i.e., to commit oneself D03 187 to a way of life ..."<quote/><p/
n objective theistic D03 192 statement; it is to commit oneself to living life and to D03 193 understand
to be silly and trivial, because I don't want to commit D06 180 an overt, nonrational act and I don't wa
WN:E28\><h_><p_>SANITATION<p/> E28 2 <p_>HOW TO COMMIT BIOCIDE<p/> E28 3 <p_>In the strictest sense, s
an <tf|>offensive F04 52 position. That is, to commit to an aggressive daily-action plan F04 53 desig
form drives 15 F11 31 percent of its victims to commit suicide. (For a list of symptoms, F11 32 see 'A
, artificial persons make decisions that F37 23 commit other people. At the same time, the power to spea
act G22 13 open to us now would be unjust is to commit ourselves to avoiding G22 14 it. But what of pa
H08 57 exploiting the Gulf war as a pretext to commit terrorism.<p/> H08 58 <p_>While we can be proud
p/> H09 52 <p_>First, we must get the people who commit crimes out of the H09 53 community, and we must
<p_>And, it increases penalties for criminals who commit gun H09 69 offenses.<p/> H09 70 <p_>We have no
rease the penalties on those who use such guns to commit H09 120 crimes.<p/> H09 121 <p_>Mr. President, I
y requiring grantees H26 155 in most programs to commit their own funds for a portion of the H26 156 cos
to do with its value; to think so is to J30 27 commit a genetic fallacy. After I wrote this, I came acr
ereas J43 34 disengaged delinquents are free to commit a variety of illegal J43 35 activities, such fr
on a particular illegal J43 38 possibility. Why commit anti-gay violence versus rape or armed J43 39 r
_>Hitler understandably regarded people who could commit such J56 150 acts against Britain as his natural
rt with this J58 131 their so natural Right, but commit onely<&|>sic! the Administration J58 132 of such
lives K23 172 were before us. Rarely did anyone commit suicide. Here, hundreds of K23 173 people sit, w
asked Michael. <quote_>"Did you want P17 102 to commit suicide?"<quote/><p/> P17 103 <p_><quote_>"Oh, no
• We might be able to guess some collocates intuitively, but corpus tools can help us identify others.
• If we suspect a word combination to be unusual or deviant, we can check our intuition by looking at the collocates for that word in a general corpus.
Collocations
[Context: This extract occurs towards the end of the sketch. Alan enters and very quickly summarises how the war finally came to an end.]
Alan (entering) But the tide was turning, the wicket was drying out. It was deuce – advantage Great Britain. Then America and Russia asked if they could join in, and the whole thing turned into a free-for-all. And so, unavoidably, came peace, putting an end to organised war as we knew it.
(Alan Bennett, Beyond the Fringe)
Collocations
• The most common post-modifying collocate of organised is crime
• In the BNC, crime appears as a collocate of organised 61 times in 41 texts
• War appears as a collocate of organised only once, and in this one instance organised appears after war:
– When this war broke out organised Labour in this country lost the initiative (CE7565)
• Organised war is an unusual co-occurrence
• Why? Does it take on some meaning from the contexts in which it is habitually used, which convey an attitude or stance of the author?
Keywords
• A keyword is a word which occurs in a text or corpus more frequently than you would expect by chance alone
– … based on comparison with another (benchmark) corpus (e.g. the BNC)
– … and the difference has to be statistically significant
Keywords
[Diagram: Text #1 wordlist and Text #2 wordlist feed into a comparison process, which produces a key words list]
• Comparison process: apply a statistical test (e.g. Log Likelihood); the difference must be statistically significant
• Key words list: the over-represented (and under-represented) words in text #1 when compared with text #2
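The comparison process can be sketched in Python as follows. This is a minimal illustration rather than the procedure of any particular tool: it uses the log-likelihood calculation commonly applied to word-frequency comparison, takes 3.84 as the cut-off (the chi-square critical value for p < 0.05 with one degree of freedom), and assumes both texts are non-empty. The function names and the two artificial texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word seen a times in c tokens and b times in d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in text 1
    e2 = d * (a + b) / (c + d)  # expected frequency in text 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(text1, text2, threshold=3.84):
    """Words whose frequency differs significantly between text1 and text2.

    Returns (word, score, 'over' | 'under') tuples, highest score first;
    'over' means over-represented in text1 relative to text2.
    """
    t1, t2 = tokenize(text1), tokenize(text2)
    f1, f2 = Counter(t1), Counter(t2)
    results = []
    for word in set(f1) | set(f2):
        score = log_likelihood(f1[word], f2[word], len(t1), len(t2))
        if score >= threshold:
            over = f1[word] / len(t1) > f2[word] / len(t2)
            results.append((word, score, "over" if over else "under"))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Artificial texts: 'choice' is much more frequent in text1 than in text2.
text1 = "choice " * 20 + "the " * 80
text2 = "choice " * 2 + "the " * 98
for word, score, direction in keywords(text1, text2):
    print(word, round(score, 2), direction)
```

Only words that pass the threshold are reported, so a word that is merely frequent in both texts does not count as key.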
Keywords
• A text’s keywords often point towards its content or its biases and/or can act as style markers (Enkvist 1973)
• Keywords are often a good guide to what would be interesting to look at in more detail
Why use a corpus?
• Allows linguists to access quantitative information about language, which can often be used to support qualitative analysis.
• Insights into language gained from corpus analysis are often generalisable in a way that insights gained from the qualitative analysis of small samples of data are not.
• We can look at patterns in large bodies of texts to identify language trends and tendencies, and test our intuitions.
Why use a corpus?
• Using corpus data forces us to acknowledge how language is really used (which is often different from how we think it is used)
• Computer analysis can also reveal atypical or unusual uses (relative to some norm), which may not be possible to observe through manual analysis
Some potential problems
• To give reliable results which will lead to useful findings, a corpus needs to be representative of the language data we are interested in (more on this later).
• Availability/difficulty of collecting data (e.g. historical texts; transcribing spoken texts).
• Accuracy of automated processes can vary with different tools and text-types.
Looking ahead
• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Inter-disciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems. This involves extracting place-names from a corpus, searching for their semantic collocates and creating maps to allow users to visualise how concepts such as war and money are distributed geographically (Gregory and Hardie 2011)
Summary
The basic idea:
• By analysing VERY large amounts of textual data, we can ...
• establish norms about the variety of language being studied
• test theories about language
• spot common and rare language phenomena
• reduce bias
Summary
The computer can’t do it all for us – we still have to analyse the results and ask ...
‘What does it all mean?’
Any questions so far?
References
• Andor, J. (2004) ‘The master and his performance: an interview with Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.
• Boas, F. (1917) ‘Introduction’, International Journal of American Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language and Culture, pp. 199-210. New York: The Free Press.]
• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together corpus linguistics and Geographical Information Systems’, Literary and Linguistic Computing 26(3): 297-314.
• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of the 6th Language Resources and Evaluation Conference, Palais des Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30th May.
• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.