Kivik 2013 Kilgarriff: Web corpora 1
Web Corpora
Adam Kilgarriff
You can’t help noticing
• Replaceable or replacable?
  – http://googlefight.com
• Very very large
  – 2006 estimates for duplicate-free, linguistic, Google-indexed web
    • German: 44 billion words
    • Italian: 25 billion words
    • English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
Overview
• Is the web a corpus?
• Representativeness
• What is out there?
  – Web1T
• Googleology
• Web corpus types
  – Targeted sites: Oxford English Corpus
  – General: WaC family
  – WebBootCaT
Is the web a corpus?
• Sinclair, in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”
  – “…not a corpus because
    • dimensions unknown, constantly changing
    • not designed from a linguistic perspective”
• But
  – We can find out dimensions
  – Many corpora are not designed
    • “as much chatroom dialogue as I can get”
• Def: a corpus is a collection of texts
  – when viewed as an object of language research
Is the web a corpus?
Yes
but it’s not representative
Theory
A random sample of a population is representative of it.
Observations on sample support inferences about population
(within confidence bounds)
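As an illustration of what "within confidence bounds" means, here is a minimal sketch. It assumes a simple random sample and the normal approximation to the binomial; the function name and the counts are invented for the example and are not from the talk.

```python
import math

def proportion_interval(hits, sample_size, z=1.96):
    """Approximate 95% confidence interval for a population proportion.

    Assumes a simple random sample; uses the normal approximation.
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical numbers: a construction seen in 120 of 10,000 sampled sentences.
low, high = proportion_interval(120, 10_000)
print(f"population rate is probably between {low:.4f} and {high:.4f}")
```

The interval only means something if the sample really is random with respect to a defined population.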
Theory
A random sample of a population is …
• What is the population?
  – production and reception
  – speech and text
  – copying
Theory
• Population not defined
• Representative sample not possible
Sublanguage
• Language = core + sublanguages
• Options for corpus construction
  – none
  – some
  – all
• None
  – impoverished view of language
• Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
• All: until recently, not viable
Representativeness
• The web is not representative
• but nor is anything else
• Text type variation
  – under-researched, lacking in theory
    • Atkins Clear Ostler 1993 on design brief for BNC; Biber 1988, Kilgarriff 2001
• Text type is an issue across NLP
  – Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
What is out there?
• What text types are there on the web?
  – some are new: chatroom
  – proportions
    • is it overwhelmed by porn? How much?
• Hard question
• The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
• we are well placed
  – a lot of people will be interested
• Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure
Using Search Engines
No setup costs
Start querying today
Methods
• Hit counts
• ‘snippets’
  – Metasearch engines, WebCorp
• Find pages and download
Googleology
• Google hit counts for language modelling
  – Example: (Keller & Lapata 2003)
  – 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
• Very interesting work
• Great interest in query syntax
The Trouble with Google
• not enough instances
  – max 1000
• not enough queries
  – max 1000 per day with API
• not enough context
  – 10-word snippet around search term
• sort order
  – search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, e.g. not lemmatised
  – aime/aimer/aimes/aimons/aimez/aiment …
• Appeal
  – Zero-cost entry, just start googling
• Reality
  – High-quality work: high-cost methodology
Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Googleology is bad science
Better: web-sourced corpora
• Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
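The filter, de-duplicate and processing steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions: pages are already downloaded as plain-text strings, "linguistic processing" is reduced to lowercasing and tokenising, and the function names and thresholds are invented for the example. Real pipelines use far more careful cleaning.

```python
import re

def looks_like_text(page, min_words=5, min_alpha_share=0.7):
    """Crude filter: keep pages with enough words, most of them alphabetic."""
    words = page.split()
    if len(words) < min_words:
        return False
    alphabetic = sum(1 for w in words if re.fullmatch(r"[^\W\d_]+[.,;:!?]?", w))
    return alphabetic / len(words) >= min_alpha_share

def build_corpus(pages):
    """Filter pages, drop exact duplicates, and tokenise what is left."""
    seen = set()
    corpus = []
    for page in pages:
        if not looks_like_text(page):
            continue
        # Normalise case and whitespace so trivially different copies match.
        normalised = " ".join(page.lower().split())
        if normalised in seen:
            continue
        seen.add(normalised)
        corpus.append(re.findall(r"\w+", normalised))
    return corpus

pages = [
    "The web is a very large source of language data.",
    "The web is a  very large source of language data.",  # copy with extra space
    "Home | Login | 404",                                  # navigation debris
]
corpus = build_corpus(pages)
print(len(corpus), corpus[0][:3])
```

Only the first page survives: the second is a duplicate after normalisation and the third fails the filter.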
Oxford English Corpus
• Whole domains chosen and harvested
  – control over text type
• 2.3 billion words
WaC family
• 1.5 B words each
• Baroni and colleagues
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
TenTen Family
• Processing chain
  – Spiderling, a linguistic crawler
    • A billion words a day
  – jusText for “cleaning”: removing non-text
  – Onion: remove duplicates (paragraph level)
• All major world languages
• 2-20 billion words
• Lexical Computing
• All available in Sketch Engine
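To make "remove duplicates (paragraph level)" concrete, here is a minimal sketch that keeps only the first occurrence of each paragraph across a set of documents. It is not Onion's implementation: it assumes paragraphs are separated by blank lines and matches them exactly after case and whitespace normalisation, and the function name is invented for the example.

```python
def drop_repeated_paragraphs(documents):
    """Keep only the first occurrence of each paragraph across all documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            key = " ".join(para.lower().split())
            if not key:
                continue  # skip empty paragraphs
            if key not in seen:
                seen.add(key)
                kept.append(para.strip())
        cleaned.append("\n\n".join(kept))
    return cleaned

docs = [
    "Cookie notice: we use cookies.\n\nCorpora are collections of texts.",
    "Cookie notice: we use cookies.\n\nKeywords show how corpora differ.",
]
cleaned = drop_repeated_paragraphs(docs)
print(cleaned[1])
```

Working at paragraph level means the repeated cookie notice is dropped from the second document while its unique paragraph is kept.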
Small, specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
BootCaT (Bootstrapping Corpora and Terms)
• Put in seed terms
• Google/Yahoo search
• Retrieve Google/Yahoo hits
  – Remove duplicates, boilerplate
• Small instant corpora
• Baroni and Bernardini, LREC 2004
• Web version
  – WebBootCaT
  – At Sketch Engine site
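The first step, turning seed terms into search queries, might look like the sketch below. Combining seeds into random tuples is an assumption made for this example, as are the function name, parameter names and seed words; the slide only says that seed terms are sent to a search engine. No search API is called here.

```python
import itertools
import random

def seed_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed terms into search queries, one random tuple per query."""
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) for combo in chosen]

# Hypothetical seed terms for a small corpus about corpus linguistics.
seeds = ["corpus", "lemma", "collocation", "concordance", "keyword"]
queries = seed_queries(seeds)
for query in queries:
    print(query)
```

Each query would then be submitted to a search engine, and the pages returned would be downloaded and cleaned.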
But did I make a good corpus?
Bad Science
• Ben Goldacre
• Biases in samples
  – A quarter of the people who tested positive had just been on holiday in Mexico
  – But the research team didn’t notice
Bad linguistics
• Our corpus study shows X
  – But what was in the corpus?
  – Moral:
    • Get to know your corpus
How?
• Read it?
• Too big to read
• Not designed to be read
How?
• Compare it with other(s)
• Keyword lists
UKWaC vs. enTenTen12
enTenTen vs. UKWaC
accord actually amendment among bad because behavior believe bill blog ca center citizen color defense determine do dollar earth effort election even evil fact faculty favor favorite federal foreign forth guess guy he her him himself his honor human kid kill kind know labor law let liberal like man maybe me military movie my nation never nor not nothing official oh organization percent political post president pretty professor program realize recognize say shall she sin soul speak state suppose tell terrorist that thing think thou thy toward true truth unto upon violation vote voter war what while why woman yes
accommodation achieve advice aim area assessment available band behaviour building centre charity click client club colour consultation contact council delivery detail develop development disabled email enable enquiry ensure event excellent facility favourite full further garden guidance guide holiday improve information insurance join link local main manage management match mm nd offer opportunity organisation organise page partnership please pm poker pp programme project pub pupil quality range rd realise recognise road route scheme sector service shop site skill specialist st staff stage suitable telephone th top tour training transport uk undertake venue village visit visitor website welcome whilst wide workshop www
enTenTen vs. UKWaC
• Core verbs
  – be determine do guess know let say shall suppose tell think
• Pronouns
  – he her him his me my she
• Biber: more informal
Judgements
• Not all or nothing
  – Both have (lots of) AmE and BrE
  – Observing patterns
    • Not right or wrong
• Where does ‘believe’ belong?
  – Bible or core verbs?
  – No right answer, could be both
• The better you know the data, the better you understand why words are there
The maths
“this word is twice as common here as there”
• Simplest approach
  – Normalise frequencies
    • Per thousand, or per million
  – Take ratio
• For examples
  – Assume two 1m-word corpora
    • Normalisation not needed
  – fc = focus corpus
  – rc = reference corpus
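When the two corpora are not the same size, the normalise-then-divide step can be sketched as below. The function names and the counts are invented for this example; only the per-million normalisation and the ratio come from the slide.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size

def frequency_ratio(fc_count, fc_size, rc_count, rc_size):
    """Ratio of normalised frequencies: focus corpus over reference corpus."""
    return per_million(fc_count, fc_size) / per_million(rc_count, rc_size)

# Hypothetical counts: 300 hits in a 2m-word focus corpus,
# 50 hits in a 1m-word reference corpus.
ratio = frequency_ratio(300, 2_000_000, 50, 1_000_000)
print(ratio)  # 150 per million against 50 per million
```

As written, the function raises ZeroDivisionError when the word is absent from the reference corpus.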
Problem 1: You can’t divide by zero
word       fc     rc   ratio
buggle     10     0    ?
stort      100    0    ?
nammikin   1000   0    ?
• Standard solution: add one
word       fc     rc   ratio
buggle     11     1    11
stort      101    1    101
nammikin   1001   1    1001
• Problem solved
Problem 2: High ratios more common, less interesting for rarer words
word   fc     rc    ratio   interesting?
spug   10     1     10      no
grod   1000   100   10      yes
• ratio is not enough: frequency matters too
Also
• some researchers: grammar, grammar words
• some researchers: lexis, content words
No right answer
Slider?
Solution
• Don’t just add 1, add n:
• n=1
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       11      1       11.00   1
middling    200     100     201     101     1.99    2
common      12000   10000   12001   10001   1.20    3
• n=100
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       110     100     1.10    3
middling    200     100     300     200     1.50    1
common      12000   10000   12100   10100   1.20    2
• n=1000
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       1010    1000    1.01    3
middling    200     100     1200    1100    1.09    2
common      12000   10000   13000   11000   1.18    1
Summary
word        fc      rc      n=1   n=100   n=1000
obscurish   10      0       1st   3rd     3rd
middling    200     100     2nd   1st     2nd
common      12000   10000   3rd   2nd     1st
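The add-n score and the resulting rankings can be reproduced with a few lines of code. This is a minimal sketch that assumes the frequencies are already normalised (two 1m-word corpora, as in the examples); the function names are invented for the illustration.

```python
def keyness(fc, rc, n):
    """Add-n ratio: (fc + n) / (rc + n), with fc and rc already normalised."""
    return (fc + n) / (rc + n)

def rank_words(freqs, n):
    """Return the words sorted from highest to lowest add-n ratio."""
    return sorted(freqs, key=lambda w: keyness(*freqs[w], n), reverse=True)

# (fc, rc) pairs taken from the example tables.
freqs = {
    "obscurish": (10, 0),
    "middling": (200, 100),
    "common": (12000, 10000),
}

rankings = {n: rank_words(freqs, n) for n in (1, 100, 1000)}
for n, ranked in rankings.items():
    print(n, ranked)
```

A small n pushes rare words to the top of the list, and a large n favours common words, which is the "slider" behaviour the parameter is meant to provide.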
But what about
• Mutual information
• Log-likelihood
• Chi-square
• Fisher’s test
• …
• Don’t they use cleverer maths?
Yes but
• Clever maths is for hypothesis testing
  – Can you defeat null hypothesis?
• Language is not random, so
• … you always can
• Null hypothesis never true
• Hypothesis-testing not informative
• Clever maths irrelevant
  – Kilgarriff 2006, CLLT
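One way to see why the test stops being informative is to hold a small frequency difference fixed and let the corpora grow. The sketch below is an illustration with invented numbers, not an analysis from the talk: a word occurs 110 times per million in the focus corpus and 100 times per million in the reference corpus, and a hand-written Pearson chi-square for a 2x2 table is compared with the standard 5% critical value for one degree of freedom.

```python
def chi_square_2x2(fc_count, fc_size, rc_count, rc_size):
    """Pearson chi-square for 'this word' vs. 'all other words' in two corpora."""
    a, b = fc_count, fc_size - fc_count
    c, d = rc_count, rc_size - rc_count
    total = a + b + c + d
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

# Same relative frequencies throughout (110 vs. 100 per million);
# only the corpus size changes.
for millions in (1, 10, 100):
    size = millions * 1_000_000
    stat = chi_square_2x2(110 * millions, size, 100 * millions, size)
    print(f"{millions}m words each: chi-square {stat:.2f}, significant: {stat > CRITICAL_05}")
```

The same 10% difference is not significant at one million words per corpus but is at ten million, so with web-scale data the test says more about corpus size than about the word.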
Varying the parameter
• BAWE
  – British Academic Written English
    • Nesi and Thompson 2008
  – Student essays
    • Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
  – fc: ArtsHum, rc: SocSci
  – With n=10 and n=1000
Parameters for keyword lists
• Lemmas
  – Could be word forms, word classes
• Simplemaths
  – (default: 100, for mix of lexical and grammar words)
• Only all-lowercase-letters
  – Could allow uppercase, or any at all
• Minimum 2/3/4 characters
  – Helps get words, not abbreviations etc
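The lowercase-only and minimum-length settings amount to a simple filter on candidate items before they are scored. The sketch below is an assumption about how such a filter could look, not Sketch Engine's code; the function name, parameter names and word list are invented, and the lowercase test covers ASCII letters only.

```python
import re

def is_candidate(item, lowercase_only=True, min_length=3):
    """Decide whether an item may appear in the keyword list."""
    if len(item) < min_length:
        return False
    if lowercase_only:
        # ASCII lowercase letters only; an assumption suited to English.
        return re.fullmatch(r"[a-z]+", item) is not None
    return item.isalpha()

items = ["organise", "uk", "Leeds", "www", "mm", "3rd", "colour"]

print([w for w in items if is_candidate(w)])
print([w for w in items if is_candidate(w, lowercase_only=False, min_length=5)])
```

Allowing uppercase and raising the minimum length lets proper names such as "Leeds" through while still excluding short abbreviations.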
enTenTen vs. UKWaC
With parameters:
• Simplemaths: 10
• Uppercase and lowercase
• Minimum length = 5 (to exclude acronyms)
Obama Clinton Hillary McCain
Centre Leeds Manchester Edinburgh
Two interlocking questions
• How do two corpora differ
  – enTenTen vs. UKWaC
  – Interpret as:
    • Differences of corpus compilation procedures
    - and/or -
    • Differences of proportions of text types
• How do two text types differ
  – BAWE example
    • Arts/humanities essays vs. Social Sciences essays
  – Any other corpus differences
    • Unwanted biases
    • But we need to know about them
• Don’t do bad science
• Get to know your corpus
  – Compare with others
    • Qualitatively: keyword lists
    • (Quantitatively: distances)
• No excuses
  – The Sketch Engine does all the technical work for you
• The joy of research