INFO 340
Information Retrieval
Words in a Language – Cardinality and Enumeration
What do we know about words in a language?
• Consider some language and all of the words in that language
  – We can count them
  – We can identify the probability of occurrence of any particular word
  – We can categorize them according to some scheme
  – We can order them according to some scheme
Body of Words Reference Point
• Reference point for these words
  – A ‘corpus’
    • A particular body of text
    • Usually documents of text
• What is a document?
  – Wiki says (at least on Feb 5, 2009):
    • “A document (noun) is a bounded physical representation of body of information designed with the capacity (and usually intent) to communicate.”
Corpus
text (words) within documents
documents within corpus
Enumerating a Corpus
• We can count:
  – the total # of words within a corpus
  – the total # of documents within a corpus
  – the number of occurrences of any particular word within a document
  – the number of occurrences of any particular word in the entire corpus
  – how often certain words are in proximity to each other – word X is Y words from word Z
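The counts listed above can be sketched in a few lines of Python. This is only an illustration: the three-document toy corpus, the whitespace tokenizer, and the helper name `proximity_count` are assumptions made for the example, not part of the course material.

```python
from collections import Counter

# Hypothetical toy corpus: each string stands in for one document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog sat",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

docs = [tokenize(d) for d in corpus]

# Total number of documents and of words (tokens) in the corpus.
total_documents = len(docs)
total_words = sum(len(d) for d in docs)

# Occurrences of each word within each document, and in the entire corpus.
per_document_counts = [Counter(d) for d in docs]
corpus_counts = Counter(w for d in docs for w in d)

def proximity_count(tokens, x, z, y):
    """Count positions where word x is followed by word z exactly y words later."""
    return sum(
        1
        for i in range(len(tokens) - y)
        if tokens[i] == x and tokens[i + y] == z
    )

print(total_documents, total_words)
print(per_document_counts[0]["the"], corpus_counts["the"])
print(proximity_count(docs[0], "the", "mat", 1))
```

A real corpus would need a more careful tokenizer (punctuation, case, lemmatisation), but the counting itself stays this simple.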
British National Corpus
• From http://www.natcorp.ox.ac.uk/corpus/index.xml (non-link underlines added)
• What is the BNC?
• The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.
• The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
• The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.
• Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres).
• Full technical documentation covering all aspects of the BNC including its design, markup, and contents are provided by the Reference Guide for the British National Corpus (XML Edition). For earlier versions of the Reference Guide and other documentation, see the BNC Archive page.
• From http://www.natcorp.ox.ac.uk/corpus/index.xml
• What sort of corpus is the BNC?
– Monolingual: It deals with modern British English, not other languages used in Britain. However non-British English and foreign language words do occur in the corpus.
– Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it.
– General: It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.
– Sample: For written sources, samples of 45,000 words are taken from various parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.
British National Corpus
Top 50 most frequently occurring
From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)
Rank  Frequency  Word
1     6187267    the
2     4239632    be
3     3093444    of
4     2687863    and
5     2186369    a
6     1924315    in
7     1620850    to
8     1375636    have
9     1090186    it
10    1039323    to
11    887877     for
12    884599     i
13    760399     that
14    695498     you
15    681255     he
16    680739     on
17    675027     with
18    559596     do
19    534162     at
20    517171     by
21    465486     not
22    461945     this
23    459622     but
24    434532     from
25    433441     they
26    426896     his
27    384313     that
28    380257     she
29    373808     or
30    372031     which
31    364164     as
32    358039     we
33    343063     an
34    333518     say
35    297281     will
36    272345     would
37    266116     can
38    261089     if
39    260919     their
40    249540     go
41    249466     what
42    239460     there
43    230737     all
44    220940     get
45    218258     her
46    217268     make
47    205432     who
48    201968     as
49    201819     out
50    195426     up
Lemmatised list
There is a lemmatised frequency list for the 6,318 words with more than 800 occurrences in the whole 100M-word BNC. The definition of a 'word' approximates to a headword in an EFL dictionary such as Longman's Dictionary of Contemporary English: so, eg, nominal and verbal "help" are listed separately, and the count for verbal "help" is the sum of counts for verbal 'help', 'helps', 'helping', 'helped'.
Bottom 50 (least often occurring)
Rank  Frequency  Word
6268  811        bail
6269  810        unwanted
6270  810        tight
6271  810        plausible
6272  810        midfield
6273  810        alert
6274  809        feminine
6275  809        drainage
6276  809        cruelty
6277  809        abnormal
6278  808        relate
6279  808        poison
6280  807        symmetry
6281  807        stake
6282  807        rotten
6283  807        prone
6284  807        marsh
6285  807        litigation
6286  807        curl
6287  806        urine
6288  806        latin
6289  806        hover
6290  806        greeting
6291  806        chase
6292  805        spouse
6293  805        produce
6294  805        forge
6295  804        salon
6296  804        handicapped
6297  803        sway
6298  803        homosexual
6299  803        handicap
6300  803        colon
6301  802        upstairs
6302  802        stimulation
6303  802        spray
6304  802        original
6305  802        lay
6306  802        garlic
6307  801        suitcase
6308  801        skipper
6309  801        moan
6310  801        manpower
6311  801        manifest
6312  801        incredibly
6313  801        historically
6314  801        decision-making
6315  800        wildly
6316  800        reformer
6317  800        quantum
6318  800        considering
From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)
Word Search on BNC
• Results of your search
• Your query was: dude
• Here is a random selection of 50 solutions from the 68 found...
• ABS 82 The show is hip and happening, dude: the audience looks as if it has just walked in off the King's Road, the post-modernish set is ultra-cool, the show's titles are dazzling, the best I've seen on British television.
• AL3 1668 Paul Mansfield on a dude ranch finds great steaks and catfish.
• AL3 1677 The most popular dude ranch states with British holidaymakers are Arizona, Colorado and Wyoming.
• AL3 1692 Days on a dude ranch -- particularly after a night like that one -- tend to be restful.
• AL3 1705 U.K. specialists in dude ranch holidays are Ranch America, 250 Imperial Drive, Rayners Lane, Harrow, Middx HA2 7HJ.
• AL3 1707 Their 1992 programme features the Dixie Dude Ranch.
• ASV 744 ‘Killer board, dude’, said Callahan.
• ASV 1290 a punk Mafia dude
• ASV 1338 When I mentioned his name in Hawaii an informant who asked to remain anonymous said: ‘Johnny Boy Gomes is the meanest, heaviest dude on the whole of the North Shore.’
• C87 421 Dude Power -- Makes Rufus invulnerable to aliens.
• C9K 626 I'm not a great rhythm player, I don't look at myself as a great chops dude and arranging is definitely not my forte.
• C9M 908 It was made by this guy who built guitars, some hippy dude, and it was the oddest-shaped guitar.
• CD6 349 Amongst the news, views, product reviews and UK surf gossip, arrives Speng -- The Cool Ruler, a cartoon surfing dude who travels the world's beach breaks in the company of Dog Gorgon.
• CEK 2502 11 (10) CALIFORNIA MAN: Wayne's World style comedy about a rock dude.
• CGB 2016 Nutshell it for us, dude: ‘There's these two loser kids from the Valley.
• CH5 4935 He'd come a long way for a dude from Texas, and it had all been so very easy for the man with the Gary Cooper smile.
• …
Word Search on BNC
• Results of your search
• Your query was: bloody
• Here is a random selection of 50 solutions from the 6818 found...
• A32 105 The film appears to query the notion of heroism in its battle scenes, muddy, bloody, in marked contrast to the sunny patriotism of Laurence Olivier's wartime version.
• A74 824 Bloody, bloody, bloody.
• A7H 1640 ‘I got a bloody nose over that,’ he says.
• A95 255 The mass demonstrations of the first year are rare these days and the army and Shin Bet security service have been chalking up success after bloody success in hunting down the masked youths who throw petrol bombs and kill collaborators.
• AC2 548 At one such meeting a heckler had got a great round of cheers from the assembled throng when he had told Clasper to get off his bloody soap-box and do a day's work for a bloody change.
• AC2 2312 You bloody shits….’
• B24 1996 S. H. Patrolling up Prescot Road during the war, if I saw a light on, I used to shout, ‘Put that bloody light out,’ and if nobody put the light out, we used to let fly with a brick.
• B24 2614 You'd look at the sergeant and if he O. K. d it, you'd have one but if he didn't, you bloody wouldn't.’
• CA0 1427 You can't do without it, you bloody old tart, can you?
• CAF 1251 In fact, Pilsudski came to power in a bloody putsch and presided over gross human-rights violations, the brutal crushing of strikes and virtual civil war with the national minorities.
• CJF 2396 ‘Bloody murder, is it then?
• CL2 2295 She was, in short, too bloody much, and not only that, she was totally ignoring me.
• F9C 1708 Plumbers, builders, estate agents, the government, the council, bloody thieves…
• FEE 2572 "You flaming women, you're so stuffed with bloody honesty it's a wonder you don't choke on it.
• FP0 593 The short polar day died in bloody shadows.
• FR5 366 It's so bloody easy.
• FRS 26 ‘Do you know how many firms of bloody architects I've traipsed round to in the past two months?
• G1M 2346 It's bloody Piper.
Corpora Morphing Over Time
From: http://courses.ischool.berkeley.edu/i202/f07/lectures/202-20071203.pdf
Enumeration leads to predictive capabilities
• Given a sample of text made up from words within a particular corpus that has been enumerated, I can predict the probability that certain words will show up.
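One simple way to make this prediction concrete is to treat a word's relative frequency in the corpus as an estimate of its probability. The sketch below uses three counts from the lemmatised BNC list quoted earlier. It assumes the corpus size is exactly the rounded "100 million word" figure, and the function names are made up for the example.

```python
# Counts taken from the lemmatised list quoted in these slides.
bnc_counts = {"the": 6187267, "of": 3093444, "bail": 811}

# Assumption: the corpus size is rounded to the "100 million word" figure.
BNC_TOTAL_WORDS = 100_000_000

def probability(word):
    """Relative frequency of a word, used as an estimate of its probability."""
    return bnc_counts[word] / BNC_TOTAL_WORDS

def expected_occurrences(word, sample_size):
    """Expected number of times the word appears in a sample of this size."""
    return probability(word) * sample_size

for word in bnc_counts:
    print(word, probability(word), expected_occurrences(word, 1000))
```

Under these assumptions "the" would be expected roughly 62 times in a 1,000-word sample, while "bail" would be expected less than once.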
Zipf’s Law
• Named after George Zipf (1902 – 1950), a Harvard linguist
• Empirical – created from observation
• The probability of a word’s occurrence is inversely proportional to its rank
Zipf’s Law
• More formally, Zipf’s law is a power law:
  – the observation that the frequency of occurrence of some event ($P$), as a function of the rank ($i$) when the rank is determined by the above frequency of occurrence, is a power-law function $P_i \sim 1/i^{a}$ with the exponent $a$ close to unity (1).
(http://www.nslij-genetics.org/wli/zipf/)
Zipf’s Law
• Identify your corpus
• Count the total number of words
• Count the number of occurrences of each word
• Rank each word by frequency of occurrence
  – i.e. the most frequently used word has the highest rank (#1) and the least frequently used word has the lowest rank (equal to the number of distinct words in the corpus)
• Plot
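The counting and ranking steps above can be sketched as follows. The sample sentence and the function name `rank_frequency` are illustrative assumptions. Breaking ties alphabetically is also an arbitrary choice, since the slides do not say how words with equal counts should be ordered.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, where rank 1 is the most frequent word."""
    counts = Counter(tokens)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

text = "the cat and the dog and the bird"
tokens = text.split()
table = rank_frequency(tokens)

for rank, word, count in table:
    print(rank, word, count)
```

The last rank in the table equals the number of distinct words, and the (rank, count) pairs are what gets plotted.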
Zipf’s Law
Effect of taking the log of both axes
[Figure: two panels, linear scale (both axes) and log scale (both axes)]
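Why taking logs straightens the curve can be seen in one step, assuming the power-law form of Zipf's law with a proportionality constant $C$ (a symbol introduced here for illustration, not one used in the slides):

```latex
P_i = \frac{C}{i^{a}}
\quad\Longrightarrow\quad
\log P_i = \log C - a \log i
```

On log-log axes this is a straight line with slope $-a$ and intercept $\log C$, so a Zipfian word list with $a$ close to 1 should fall near a line of slope $-1$.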
Zipf’s Law in Different Languages
From Multilingual Statistical Text Analysis, Zipf’s Law and Hungarian Speech Generation
http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf
Class Exercise
• Divide into groups of 5 – 6
I. Each group will have one music category. Take 15 minutes to come up with 50 words that would fall predominantly into the following music genres. (Keep it clean).
  -- country
  -- classic rock
  -- heavy metal
  -- hip-hop/rap
  -- top 40/pop
  -- KEXP
Class Exercise Part II
• In the same groups, name 20 words that you think would appear with particular frequency (higher than ‘predicted’ by the entire corpus of all newspapers) in the following newspapers:
  – The Wall Street Journal
  – Seattle Times
  – The National Enquirer
  – The New York Times
  – The Seattle PI
  – The Ballard Journal
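One possible reading of "higher than predicted by the entire corpus" is a ratio: a word's relative frequency in one newspaper divided by its relative frequency across all newspapers. The sketch below illustrates that reading only. The two one-sentence "newspapers" are invented, the function names are hypothetical, and the whole corpus is assumed to include the sub-corpus so that every word has a nonzero overall frequency.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

def overrepresented(sub_tokens, all_tokens, top_n=3):
    """Rank words by (frequency in sub-corpus) / (frequency in whole corpus)."""
    sub = relative_frequencies(sub_tokens)
    whole = relative_frequencies(all_tokens)
    ratios = {w: sub[w] / whole[w] for w in sub if w in whole}
    # Highest ratio first; ties broken alphabetically (an arbitrary choice).
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))[:top_n]

# Invented stand-ins for a financial paper and a sports page.
financial = "stocks fell as the market closed".split()
sports = "the team won as the crowd cheered".split()
everything = financial + sports

result = overrepresented(financial, everything)
for word, ratio in result:
    print(word, round(ratio, 2))
```

A ratio above 1 means the word is more common in that newspaper than in the combined corpus. Common function words such as "the" tend to land near or below 1.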
Class Exercise Part III
• Back to music
  – Your corpus is the entire set of articles, web pages, books, interviews, etc. written by critics and reviewers of the music industry
  – Within each of your groups:
    • Quietly, identify a musical artist
    • Identify 10 words that you think appeared with exceptional frequency in this corpus
    • When done, you’ll write your 10 words on the board and the class will try to figure out who the artist is
Homework
Choose 20 words, look up their frequencies using the BNC or Kilgarriff’s lists (http://www.kilgarriff.co.uk/BNClists/lemma.num), and plot them on log-log paper. You can get log-log paper at the bookstore or print your own here: (http://incompetech.com/graphpaper/logarithmic/)
Do your 20 words look Zipfian?
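A numerical check can complement the paper plot: fit a straight line to log(frequency) against log(rank) and look at the slope. The sketch below uses eight (rank, frequency) pairs copied from the top-50 and bottom-50 lists in these slides rather than a full set of 20 words. The plain least-squares fit and the function name are illustrative assumptions, not a method prescribed by the course.

```python
import math

# (rank, frequency) pairs copied from the lemmatised BNC lists in these slides.
points = [
    (1, 6187267), (2, 4239632), (5, 2186369), (10, 1039323),
    (25, 433441), (50, 195426), (6268, 811), (6318, 800),
]

def loglog_slope(pairs):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in pairs]
    ys = [math.log10(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

slope = loglog_slope(points)
print(slope)
```

A slope near -1 is what a Zipfian list would show. A slope far from -1, or points that bend away from a straight line, would suggest the chosen words do not follow the law closely.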
Homework (continued)
• Download & install Lucene on your iSchool lab computer account (Windows side)
  – http://www.apache.org/dyn/closer.cgi/lucene/java/
  – Download lucene-2.4.0.zip
  – Install per README.txt instructions
    • Set CLASSPATH for lucene-core-2.4.0.jar & lucene-demos-2.4.0.jar
  – Test the demo IndexFiles. (Read demo.html in docs)
• Download & install Luke (lukeall.jar) @
  – http://www.getopt.org/luke/