INFO 340 Information Retrieval Words in a Language – Cardinality and Enumeration

Upload: alicia-quinn

Post on 28-Dec-2015


Page 1: INFO 340 Information Retrieval Words in a Language – Cardinality and Enumeration

INFO 340

Information Retrieval

Words in a Language – Cardinality and Enumeration

Page 2

What do we know about words in a language?

• Consider some language and all of the words in that language
  – We can count them
  – We can identify the probability of occurrence of any particular word
  – We can categorize them according to some scheme
  – We can order them according to some scheme

Page 3

Body of Words Reference Point

• Reference point for these words –
  – A ‘corpus’
    • A particular body of text
    • Usually documents of text
• What is a document?
  – Wiki says (at least on Feb 5, 2009):
    • “A document (noun) is a bounded physical representation of body of information designed with the capacity (and usually intent) to communicate.”

Page 4

Corpus

text (words) within documents

documents within corpus

Page 5

Enumerating a Corpus

• We can count:
  – the total # of words within a corpus
  – the total # of documents within a corpus
  – the number of occurrences of any particular word within a document
  – the number of occurrences of any particular word in the entire corpus
  – how often certain words are in proximity to each other – word X is Y words from word Z
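
The counts listed above can be sketched in a few lines of Python. This is only an illustration: the two-document toy corpus, the naive whitespace tokenizer, and the helper names (`tokenize`, `proximity_count`) are assumptions made for the example, not part of the course material.

```python
from collections import Counter

# Hypothetical toy corpus: document name -> text (not from the slides).
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

tokens_by_doc = {name: tokenize(text) for name, text in corpus.items()}

# Total number of documents and total number of words in the corpus.
total_documents = len(tokens_by_doc)
total_words = sum(len(tokens) for tokens in tokens_by_doc.values())

# Occurrences of each word within each document, and in the entire corpus.
counts_by_doc = {name: Counter(tokens) for name, tokens in tokens_by_doc.items()}
corpus_counts = Counter()
for counts in counts_by_doc.values():
    corpus_counts.update(counts)

def proximity_count(tokens, x, z, y):
    """Count how often word x occurs exactly y positions before word z."""
    return sum(
        1
        for i in range(len(tokens) - y)
        if tokens[i] == x and tokens[i + y] == z
    )
```

For example, `corpus_counts["the"]` gives the corpus-wide count for "the", and `proximity_count(tokens_by_doc["doc1"], "cat", "on", 2)` counts the places in doc1 where "cat" is followed two words later by "on".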

Page 6

British National Corpus

• From http://www.natcorp.ox.ac.uk/corpus/index.xml (non-link underlines added)

• What is the BNC?

• The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.

• The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

• The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

• Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres).

• Full technical documentation covering all aspects of the BNC including its design, markup, and contents are provided by the Reference Guide for the British National Corpus (XML Edition). For earlier versions of the Reference Guide and other documentation, see the BNC Archive page.

Page 7

British National Corpus

• From http://www.natcorp.ox.ac.uk/corpus/index.xml

• What sort of corpus is the BNC?

– Monolingual: It deals with modern British English, not other languages used in Britain. However non-British English and foreign language words do occur in the corpus.

– Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it.

– General: It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.

– Sample: For written sources, samples of 45,000 words are taken from various parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.

Page 8

Top 50 most frequently occurring

From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)

Rank  Frequency  Lemma
1     6187267    the
2     4239632    be
3     3093444    of
4     2687863    and
5     2186369    a
6     1924315    in
7     1620850    to
8     1375636    have
9     1090186    it
10    1039323    to
11    887877     for
12    884599     i
13    760399     that
14    695498     you
15    681255     he
16    680739     on
17    675027     with
18    559596     do
19    534162     at
20    517171     by
21    465486     not
22    461945     this
23    459622     but
24    434532     from
25    433441     they
26    426896     his
27    384313     that
28    380257     she
29    373808     or
30    372031     which
31    364164     as
32    358039     we
33    343063     an
34    333518     say
35    297281     will
36    272345     would
37    266116     can
38    261089     if
39    260919     their
40    249540     go
41    249466     what
42    239460     there
43    230737     all
44    220940     get
45    218258     her
46    217268     make
47    205432     who
48    201968     as
49    201819     out
50    195426     up

Note: ‘to’, ‘that’ and ‘as’ each appear twice, presumably because the list separates entries by part of speech (a column not shown here).

* Lemmatised list

There is a lemmatised frequency list for the 6,318 words with more than 800 occurrences in the whole 100M-word BNC. The definition of a 'word' approximates to a headword in an EFL dictionary such as Longman's Dictionary of Contemporary English: so, eg, nominal and verbal "help" are listed separately, and the count for verbal "help" is the sum of counts for verbal 'help', 'helps', 'helping', 'helped'.

Page 9

Bottom 50 (least often occurring)

From Kilgarriff’s lemmatized* lists on BNC (http://www.kilgarriff.co.uk/BNClists/lemma.num)

Rank  Frequency  Lemma
6268  811        bail
6269  810        unwanted
6270  810        tight
6271  810        plausible
6272  810        midfield
6273  810        alert
6274  809        feminine
6275  809        drainage
6276  809        cruelty
6277  809        abnormal
6278  808        relate
6279  808        poison
6280  807        symmetry
6281  807        stake
6282  807        rotten
6283  807        prone
6284  807        marsh
6285  807        litigation
6286  807        curl
6287  806        urine
6288  806        latin
6289  806        hover
6290  806        greeting
6291  806        chase
6292  805        spouse
6293  805        produce
6294  805        forge
6295  804        salon
6296  804        handicapped
6297  803        sway
6298  803        homosexual
6299  803        handicap
6300  803        colon
6301  802        upstairs
6302  802        stimulation
6303  802        spray
6304  802        original
6305  802        lay
6306  802        garlic
6307  801        suitcase
6308  801        skipper
6309  801        moan
6310  801        manpower
6311  801        manifest
6312  801        incredibly
6313  801        historically
6314  801        decision-making
6315  800        wildly
6316  800        reformer
6317  800        quantum
6318  800        considering

Page 10

Word Search on BNC

• Results of your search
• Your query was: dude
• Here is a random selection of 50 solutions from the 68 found...
• ABS 82 The show is hip and happening, dude: the audience looks as if it has just walked in off the King's Road, the post-modernish set is ultra-cool, the show's titles are dazzling, the best I've seen on British television.
• AL3 1668 Paul Mansfield on a dude ranch finds great steaks and catfish.
• AL3 1677 The most popular dude ranch states with British holidaymakers are Arizona, Colorado and Wyoming.
• AL3 1692 Days on a dude ranch -- particularly after a night like that one -- tend to be restful.
• AL3 1705 U.K. specialists in dude ranch holidays are Ranch America, 250 Imperial Drive, Rayners Lane, Harrow, Middx HA2 7HJ.
• AL3 1707 Their 1992 programme features the Dixie Dude Ranch.
• ASV 744 ‘Killer board, dude’, said Callahan.
• ASV 1290 a punk Mafia dude
• ASV 1338 When I mentioned his name in Hawaii an informant who asked to remain anonymous said: ‘Johnny Boy Gomes is the meanest, heaviest dude on the whole of the North Shore.’
• C87 421 Dude Power -- Makes Rufus invulnerable to aliens.
• C9K 626 I'm not a great rhythm player, I don't look at myself as a great chops dude and arranging is definitely not my forte.
• C9M 908 It was made by this guy who built guitars, some hippy dude, and it was the oddest-shaped guitar.
• CD6 349 Amongst the news, views, product reviews and UK surf gossip, arrives Speng -- The Cool Ruler, a cartoon surfing dude who travels the world's beach breaks in the company of Dog Gorgon.
• CEK 2502 11 (10) CALIFORNIA MAN: Wayne's World style comedy about a rock dude.
• CGB 2016 Nutshell it for us, dude: ‘There's these two loser kids from the Valley.
• CH5 4935 He'd come a long way for a dude from Texas, and it had all been so very easy for the man with the Gary Cooper smile.
• …

Page 11

Word Search on BNC

• Results of your search
• Your query was: bloody
• Here is a random selection of 50 solutions from the 6818 found...
• A32 105 The film appears to query the notion of heroism in its battle scenes, muddy, bloody, in marked contrast to the sunny patriotism of Laurence Olivier's wartime version.
• A74 824 Bloody, bloody, bloody.
• A7H 1640 ‘I got a bloody nose over that,’ he says.
• A95 255 The mass demonstrations of the first year are rare these days and the army and Shin Bet security service have been chalking up success after bloody success in hunting down the masked youths who throw petrol bombs and kill collaborators.
• AC2 548 At one such meeting a heckler had got a great round of cheers from the assembled throng when he had told Clasper to get off his bloody soap-box and do a day's work for a bloody change.
• AC2 2312 You bloody shits….’
• B24 1996 S. H. Patrolling up Prescot Road during the war, if I saw a light on, I used to shout, ‘Put that bloody light out,’ and if nobody put the light out, we used to let fly with a brick.
• B24 2614 You'd look at the sergeant and if he O. K. d it, you'd have one but if he didn't, you bloody wouldn't.’
• CA0 1427 You can't do without it, you bloody old tart, can you?
• CAF 1251 In fact, Pilsudski came to power in a bloody putsch and presided over gross human-rights violations, the brutal crushing of strikes and virtual civil war with the national minorities.
• CJF 2396 ‘Bloody murder, is it then?
• CL2 2295 She was, in short, too bloody much, and not only that, she was totally ignoring me.
• F9C 1708 Plumbers, builders, estate agents, the government, the council, bloody thieves…
• FEE 2572 " You flaming women, you're so stuffed with bloody honesty it's a wonder you don't choke on it.
• FP0 593 The short polar day died in bloody shadows.
• FR5 366 It's so bloody easy.
• FRS 26 ‘Do you know how many firms of bloody architects I've traipsed round to in the past two months?
• G1M 2346 It's bloody Piper.

Page 12

Corpora Morphing Over Time

From: http://courses.ischool.berkeley.edu/i202/f07/lectures/202-20071203.pdf

Page 13

Enumeration leads to predictive capabilities

• Given a sample of text made up from words within a particular corpus that has been enumerated, I can predict the probability that certain words will show up.
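
As a rough illustration of this idea, the probability of a word can be estimated as its count divided by the total number of words in the corpus. The sketch below uses three counts from the BNC lemma list shown in these slides and treats the corpus size as exactly 100 million words, which is a simplifying assumption; the function names are made up for the example.

```python
# Counts copied from the BNC lemma list shown in these slides.
bnc_counts = {"the": 6187267, "be": 4239632, "of": 3093444}

# Assumption: the "100 million word" corpus size is treated as exact.
BNC_TOTAL_WORDS = 100_000_000

def estimated_probability(word):
    """Relative frequency of a lemma, used as an estimate of its probability."""
    return bnc_counts[word] / BNC_TOTAL_WORDS

def expected_occurrences(word, sample_length):
    """Expected number of times the lemma shows up in a sample of given length."""
    return estimated_probability(word) * sample_length
```

Under these assumptions, `expected_occurrences("the", 1000)` suggests roughly 62 occurrences of "the" (in any of its lemmatised forms) per 1,000 words of text drawn from a similar corpus.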

Page 14

Zipf’s Law

• Named after George Zipf (1902 – 1950), a Harvard linguist

• Empirical – created from observation

• The probability of a word’s occurrence is inversely proportional to its rank

Page 15

Zipf’s Law

• More formally, Zipf’s law is a power law:
  – the observation that the frequency of occurrence of some event ($P$), as a function of the rank ($i$) when the rank is determined by that frequency of occurrence, is a power-law function $P_i \sim 1/i^{a}$ with the exponent $a$ close to unity (1).

(http://www.nslij-genetics.org/wli/zipf/)
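
To make the power law concrete, the sketch below turns the weights 1/i^a into probabilities by dividing by their sum. The function name `zipf_probabilities` and the choice to normalize over a fixed number of ranks are assumptions for illustration; the slides only state the proportionality.

```python
def zipf_probabilities(n_ranks, a=1.0):
    """Idealized Zipf probabilities for ranks 1..n_ranks with exponent a."""
    weights = [1.0 / (i ** a) for i in range(1, n_ranks + 1)]
    total = sum(weights)  # normalizing constant so the probabilities sum to 1
    return [w / total for w in weights]
```

With `a = 1`, the rank-1 word is twice as probable as the rank-2 word and three times as probable as the rank-3 word, which is what "inversely proportional to its rank" means.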

Page 16

Zipf’s Law

• Identify your corpus
• Count the total number of words
• Count the number of occurrences of each word
• Rank each word by frequency of occurrence
  – i.e. the most frequently used word has the highest rank (#1) and the least frequently used word has the lowest rank (equal to the number of distinct words in the corpus)
• Plot
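
The counting and ranking steps above might look like this in Python. It is a minimal sketch: the whitespace tokenizer and the function name `rank_words` are assumptions, ties are broken by first appearance in the text, and the plotting step is left out.

```python
from collections import Counter

def rank_words(text):
    """Return (rank, word, count) tuples, most frequent word first (rank 1)."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()  # sorted by count, highest first
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]
```

The resulting ranks and counts are the two quantities to plot against each other.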

Page 17

Zipf’s Law

Effect of taking the log of both axes

[Figure: two panels, linear scale (both axes) and log scale (both axes)]
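
Why the log-log plot looks like a straight line: if frequency = C / rank^a, then log(frequency) = log(C) - a * log(rank), so the slope of the line is -a. The sketch below estimates that slope with an ordinary least-squares fit on four (rank, frequency) pairs taken from the BNC lemma list in these slides. The choice of points and the helper name `loglog_slope` are assumptions for illustration, and four points are far too few for a serious fit.

```python
import math

# (rank, frequency) pairs copied from the BNC lemma list in these slides.
bnc_points = [(1, 6187267), (10, 1039323), (50, 195426), (6268, 811)]

def loglog_slope(points):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

A slope near -1 corresponds to an exponent a near 1; `loglog_slope(bnc_points)` comes out close to that for these four points.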

Page 18

Zipf’s Law in Different Languages

From Multilingual Statistical Text Analysis, Zipf’s Law and Hungarian Speech Generation
http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf

Page 19

Class Exercise

• Divide into groups of 5 – 6

I. Each group will have one music category. Take 15 minutes to come up with 50 words that would fall predominantly into the following music genres. (Keep it clean).
  -- country
  -- classic rock
  -- heavy metal
  -- hip-hop/rap
  -- top 40/pop
  -- KEXP

Page 20

Class Exercise Part II

• In the same groups, name 20 words that you think would appear with particular frequency (higher than ‘predicted’ by the entire corpus of all newspapers) in the following newspapers:
  – The Wall Street Journal
  – Seattle Times
  – The National Enquirer
  – The New York Times
  – The Seattle PI
  – The Ballard Journal
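
One way to make "higher than predicted by the entire corpus" measurable is to compare a word's relative frequency in one newspaper with its relative frequency across all newspapers. The sketch below is an assumption about how that comparison could be done, not a method given in the course; the function names and the `min_ratio` threshold are hypothetical.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def overrepresented(sub_tokens, all_tokens, min_ratio=1.5):
    """Words whose relative frequency in the sub-corpus exceeds the
    whole-corpus relative frequency by at least min_ratio."""
    sub = relative_frequencies(sub_tokens)
    ref = relative_frequencies(all_tokens)
    ratios = {word: sub[word] / ref[word] for word in sub if word in ref}
    hits = [(word, ratio) for word, ratio in ratios.items() if ratio >= min_ratio]
    return sorted(hits, key=lambda pair: -pair[1])
```

Here `all_tokens` is assumed to include the sub-corpus itself, so every word in `sub_tokens` also appears in the reference counts.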

Page 21

Class Exercises Part III

• Back to music –
  – Your corpus is the entire set of articles, web pages, books, interviews, etc. written by critics and reviewers of the music industry
  – Within each of your groups:
    • Quietly, identify a musical artist
    • Identify 10 words that you think appear with exceptional frequency in this corpus
    • When done, you’ll write your 10 words on the board and the class will try to figure out who the artist is

Page 22

Homework

Choose 20 words, look up their ranks and frequencies using the BNC or Kilgarriff’s lists (http://www.kilgarriff.co.uk/BNClists/lemma.num), and plot them on log-log paper. You can get log-log paper at the book store or print your own here: (http://incompetech.com/graphpaper/logarithmic/)

Do your 20 words look Zipfian?

Page 23

Homework (continued)

• Download & install Lucene on your iSchool lab computer account (Windows side) --
  – http://www.apache.org/dyn/closer.cgi/lucene/java/
  – Download lucene-2.4.0.zip
  – Install per README.txt instructions
    • Set CLASSPATH for lucene-core-2.4.0.jar & lucene-demos-2.4.0.jar
  – Test the demo IndexFiles. (Read demo.html in docs)
• Download & install Luke (lukeall.jar) @
  – http://www.getopt.org/luke/