Text is fun: Statistical exploration of large corpora

Siva Reddy

Lexical Computing Ltd, UKhttp://sketchengine.co.uk

IIIT-Hyderabad Advanced School onNatural Language Processing

July 14 2012

Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 1 / 30

Acknowledgments

Adam Kilgarriff, Michael Rundell

What is “meaning”?

Semantics: Study of meaning in language.

Lexical semantics: Study of meaning of words.

How were dictionaries built in the pre-computer era?

James Murray and colleagues: Oxford English Dictionary

Storage of evidence

Indexing

Revolution: Internet Era

Dictionary building: Requirements

Corpus (Text) Collection

Wordlist

Evidence collection: Words in action.

Word Profiles

Web as Corpus: Challenges

Crawling

Text extraction

Spamming

Duplication

Exercise 1: WebBootCaT

Collect a corpus from the web on a topic of interest (Baroni et al., 2006; Kilgarriff et al., 2010).

Wordlist

Generalized dictionary

Domain-specific dictionary

Exercise 2: Keyword Extraction

Collect keywords from the corpus you collected above.
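
The slide does not name a scoring method, so the following is only a sketch of one common approach: compare a word's normalized frequency in the collected (focus) corpus with its frequency in a general reference corpus, adding a smoothing constant to both. The function name `keyness`, the toy corpora, and the constant `n=1.0` are illustrative assumptions.

```python
from collections import Counter

def keyness(focus_tokens, reference_tokens, n=1.0):
    """Score each focus-corpus word by the ratio of smoothed per-million frequencies."""
    focus, ref = Counter(focus_tokens), Counter(reference_tokens)
    focus_total, ref_total = len(focus_tokens), len(reference_tokens)
    scores = {}
    for word, count in focus.items():
        fpm_focus = count * 1_000_000 / focus_total
        fpm_ref = ref[word] * 1_000_000 / ref_total  # 0 if the word is unseen
        scores[word] = (fpm_focus + n) / (fpm_ref + n)
    return scores

# Toy corpora (hypothetical data, already tokenized and lowercased).
focus_corpus = "the batsman hit the ball and the bowler caught the ball".split()
reference_corpus = "the cat sat on the mat and the dog ate the food".split()

scores = keyness(focus_corpus, reference_corpus)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_keywords)
```

Words that are frequent everywhere (such as "the") score close to 1, while words typical of the focus corpus score much higher. A larger `n` pushes the ranking toward more frequent words.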

Evidence collection

Words in action

Google-like searching isn’t enough

How do we get all the word forms of test?

How do we find the words within a distance of three tokens of test?

Corpus Query Language: regular expressions
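
As a rough illustration of the second question, the sketch below collects the tokens within three positions of any form matched by a regular expression. It is plain Python over a whitespace-tokenized sentence, not CQL. The function name `context_window` and the example sentence are assumptions made for illustration.

```python
import re

def context_window(tokens, target_pattern, distance=3):
    """Return tokens found within `distance` positions of any token matching the pattern."""
    target = re.compile(target_pattern)
    neighbours = []
    for i, token in enumerate(tokens):
        if target.fullmatch(token):
            left = tokens[max(0, i - distance):i]
            right = tokens[i + 1:i + 1 + distance]
            neighbours.extend(left + right)
    return neighbours

sentence = "the students will take a written test in the main hall tomorrow".split()
print(context_window(sentence, r"tests?"))
```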

Regular expressions

Regular Expression Table:

http://bit.ly/KZT7Kj

Exercise 3: Write regular expressions for . . .

http://sketchengine.co.uk/exercises/regex/

CQL: Corpus Query Language

A query is a pattern that matches a sequence of tokens

tokens have attributes (word, lemma, tag, lempos, lc)

[attribute="value"] for each token pattern

value is a regular expression

Additional pointers: http://bit.ly/LPRuju

http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying
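
To make the token-pattern idea concrete, here is a minimal Python sketch of how such a query could be evaluated. It is not the Sketch Engine implementation. Tokens are dicts of attributes, each token pattern is a dict of attribute regexes, and the regex is assumed to have to match the whole attribute value. The names `token_matches` and `find_matches`, the sample tokens, and the tag names are hypothetical.

```python
import re

def token_matches(token, pattern):
    """A token (dict of attributes) matches if every attribute regex matches its whole value."""
    return all(re.fullmatch(regex, token.get(attr, "")) for attr, regex in pattern.items())

def find_matches(tokens, query):
    """Return start positions where consecutive tokens match the consecutive token patterns."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(t, p) for t, p in zip(window, query)):
            hits.append(start)
    return hits

# Hypothetical annotated tokens; the tag names are illustrative.
tokens = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "passed", "lemma": "pass", "tag": "VBD"},
    {"word": "the", "lemma": "the", "tag": "DT"},
    {"word": "tests", "lemma": "test", "tag": "NNS"},
]

# Rough analogue of the CQL query: [tag="DT"] [lemma="test" & tag="N.*"]
query = [{"tag": "DT"}, {"lemma": "test", "tag": "N.*"}]
print(find_matches(tokens, query))
```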

Corpus Processing: Challenges

What are the noun forms of the word test?

Will "test.*" work?

Word Tokenization

Morphological analysis

Part-of-Speech Tagging

CQL: [lemma="treat" & tag="N.*"]
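
The sketch below illustrates why a surface regular expression such as "test.*" is not enough and why lemma and part-of-speech annotation help. The token list and its annotations are hypothetical, and the lemma test is used to follow the question on this slide.

```python
import re

# Hypothetical tokens with lemma and part-of-speech annotations (tagset is illustrative).
tokens = [
    {"word": "tests", "lemma": "test", "tag": "NNS"},
    {"word": "testing", "lemma": "test", "tag": "VBG"},
    {"word": "testament", "lemma": "testament", "tag": "NN"},
    {"word": "testify", "lemma": "testify", "tag": "VB"},
    {"word": "test", "lemma": "test", "tag": "NN"},
]

# Surface regex: matches anything that starts with "test".
by_regex = [t["word"] for t in tokens if re.fullmatch(r"test.*", t["word"])]

# Lemma plus tag: only noun occurrences of the lemma "test".
noun_forms = [
    t["word"] for t in tokens
    if t["lemma"] == "test" and re.fullmatch(r"N.*", t["tag"])
]

print(by_regex)
print(noun_forms)
```

The regex also returns unrelated words (testament, testify) and verb forms (testing), whereas the lemma and tag filter keeps only the noun forms.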

Collocations (word associations)

When do you say a word A is important to word B?

mouse: laser

mouse: food

Exercise 4: Collocations of the words girl and boy?

Download data from http://sivareddy.in/textisfun.tgz

Rank context words using mutual information (log removed for simplicity): $\frac{P(x, y)}{P(x)\,P(y)}$
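
A minimal sketch of this ranking, assuming the data can be read as a list of (word, context word) co-occurrence pairs. The pairs below are made up for illustration and are not the workshop data, and `mi_scores` is a hypothetical helper name.

```python
from collections import Counter

def mi_scores(pairs, target):
    """Rank context words of `target` by P(x, y) / (P(x) * P(y)), without the log."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    scores = {}
    for (word, context), count in pair_counts.items():
        if word != target:
            continue
        p_xy = count / total
        p_x = word_counts[word] / total
        p_y = context_counts[context] / total
        scores[context] = p_xy / (p_x * p_y)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical (word, context word) co-occurrence pairs, not the workshop data.
pairs = [
    ("girl", "pretty"), ("girl", "pretty"), ("girl", "young"), ("girl", "school"),
    ("boy", "naughty"), ("boy", "naughty"), ("boy", "young"), ("boy", "school"),
]

print(mi_scores(pairs, "girl"))
```

Context words shared equally by both target words (young, school) get a ratio of 1, while a context word seen only with girl (pretty) scores higher.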

Word Sketch - a profile describing collocations

Word Sketch of write-v http://bit.ly/KUCBFj

The voice of the majority

Sketch Grammar: describes the frequent constructions of words in a language

Exercise 5: Objects of eat-v?

Write the Sketch Grammar capturing the object relation.
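
The following Python sketch shows the kind of pattern an object relation might capture: a verb, optionally followed by determiners and adjectives, then a noun. It is not actual Sketch Grammar syntax. The function `objects_of`, the tag names, and the tagged sentence are assumptions for illustration.

```python
import re

def objects_of(tokens, verb_lemma):
    """Collect noun lemmas that follow the verb, allowing determiners and adjectives in between."""
    objects = []
    for i, (lemma, tag) in enumerate(tokens):
        if lemma != verb_lemma or not tag.startswith("V"):
            continue
        j = i + 1
        # Skip optional determiners (DT) and adjectives (JJ*).
        while j < len(tokens) and re.fullmatch(r"DT|JJ.*", tokens[j][1]):
            j += 1
        if j < len(tokens) and tokens[j][1].startswith("N"):
            objects.append(tokens[j][0])
    return objects

# Hypothetical (lemma, tag) sequence for: "He eats a ripe mango and she ate rice quickly"
tagged = [
    ("he", "PRP"), ("eat", "VBZ"), ("a", "DT"), ("ripe", "JJ"), ("mango", "NN"),
    ("and", "CC"), ("she", "PRP"), ("eat", "VBD"), ("rice", "NN"), ("quickly", "RB"),
]

print(objects_of(tagged, "eat"))
```

Counting how often each object is returned over a large corpus gives one column of a word sketch. A real grammar would also need to handle cases this pattern misses, such as passives and noun compounds.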

My near-dream for Indian languages?

Writing a Sketch Grammar is not very time-consuming.

Exploit the Sketch Grammar to build a syntactic parser

A parser for every language

Cash in on the similarities between different languages

When do you say two words are similar?

Distributional Hypothesis (Harris, 1954)

The words that occur in similar contexts tend to have similar meaning

e.g: laptop, computer

Backbone for Vector Space Model of Semantics.

Firth (Firth, 1957)

You shall know a person from his friends - Chinese Proverb

You shall know a word from its context - Firth’s Principle

Bag of words hypothesis

Two documents tend to be similar if they have similar distributions of similar words

Vector Space Models (VSMs) of Semantics

Interpret semantics using VSMs

Backbone: Distributional Hypothesis

The text entity we are interested in is represented as a vector (point) in a multi-dimensional space.

The contexts of the entity are the dimensions.

Existing methods represent knowledge in VSMs mainly in three types (Turney and Pantel, 2010):

term-document

term-context

pair-pattern

Term-Document: (Salton et al., 1975)

d1: Human machine interface for Lab ABC computer applications

[Figure omitted. Image courtesy: Landauer et al., 1998]

Document similarity can be found using cosine similarity:

$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$

[Figure omitted. Image courtesy: Salton et al., 1975]
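
A small sketch of the term-document idea with cosine similarity, assuming raw term counts as weights. Document d1 is the example from the slides, while d2 and d3 are illustrative additions. Real systems typically also apply stop-word removal and term weighting.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# d1 is the example document from the slides; d2 and d3 are illustrative.
documents = {
    "d1": "human machine interface for lab abc computer applications",
    "d2": "a survey of user opinion of computer system response time",
    "d3": "the generation of random binary unordered trees",
}

# Each document becomes a column of term counts in the term-document matrix.
vectors = {name: Counter(text.split()) for name, text in documents.items()}

print(cosine(vectors["d1"], vectors["d2"]))
print(cosine(vectors["d1"], vectors["d3"]))
```

Documents d1 and d2 share one term (computer) and so get a small positive similarity, while d1 and d3 share none and get zero.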

Term-Context: Word Space Model

Meaning of a word as a vector (Schütze, 1998)

Meaning of a word is represented as a cooccurrence vector built from a corpus

              police-n  photon-n  speed-n  car-n  soul-n
Traffic            142         0      293    347       1
Light               41        29      222    198      50
TrafficLight         5         0       13     48       0

Exercise 6: Compute similarity between girl, boy, dog

Hint: Represent words as vectors using mutual information scores of context words, and compute cosine similarity.
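
The hint can be sketched as follows. Because the exercise data is not reproduced here, the example uses the Traffic, Light and TrafficLight counts from the table on this slide. Probabilities are estimated from the table alone, which is an assumption made to keep the example self-contained.

```python
import math

# Co-occurrence counts copied from the table on this slide.
contexts = ["police-n", "photon-n", "speed-n", "car-n", "soul-n"]
counts = {
    "Traffic": [142, 0, 293, 347, 1],
    "Light": [41, 29, 222, 198, 50],
    "TrafficLight": [5, 0, 13, 48, 0],
}

def mi_vectors(counts):
    """Reweight raw counts as P(x, y) / (P(x) * P(y)), the log-free mutual information ratio."""
    total = sum(sum(row) for row in counts.values())
    column_totals = [sum(row[j] for row in counts.values()) for j in range(len(contexts))]
    vectors = {}
    for word, row in counts.items():
        row_total = sum(row)
        vectors[word] = [
            (c / total) / ((row_total / total) * (column_totals[j] / total))
            for j, c in enumerate(row)
        ]
    return vectors

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = mi_vectors(counts)
print(cosine(vectors["Traffic"], vectors["TrafficLight"]))
print(cosine(vectors["Light"], vectors["TrafficLight"]))
```

For Exercise 6 the same two functions would be applied to the vectors of girl, boy and dog built from the downloaded data.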

Word Senses

So far we have represented a word with a single word sketch

mouse vs mouse?

Word Sense Disambiguation: collocations are the clue

WordNet has been used extensively

Can we guess the number of senses of a word?

Word Sense Induction

Figure: Word Sense Induction in a graph-based setting
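
The details of the figure are not recoverable from this transcript, so the sketch below assumes one simple graph-based setting: the collocates of a target word are nodes, an edge links two collocates that co-occur with each other, and each connected component is treated as one induced sense. The edges for mouse are made up for illustration.

```python
from collections import defaultdict

def induce_senses(edges):
    """Group collocates into senses by finding connected components of the graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, senses = set(), []
    for node in graph:
        if node in seen:
            continue
        # Depth-first traversal collects one component.
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        senses.append(component)
    return senses

# Hypothetical co-occurrence edges among collocates of "mouse".
edges = [
    ("laser", "keyboard"), ("keyboard", "click"),
    ("food", "cat"), ("cat", "trap"),
]

senses = induce_senses(edges)
print(len(senses), senses)
```

Under this assumption the number of components gives a guess at the number of senses. Real collocation graphs are noisier and usually call for a proper graph clustering algorithm rather than plain connected components.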

Semantic Word Sketches

Semantic Frames

Demo: http://corpdev.sketchengine.co.uk/run.cgi/first_form?corpname=5dcaa5fe

Exercise 7: Abstract entities which modify boy and girl

Use the word senses of the context words as a clue.

Beyond Words: Compositional Semantics

Given meanings of

couch

roast

potato

Can we interpret the meanings of

couch potato

roast potato
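
One simple way to approach this question, assumed here for illustration and not stated on the slide, is additive composition: add the vectors of the parts and compare the result with the vector observed for the whole phrase. The slides give no vectors for couch, roast or potato, so the sketch reuses the Traffic, Light and TrafficLight counts from the term-context slide.

```python
import math

# Co-occurrence counts from the term-context slide
# (contexts: police-n, photon-n, speed-n, car-n, soul-n).
traffic = [142, 0, 293, 347, 1]
light = [41, 29, 222, 198, 50]
traffic_light = [5, 0, 13, 48, 0]

def add_vectors(u, v):
    """Additive composition: the phrase vector is the element-wise sum of its parts."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

composed = add_vectors(traffic, light)
compositionality = cosine(composed, traffic_light)
print(round(compositionality, 3))
```

Under this assumption, a phrase whose observed vector is close to the composed one would be read as compositional, which is the intuition for roast potato. A large gap would suggest an idiomatic phrase, which is the intuition for couch potato.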

Couch Potato

Roast Potato

Bibliography I

Baroni, M., Kilgarriff, A., Pomikálek, J., and Rychly, P. (2006). WebBootCaT: Instant domain-specific corpora to support human translators. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Norway.

Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32.

Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.

Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25:259–284.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613–620.

Bibliography II

Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector space models of semantics. J. Artif. Int. Res., 37:141–188.
