Text is fun: Statistical exploration of large corpora
Siva Reddy
Lexical Computing Ltd, UK
http://sketchengine.co.uk
IIIT-Hyderabad Advanced School on Natural Language Processing
July 14 2012
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 1 / 30
Acknowledgments
Adam Kilgarriff
Michael Rundell
What is “meaning”?
Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.
How were dictionaries built in the pre-computer era?
James Murray and colleagues: Oxford English Dictionary
Storage of evidence
Indexing
Revolution: Internet Era
Dictionary building: Requirements
Corpus (Text) Collection
Wordlist
Evidence collection: Words in action.
Word Profiles
Web as Corpus: Challenges
Crawling
Text extraction
Spamming
Duplication
Exercise 1: WebBootCaT
Collect a corpus from the web on a topic of interest (Baroni et al., 2006; Kilgarriff et al., 2010).
Wordlist
Generalized dictionary
Domain-specific dictionary
Exercise 2: Keyword Extraction
Collect keywords from the corpus you collected above.
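One common way to approach Exercise 2 is to compare how often each word occurs in the collected (focus) corpus with how often it occurs in a general reference corpus. The sketch below is a minimal illustration under assumptions: the two toy corpora, the function name keyword_scores, and the smoothing constant are all hypothetical, and this is not necessarily the scoring that Sketch Engine applies.

```python
from collections import Counter


def keyword_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Score each focus-corpus word by the ratio of its smoothed
    per-million frequency in the focus corpus to that in the reference corpus."""
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(reference_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(reference_tokens)
    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = count * 1_000_000 / focus_size
        ref_fpm = ref_counts[word] * 1_000_000 / ref_size
        # Smoothing avoids division by zero for words absent from the reference.
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores


# Hypothetical toy corpora, already tokenized by whitespace.
focus = "the parser tags the corpus and the parser lemmatizes the corpus".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()

scores = keyword_scores(focus, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

Words that are frequent everywhere (such as "the") score close to 1, while words specific to the focus corpus rise to the top of the ranking.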
Evidence collection
Words in action
Google-like searching isn't enough
How do we get all the word forms of test?
How do we find the words at a distance of three tokens from test?
Corpus Query Language: regular expressions
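The two questions above can be approximated with plain regular expressions on raw text. The following is only a sketch: the example sentence and the helper words_at_distance are hypothetical, and tokenization is reduced to splitting on word characters.

```python
import re

# Hypothetical example sentence.
text = "She tests the new device, and testing it twice is the best test we have."

# Surface forms beginning with "test". This is an approximation: the same
# pattern would also match unrelated words such as "testify".
word_forms = re.findall(r"\btest\w*\b", text)
print(word_forms)

# A crude tokenization into lowercase word tokens.
tokens = re.findall(r"\w+", text.lower())


def words_at_distance(tokens, target_pattern, distance=3):
    """Return tokens exactly `distance` positions left or right of any
    token that fully matches `target_pattern`."""
    pattern = re.compile(target_pattern)
    found = []
    for i, token in enumerate(tokens):
        if pattern.fullmatch(token):
            for j in (i - distance, i + distance):
                if 0 <= j < len(tokens):
                    found.append(tokens[j])
    return found


print(words_at_distance(tokens, r"test\w*"))
```

Doing this by hand for every query quickly becomes awkward, which is the motivation for a dedicated corpus query language.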
Regular expressions
Regular Expression Table:
http://bit.ly/KZT7Kj
Exercise 3: Write regular expressions for . . .
http://sketchengine.co.uk/exercises/regex/
CQL: Corpus Query Language
A query is a pattern that matches a sequence of tokens
Tokens have attributes (word, lemma, tag, lempos, lc)
Each token pattern is written as [attribute="value"]
The value is a regular expression
Additional pointers: http://bit.ly/LPRuju
http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying
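To make the token-pattern idea concrete, the sketch below mimics a very small subset of CQL in Python. It is not how Sketch Engine implements CQL: the function match_query, the dictionary-based query format, the hand-annotated tokens, and the Penn-Treebank-style tags are all assumptions made for illustration.

```python
import re


def match_query(tokens, query):
    """Return start positions where consecutive tokens satisfy the query.

    Each query element maps attribute names to regular expressions; every
    expression must fully match the token's attribute value, similar to
    combining conditions with & inside one [ ... ] token pattern."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(
            re.fullmatch(regex, token.get(attr, ""))
            for token, constraints in zip(window, query)
            for attr, regex in constraints.items()
        ):
            hits.append(start)
    return hits


# Hypothetical hand-annotated tokens.
tokens = [
    {"word": "The", "lemma": "the", "tag": "DT"},
    {"word": "tests", "lemma": "test", "tag": "NNS"},
    {"word": "were", "lemma": "be", "tag": "VBD"},
    {"word": "tested", "lemma": "test", "tag": "VBN"},
]

# Roughly corresponds to: [tag="DT"] [lemma="test" & tag="N.*"]
query = [{"tag": "DT"}, {"lemma": "test", "tag": "N.*"}]
print(match_query(tokens, query))
```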
Corpus Processing: Challenges
What are the noun forms of the word test?
Will "test.*" work? No: it also matches verb forms (tested) and unrelated words (testify).
Word tokenization
Morphological analysis
Part-of-speech tagging
CQL: [lemma="test" & tag="N.*"]
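The difference between a surface regular expression and a query over lemma and tag can be seen on a handful of annotated tokens. This is a sketch only: the (word, lemma, tag) triples are hand-written and hypothetical, and the tags are assumed to follow a Penn-Treebank-like scheme in which noun tags start with N.

```python
import re

# Hypothetical hand-annotated (word, lemma, tag) triples.
annotated = [
    ("tests", "test", "NNS"),
    ("testify", "testify", "VB"),
    ("tested", "test", "VBD"),
    ("test", "test", "NN"),
    ("testament", "testament", "NN"),
]

# Surface regular expression: matches anything that starts with "test".
surface_hits = [word for word, _, _ in annotated if re.fullmatch(r"test.*", word)]

# Lemma and tag together: only noun uses of the lemma "test".
noun_hits = [
    word
    for word, lemma, tag in annotated
    if lemma == "test" and re.fullmatch(r"N.*", tag)
]

print(surface_hits)
print(noun_hits)
```

The second query only works because tokenization, morphological analysis, and part-of-speech tagging have already been applied to the corpus.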
Collocations (word associations)
When do we say that word A is important to word B?
mouse: laser
mouse: food
Exercise 4: Collocations of the words girl and boy?
Download data from http://sivareddy.in/textisfun.tgz
Rank context words using mutual information (log removed for simplicity): $\frac{P(x, y)}{P(x)\,P(y)}$
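The ranking score for Exercise 4 can be estimated from co-occurrence counts. The sketch below assumes the data has already been reduced to a list of (word, context word) pairs; the format of the downloadable data is not described here, so the toy pairs and the function name association_scores are hypothetical.

```python
from collections import Counter


def association_scores(pairs):
    """Estimate P(x, y) / (P(x) P(y)) for every observed (word, context) pair,
    using relative frequencies over the list of pairs."""
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    total = len(pairs)
    scores = {}
    for (word, context), count in pair_counts.items():
        p_xy = count / total
        p_x = word_counts[word] / total
        p_y = context_counts[context] / total
        scores[(word, context)] = p_xy / (p_x * p_y)
    return scores


# Hypothetical co-occurrence observations.
pairs = [
    ("girl", "little"), ("girl", "little"), ("girl", "pretty"),
    ("boy", "little"), ("boy", "naughty"), ("boy", "naughty"),
]

scores = association_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A score above 1 means the pair occurs more often than expected if the two words were independent. With small counts this ratio tends to favor rare pairs, so a minimum frequency threshold is often applied in practice.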
Word Sketch - a profile describing collocations
Word Sketch of write-v http://bit.ly/KUCBFj
The voice of the majority
Sketch Grammar: describes the frequent constructions of words in a language
Exercise 5: Objects of eat-v?
Write a Sketch Grammar rule that captures the object relation.
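The idea behind an object relation can be sketched in Python as a pattern over part-of-speech tags: a verb, an optional determiner, any number of adjectives, then a noun. This is not Sketch Grammar syntax, only an illustration of the kind of construction such a rule describes; the function extract_objects, the example sentence, and the Penn-Treebank-style tags are assumptions.

```python
def extract_objects(tagged_tokens, verb_lemma):
    """Collect the lemma of the first noun that follows each occurrence of
    the verb, skipping one optional determiner and any adjectives."""
    objects = []
    for i, (_, lemma, tag) in enumerate(tagged_tokens):
        if lemma != verb_lemma or not tag.startswith("V"):
            continue
        j = i + 1
        if j < len(tagged_tokens) and tagged_tokens[j][2] == "DT":
            j += 1
        while j < len(tagged_tokens) and tagged_tokens[j][2].startswith("JJ"):
            j += 1
        if j < len(tagged_tokens) and tagged_tokens[j][2].startswith("N"):
            objects.append(tagged_tokens[j][1])
    return objects


# Hypothetical hand-annotated (word, lemma, tag) triples.
sentence = [
    ("She", "she", "PRP"), ("ate", "eat", "VBD"), ("a", "a", "DT"),
    ("ripe", "ripe", "JJ"), ("mango", "mango", "NN"), ("and", "and", "CC"),
    ("he", "he", "PRP"), ("eats", "eat", "VBZ"), ("rice", "rice", "NN"),
    ("quickly", "quickly", "RB"),
]

print(extract_objects(sentence, "eat"))
```

A shallow pattern like this misses passives, long-distance objects, and noun compounds, which is why it captures frequent constructions rather than every case.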
My near-dream for Indian languages?
Writing a Sketch Grammar does not take much time.
Exploit the Sketch Grammar to build a syntactic parser
A parser for every language
Cash in on the similarities between different languages
When do you say two words are similar?
Distributional Hypothesis (Harris, 1954)
Words that occur in similar contexts tend to have similar meanings
e.g., laptop, computer
Backbone for Vector Space Model of Semantics.
Firth (Firth, 1957)
You shall know a person from his friends - Chinese Proverb
You shall know a word from its context - Firth’s Principle
Bag of words hypothesis
Two documents tend to be similar if they have similar distributions of similar words
Vector Space Models (VSMs) of Semantics
Interpret semantics using a VSM
Backbone: Distributional Hypothesis
The text entity we are interested in is a vector (point) in a multidimensional space.
The contexts of the entity are the dimensions.
Existing methods represent knowledge in VSMs in three main types (Turney and Pantel, 2010):
term-document
term-context
pair-pattern
Term-Document: (Salton et al., 1975)
d1: Human machine interface for Lab ABC computer applications
Image courtesy: (Landauer et al., 1998)
Document similarity can be found using cosine similarity:
$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$
Image courtesy: (Salton et al., 1975)
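A term-document representation and the cosine formula can be tried out on a few short documents. In the sketch below, d1 is the document shown on the slide (lowercased); d2 and d3 are hypothetical documents added only so that there is something to compare against, and raw term counts are used without any weighting.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = {
    "d1": "human machine interface for lab abc computer applications",
    "d2": "computer applications for human users",  # hypothetical
    "d3": "graph minors and trees",                 # hypothetical
}

# One dimension per distinct term; each document becomes a vector of counts.
vocabulary = sorted({term for text in documents.values() for term in text.split()})
vectors = {}
for name, text in documents.items():
    counts = Counter(text.split())
    vectors[name] = [counts[term] for term in vocabulary]

print(round(cosine(vectors["d1"], vectors["d2"]), 3))
print(round(cosine(vectors["d1"], vectors["d3"]), 3))
```

Documents that share no terms get a similarity of 0, and the score grows with the proportion of shared terms.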
Term-Context: Word Space Model
Meaning of a word as a vector (Schütze, 1998)
Meaning of a word is represented as a co-occurrence vector built from a corpus

              police-n  photon-n  speed-n  car-n  soul-n
Traffic            142         0      293    347       1
Light               41        29      222    198      50
TrafficLight         5         0       13     48       0

Exercise 6: Compute similarity between girl, boy, dog
Hint: Represent words as vectors using mutual information scores of context words, and compute cosine similarity.
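As a warm-up for Exercise 6, cosine similarity can be computed directly on the co-occurrence counts from the table above. This sketch uses the raw counts as they appear on the slide; the exercise itself suggests mutual information scores instead, so the function name and the choice of raw counts are simplifying assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Dimensions: police-n, photon-n, speed-n, car-n, soul-n (from the slide).
word_vectors = {
    "Traffic": [142, 0, 293, 347, 1],
    "Light": [41, 29, 222, 198, 50],
    "TrafficLight": [5, 0, 13, 48, 0],
}

names = list(word_vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(word_vectors[first], word_vectors[second])
        print(first, second, round(score, 3))
```

On these counts, TrafficLight comes out closer to Traffic than to Light, which reflects the shared car-related contexts.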
Word Senses
So far, we have represented each word with a single word sketch
mouse (animal) vs. mouse (computer device)?
Word Sense Disambiguation: collocations are the clue
WordNet has been used extensively
Can we guess the number of senses of a word?
Word Sense Induction
Figure: Word Sense Induction in a graph-based setting
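One very simple graph-based approach to sense induction links the context words of a target word whenever they appear together, and then treats each connected group as one induced sense. The sketch below illustrates only that idea and is not necessarily the method shown in the figure; the contexts for "mouse" and the function induce_senses are hypothetical, and practical systems typically use weighted edges and a proper graph clustering algorithm.

```python
from collections import defaultdict


def induce_senses(contexts):
    """Build a graph whose nodes are context words, with an edge between two
    words that share a context, and return its connected components."""
    graph = defaultdict(set)
    for context in contexts:
        for word in context:
            graph[word] |= context - {word}

    seen = set()
    senses = []
    for node in sorted(graph):
        if node in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(graph[current] - seen)
        senses.append(component)
    return senses


# Hypothetical sets of context words observed around the target word "mouse".
contexts = [
    {"laser", "click"}, {"click", "keyboard"},
    {"food", "cheese"}, {"cheese", "trap"},
]

for sense in induce_senses(contexts):
    print(sorted(sense))
```

The number of components gives a rough guess at the number of senses, without consulting a predefined sense inventory.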
Semantic Word Sketches
Semantic Frames
Demo: http://corpdev.sketchengine.co.uk/run.cgi/first_form?corpname=5dcaa5fe
Exercise 7: Find abstract entities that modify boy and girl
Use the word senses of context words as a clue.
Beyond Words: Compositional Semantics
Given meanings of
couch
roast
potato
Can we interpret the meanings of
couch potato
roast potato
Couch Potato
Roast Potato
Bibliography
Baroni, M., Kilgarriff, A., Pomikálek, J., and Rychlý, P. (2006). WebBootCaT: Instant domain-specific corpora to support human translators. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Norway.
Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32.
Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.
Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta.
Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25:259–284.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613–620.
Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector space models of semantics. J. Artif. Int. Res., 37:141–188.