
Page 1: A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali Monojit Choudhury Microsoft Research India monojitc@microsoft.com

A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali

Monojit Choudhury
Microsoft Research India
monojitc@microsoft.com

Page 2

Co-authors

Chris Biemann
University of Leipzig

Joydeep Nath, Animesh Mukherjee, Niloy Ganguly
Indian Institute of Technology Kharagpur

Page 3

Language – A Complex System

Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse

Function: Communication through recursive syntax and compositional semantics

Dynamics: Evolution, language change

Page 4

Computational Linguistics

Study of language using computers
Study of language-using computers

Natural Language Processing:
Speech recognition
Machine translation
Automatic summarization
Spell checkers, information retrieval & extraction, …

Page 5

Labeling of Text

Lexical Category (POS tags)
Syntactic Category (phrases, chunks)
Semantic Role (agent, theme, …)
Sense
Domain-dependent labeling (genes, proteins, …)

How to define the set of labels?

How to (learn to) predict them automatically?

Page 6

Distributional Hypothesis

“A word is characterized by the company it keeps” – Firth, 1957

Syntax: function words (Harris, 1968)
Semantics: content words

Page 7

Outline

Defining Context
Syntactic Network of Words
Complex Network – Theory & Applications
Chinese Whispers: Clustering the Network
Experiments
Topological Properties of the Networks
Evaluation
Future work

Page 8

Feature Words

Estimate the unigram frequencies

Feature words: Most frequent m words
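A minimal sketch of this step, assuming the corpus is already tokenized into a list of words. The function name `select_feature_words` and the toy sentence are illustrative and not taken from the slides.

```python
from collections import Counter


def select_feature_words(tokens, m):
    """Return the m most frequent tokens, most frequent first."""
    counts = Counter(tokens)  # unigram frequencies
    return [word for word, _ in counts.most_common(m)]


# Toy corpus: the example sentence used later in the slides, lowercased.
tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = select_feature_words(tokens, 1)
print(feature_words)
```

On a real corpus m would be one of the values used in the experiments (25, 50, 100 or 200).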

Page 9

Feature Vector

Example sentence: From the familiar to the exotic, the collection is a delight

Target word: familiar

Position   fw1 (the)   fw2 (to)   …   fw199 (is)   fw200 (from)
p-2        0           0          …   0            1
p-1        1           0          …   0            0
p1         0           1          …   0            0
p2         1           0          …   0            0
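A minimal sketch of how one such context vector could be built for the word "familiar" in the example sentence. It assumes the four context positions p-2, p-1, p1 and p2, and uses a toy list of four feature words as a stand-in for fw1 … fw200. The function name and the toy list are illustrative, and how the per-occurrence vectors are combined into one vector per word is not shown on the slide.

```python
def context_vector(tokens, index, feature_words, offsets=(-2, -1, 1, 2)):
    """Concatenate one indicator block per context position for tokens[index]."""
    position = {word: i for i, word in enumerate(feature_words)}
    m = len(feature_words)
    vector = [0] * (m * len(offsets))
    for block, offset in enumerate(offsets):
        j = index + offset
        # Mark the feature word (if any) found at this context position.
        if 0 <= j < len(tokens) and tokens[j] in position:
            vector[block * m + position[tokens[j]]] = 1
    return vector


tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]  # toy stand-in for fw1 ... fw200
vec = context_vector(tokens, tokens.index("familiar"), feature_words)
print(vec)
```

With the toy feature list, the four blocks of `vec` correspond to the four rows p-2, p-1, p1, p2 of the slide.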

Page 10

Syntactic Network of Words

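[Figure: word network with nodes light, color, red, blue, blood, sky, heavy, weight; example edge weights 100, 20, 1; edge weight between red and blue: 1 / (1 – cos(red, blue))]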
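A small sketch of how an edge weight could be computed from two feature vectors. The slide only shows the expression involving cos(red, blue); the code assumes the weight is 1 / (1 − cos(u, v)), under which a weight of 100 would correspond to a cosine of 0.99. The function names and the `eps` guard are illustrative additions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def edge_weight(u, v, eps=1e-9):
    """Assumed weight 1 / (1 - cos); eps avoids division by zero."""
    return 1.0 / max(1.0 - cosine(u, v), eps)


print(edge_weight([1, 0], [0, 1]))  # orthogonal vectors give the minimum weight
```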

Page 11

The Chinese Whispers Algorithm

[Figure: word network with nodes light, color, red, blue, blood, sky, heavy, weight and edge weights 0.9, 0.5, 0.9, 0.7, 0.8, -0.5]
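The slides show the algorithm only as a picture. As a rough sketch of the commonly described procedure: every node starts in its own class, and nodes are then visited in random order, each adopting the class with the largest total edge weight among its neighbors. The function name, the fixed iteration count, the tie-breaking rule and the toy edges below are assumptions for illustration, not details from the talk.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """edges: iterable of (u, v, weight). Returns a dict node -> class label."""
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w
    labels = {node: node for node in neighbors}  # each node starts alone
    nodes = list(neighbors)
    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in random order
        for node in nodes:
            scores = defaultdict(float)
            for nb, w in neighbors[node].items():
                scores[labels[nb]] += w
            if scores:
                # Ties go to the label that was seen first.
                labels[node] = max(scores, key=scores.get)
    return labels


# Hypothetical edges; the slide does not say which pair carries which weight.
toy_edges = [("red", "blue", 0.9), ("sky", "blood", 0.8)]
print(chinese_whispers(toy_edges))
```

Labels can only travel along edges, so words in different connected components never end up in the same class.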


Page 14

Experiments

Corpus: Anandabazaar Patrika (17M words)

We build networks Gn,m

n: corpus size – {1M, 2M, 5M, 10M, 17M}
m: number of feature words – {25, 50, 100, 200}

Number of nodes: 5000
Number of edges: ~150,000

Page 15

Topological Properties: Cumulative Degree Distribution

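[Figure: cumulative degree distribution Pk versus degree k for G17M,50]

CDD: Pk is the probability that a randomly chosen node has degree ≥ k

$P_k \sim -\log(k) \;\Rightarrow\; p_k = -\frac{dP_k}{dk} \sim \frac{1}{k}$: Zipfian distribution!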
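A minimal sketch of how the cumulative degree distribution defined above can be computed from a list of node degrees. The function name and the toy degree list are illustrative.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {k: P_k}, where P_k is the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    cdd = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        cdd[k] = remaining / n
        remaining -= counts[k]
    return cdd


degrees = [1, 1, 2, 3, 3, 3]  # toy degree sequence
cdd = cumulative_degree_distribution(degrees)
print(cdd)
```

Plotting P_k against log(k) for the real network would show whether the roughly linear decay claimed on the slide holds.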

Page 16

Topological Properties: Clustering Coefficient

Measures transitivity of the network or equivalently the proportion of triangles

Very small for random graphs, high for social networks

Mean CC for G17M,50: 0.53

[Figure: CC vs. degree]
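A sketch of the standard unweighted local clustering coefficient and its mean over all nodes. The slides do not say whether edge weights enter the computation or how nodes with fewer than two neighbors are treated; this version ignores weights and counts such nodes as 0. The adjacency data is a toy example.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: no triangle possible, count as 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def mean_clustering(adj):
    """Average of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(mean_clustering(adj))
```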

Page 17

Topological Properties: Cluster Size Distribution

[Figure: cluster size versus rank, two panels: variation with n (m = 50) and variation with m (n = 17M)]

Page 18

Evaluation: Tag Entropy

w: {t1, t6, t9}
Tag_w: 1 0 0 0 0 1 0 0 1 0

Cluster C: {w1, w2, w3, w4}

Tag_w1:   1 0 0 0 0 1 0 0 1 0
Tag_w2:   0 0 1 0 0 1 0 0 1 0
Tag_w3:   0 0 0 0 0 1 0 0 1 0
Tag_w4:   1 0 1 0 0 1 0 0 1 0
Per tag:  1 0 1 0 0 0 0 0 0 0

TE(C) = 2 (sum of the per-tag values)
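The slide gives the worked example but not the formula. One reading that reproduces the result TE(C) = 2 is to sum, over all tags, the binary entropy of the fraction of cluster words that carry the tag. The sketch below implements that reading as an assumption. The tag vectors are the four cluster rows from the slide.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def tag_entropy(tag_vectors):
    """Assumed definition: sum over tags of the binary entropy of the
    fraction of words in the cluster that have the tag."""
    n = len(tag_vectors)
    return sum(binary_entropy(sum(column) / n) for column in zip(*tag_vectors))


cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w1
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w2
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w3
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w4
]
te = tag_entropy(cluster)
print(te)
```

Tags 1 and 3 are each carried by half of the words and contribute one bit each; tags shared by all or none of the words contribute nothing.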

Page 19

Mean Tag Entropy

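$\mathrm{MTE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TE}(C_i)$

$\mathrm{Weighted\ MTE} = \frac{\sum_{i=1}^{N} |C_i| \, \mathrm{TE}(C_i)}{\sum_{i=1}^{N} |C_i|}$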

Caveat: Every word in separate cluster has 0 MTE and WMTE

Baseline: Every word in a single cluster
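A short sketch of the two averages, taking per-cluster tag entropies and cluster sizes as given. The numbers are made up for illustration, and the function names are not from the slides.

```python
def mean_tag_entropy(entropies):
    """Unweighted mean of TE(C_i) over the N clusters."""
    return sum(entropies) / len(entropies)


def weighted_mean_tag_entropy(entropies, sizes):
    """Mean of TE(C_i) weighted by cluster size |C_i|."""
    return sum(s * te for te, s in zip(entropies, sizes)) / sum(sizes)


entropies = [2.0, 0.0, 1.0]  # hypothetical TE values for three clusters
sizes = [4, 1, 5]            # hypothetical cluster sizes
print(mean_tag_entropy(entropies))
print(weighted_mean_tag_entropy(entropies, sizes))
```

The caveat on the slide follows directly: if every word sits in its own cluster, every TE value is 0, so both averages are 0 regardless of how useful the clustering is.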

Page 20

Tag Entropy vs. Corpus Size

% Reduction in Tag Entropy (m = 50)

Corpus size   1M      2M      5M      10M     17M
MTE           74.49   75.14   76.09   78.29   74.94
WMTE          17.46   18.68   24.23   27.56   30.60

Page 21

The Bigger, the Worse!

[Figure: tag entropy versus cluster size]

Page 22

Clusters …

Big ones → bad ones → mix of everything!

Medium sized clusters are good

http://banglaposclusters.googlepages.com/home

Rank   Size   Type
5      596    Proper nouns, titles and posts
6      352    Possessive case of nouns (common, proper, verbal) and pronouns
8      133    Nouns (common, verbal) forming compounds with “do” or “be”
11     44     Number-Classifier (e.g. 1-TA, ekaTA)
12     84     Adjectives

Page 23

More Observations

Words are split into:
First names vs. surnames
Animate noun-poss vs. inanimate noun-poss
Nouns-acc vs. nouns-poss vs. nouns-loc
Verb-finite vs. verb-infinitive

Syntactic or semantic?
Nouns related to professions, months, days of week, stars, players etc.

Page 24

Advantages

No labeled data required: A good solution to resource scarcity

No prior class information: Circumvents issues related to tag set definition

Computational definition of Class

Understanding the structure of language (syntax) and its evolution

Page 25

Danke für Ihre Aufmerksamkeit.

Dieses ist „vom Übersetzer übersetzt worden, der“ von Phasen Microsoft Beta ist.

Thank you for your attention

This has been translated by "Translator Beta" from Microsoft Live.

Page 26

Related Work

Harris, 68: Distributional hypothesis for syntactic classes
Miller and Charles, 91: Function words as features
Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique
Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging
Dasgupta and Ng, 07: Bengali POS induction through morphological features

Page 27

Medium and Low Frequency Words

Neighboring (window 4) co-occurrences ranked by log-likelihood thresholded by θ

Two words are connected iff they share at least 4 neighbors

Language English Finnish German

Nodes 52857 85627 137951

Edges 691241 702349 1493571
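A minimal sketch of the connection rule on this slide (two words are linked iff they share at least 4 neighbors). It assumes the significant neighbors of each word, after log-likelihood ranking and thresholding by θ, are already available as sets. The function name and the toy data are illustrative, and the quadratic loop over word pairs is for clarity only and would not scale to graphs of the sizes in the table.

```python
from itertools import combinations


def shared_neighbor_edges(significant_neighbors, min_shared=4):
    """Return (u, v, shared_count) for word pairs sharing enough neighbors."""
    edges = []
    for u, v in combinations(sorted(significant_neighbors), 2):
        shared = significant_neighbors[u] & significant_neighbors[v]
        if len(shared) >= min_shared:
            edges.append((u, v, len(shared)))
    return edges


# Hypothetical neighbor sets for three words.
toy_neighbors = {
    "cat": {"the", "a", "sat", "ran", "big"},
    "dog": {"the", "a", "sat", "ran", "small"},
    "of": {"the", "a"},
}
print(shared_neighbor_edges(toy_neighbors))
```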

Page 28

Construction of Lexicon

Each word is assigned a unique tag based on the word class it belongs to
Class 1: sky, color, blood, weight
Class 2: red, blue, light, heavy

Ambiguous words:
High and medium frequency words that formed singleton clusters
Possible tags of neighboring clusters

Page 29

Training and Evaluation

Unsupervised training of trigram HMM using the clusters and lexicon

Evaluation:
Tag a text for which a gold standard is available
Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$

Final Results: English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440)
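A sketch of the evaluation measure named above: the conditional entropy of gold tags T given induced clusters C, estimated from token-level (tag, cluster) pairs, and the perplexity 2^H(T|C). The function names and the toy pairs are illustrative. The slides do not give the estimation details, so plain relative frequencies are assumed.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(T|C) in bits, estimated from (tag, cluster) pairs, one per token."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    cluster_counts = Counter(cluster for _, cluster in pairs)
    h = 0.0
    for (tag, cluster), count in joint.items():
        # p(t, c) * log2 p(t | c), accumulated with a minus sign
        h -= (count / n) * math.log2(count / cluster_counts[cluster])
    return h


def perplexity(pairs):
    """Perplexity 2 ** H(T|C)."""
    return 2 ** conditional_entropy(pairs)


# Toy data: one cluster that mixes two tags equally.
toy_pairs = [("NN", "C1"), ("JJ", "C1")]
print(conditional_entropy(toy_pairs), perplexity(toy_pairs))
```

A perplexity of 1 means each cluster determines the gold tag completely; larger values mean the clusters mix tags.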

Page 30

Example

From the familiar to the exotic, the collection is a delight

Word         Tag    Cluster
From         Prep   C200
the          At     C1
familiar     JJ     C331
to           Prep   C5
the          At     C1
exotic       JJ     C331
the          At     C1
collection   NN     C221
is           V      C3
a            At     C1
delight      NN     C220