A Social Network Approach to
Unsupervised Induction of Syntactic Clusters for Bengali
Monojit Choudhury, Microsoft Research India
Co-authors
Chris Biemann, University of Leipzig
Joydeep Nath, Animesh Mukherjee, Niloy Ganguly, Indian Institute of Technology Kharagpur
Language – A Complex System
Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse
Function: Communication through recursive syntax and compositional semantics
Dynamics: Evolution, Language change
Computational Linguistics
Study of language using computers
Study of language-using computers
Natural Language Processing: Speech recognition, Machine translation, Automatic summarization, Spell checkers, Information retrieval & extraction, …
Labeling of Text
Lexical Category (POS tags)
Syntactic Category (Phrases, chunks)
Semantic Role (Agent, theme, …)
Sense
Domain-dependent labeling (genes, proteins, …)
How to define the set of labels?
How to (learn to) predict them automatically?
Distributional Hypothesis
“A word is characterized by the company it keeps” – Firth, 1957
Syntax: function words (Harris, 1968)
Semantics: content words
Outline
Defining Context
Syntactic Network of Words
Complex Network: Theory & Applications
Chinese Whispers: Clustering the Network
Experiments
Topological Properties of the Networks
Evaluation
Future work
Feature Words
Estimate the unigram frequencies
Feature words: Most frequent m words
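The selection step above can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenized, lowercased text; the function name `top_feature_words` and the toy token list are hypothetical and not taken from the slides.

```python
from collections import Counter

def top_feature_words(tokens, m):
    """Return the m most frequent tokens, most frequent first."""
    counts = Counter(tokens)  # unigram frequencies
    return [word for word, _ in counts.most_common(m)]

# Toy input: the example sentence used elsewhere in the deck, lowercased.
tokens = "from the familiar to the exotic the collection is a delight".split()
print(top_feature_words(tokens, 1))
```

On a real corpus, m would be one of the values used in the experiments (25, 50, 100, 200).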
Feature Vector
Example sentence: From the familiar to the exotic, the collection is a delight

       fw1 (the)   fw2 (to)   …   fw199 (is)   fw200 (from)
p-2    0           0          …   0            1
p-1    1           0          …   0            0
p1     0           1          …   0            0
p2     1           0          …   0            0
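A minimal sketch of how such a vector could be built for one occurrence of a word: one one-hot block per context position (p-2, p-1, p1, p2), concatenated. The function name `context_vector` and the four-word feature list are assumptions for illustration. The rows on the slide are consistent with the word "familiar" as the target, which is an inference rather than something the slide states.

```python
def context_vector(tokens, i, feature_words, offsets=(-2, -1, 1, 2)):
    """Concatenate one one-hot block per context position around tokens[i]."""
    index = {w: j for j, w in enumerate(feature_words)}
    vec = []
    for off in offsets:
        block = [0] * len(feature_words)
        k = i + off
        # Mark the feature word found at this position, if any.
        if 0 <= k < len(tokens) and tokens[k] in index:
            block[index[tokens[k]]] = 1
        vec.extend(block)
    return vec

tokens = "from the familiar to the exotic the collection is a delight".split()
# Stand-ins for fw1, fw2, fw199, fw200 (the real list has m entries).
feature_words = ["the", "to", "is", "from"]
vec = context_vector(tokens, tokens.index("familiar"), feature_words)
rows = [vec[j:j + 4] for j in range(0, len(vec), 4)]
for name, row in zip(["p-2", "p-1", "p1", "p2"], rows):
    print(name, row)
```

How vectors from several occurrences of the same word are combined (for example by summing) is not shown on the slide and is left out here.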
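Syntactic Network of Words
[Figure: network over the words light, color, red, blue, blood, sky, heavy, weight; numeric labels shown: 100, 20, 1, 1]
Edge label shown for red and blue: 1 - cos(red, blue)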
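The quantity 1 - cos(red, blue) can be computed from two feature vectors as sketched below. The vectors are made-up toy values, and the helper names are hypothetical. Whether the network uses this value directly as an edge weight or transforms it further is not stated on the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (nu * nv)

def one_minus_cos(u, v):
    """The quantity shown on the slide: 1 - cos(u, v)."""
    return 1.0 - cosine(u, v)

# Hypothetical feature vectors, for illustration only.
red = [3, 0, 1, 2]
blue = [2, 0, 1, 3]
sky = [0, 4, 0, 0]
print(one_minus_cos(red, blue))  # small: similar contexts
print(one_minus_cos(red, sky))   # large: no shared contexts
```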
The Chinese Whispers Algorithm
[Figure: word network over light, color, red, blue, blood, sky, heavy, weight, with edge weights 0.9, 0.5, 0.9, 0.7, 0.8, -0.5]
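A minimal sketch of Chinese Whispers label propagation, assuming the commonly described form of the algorithm: every node starts in its own class, and nodes are then visited in random order, each adopting the class with the largest total edge weight among its neighbors. The edge list is invented for illustration (the slide does not say which weight belongs to which edge), and the iteration count and seed are arbitrary.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """edges: iterable of (u, v, weight). Returns {node: class label}."""
    rng = random.Random(seed)
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    labels = {node: node for node in adj}  # every node starts in its own class
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)  # random visiting order in each pass
        for node in nodes:
            score = defaultdict(float)
            for nbr, w in adj[node].items():
                score[labels[nbr]] += w  # neighbors vote with edge weight
            if score:
                labels[node] = max(score, key=score.get)
    return labels

# Hypothetical edges over some of the words in the figure.
edges = [
    ("red", "blue", 0.9), ("blue", "light", 0.7), ("red", "light", 0.5),
    ("sky", "blood", 0.8), ("blood", "color", 0.9), ("sky", "color", 0.5),
]
clusters = defaultdict(list)
for word, label in chinese_whispers(edges).items():
    clusters[label].append(word)
print(list(clusters.values()))
```

The result can depend on the visiting order, so different seeds may give different partitions on less clear-cut graphs.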
Experiments
Corpus: Anandabazaar Patrika (17M words)
We build networks $G_{n,m}$
n: corpus size, in {1M, 2M, 5M, 10M, 17M}
m: number of feature words, in {25, 50, 100, 200}
Number of nodes: 5000
Number of edges: ~150,000
Topological Properties: Cumulative Degree Distribution
[Figure: $P_k$ vs. $k$ for $G_{17M,50}$]
$P_k \sim -\log(k)$
$p_k = -\frac{dP_k}{dk} \sim \frac{1}{k}$: Zipfian distribution!
CDD: $P_k$ is the probability that a randomly chosen node has degree ≥ k
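The definition of the CDD can be sketched directly from a list of node degrees. The degree list is a toy example, and the function name is hypothetical.

```python
def cumulative_degree_distribution(degrees):
    """Return {k: P_k}, where P_k is the fraction of nodes with degree >= k."""
    n = len(degrees)
    return {
        k: sum(1 for d in degrees if d >= k) / n
        for k in sorted(set(degrees))
    }

degrees = [1, 1, 2, 3, 3, 5]  # hypothetical degrees of six nodes
cdd = cumulative_degree_distribution(degrees)
for k, p in cdd.items():
    print(k, round(p, 3))
```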
Topological Properties: Clustering Coefficient
Measures transitivity of the network or equivalently the proportion of triangles
Very small for random graphs, high for social networks
Mean CC for $G_{17M,50}$: 0.53
[Figure: CC vs. degree]
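A minimal sketch of the clustering coefficient, assuming the standard unweighted local definition (the fraction of a node's neighbor pairs that are themselves connected), averaged over all nodes. Whether the reported 0.53 uses this variant or a weighted one is not stated on the slide. The small graph is invented.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: no neighbor pairs, coefficient 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def mean_clustering(adj):
    """Average of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Hypothetical undirected graph: a triangle a-b-c with a pendant node d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(mean_clustering(adj))
```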
Topological Properties: Cluster Size Distribution
[Figure: cluster size vs. rank, two panels: Variation with n (m = 50); Variation with m (n = 17M)]
Evaluation: Tag Entropy
w: {t1, t6, t9}
Tag_w:   1 0 0 0 0 1 0 0 1 0
Cluster C: {w1, w2, w3, w4}
Tag_w1:  1 0 0 0 0 1 0 0 1 0
Tag_w2:  0 0 1 0 0 1 0 0 1 0
Tag_w3:  0 0 0 0 0 1 0 0 1 0
Tag_w4:  1 0 1 0 0 1 0 0 1 0
TE(C) =  1 0 1 0 0 0 0 0 0 0 = 2
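The slide does not spell out the formula for TE(C). The sketch below assumes one plausible reading: for each tag position, take the binary entropy of the fraction of words in the cluster that carry that tag, and sum over positions. Under this assumption the four example vectors give 2, matching the slide. Simply counting the positions on which the words disagree would also give 2 here, so this example alone does not confirm the formula.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def tag_entropy(tag_vectors):
    """Sum of per-position binary entropies (assumed reading of TE)."""
    n = len(tag_vectors)
    total = 0.0
    for column in zip(*tag_vectors):  # one column per tag position
        total += binary_entropy(sum(column) / n)
    return total

cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w1
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w2
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],  # w3
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],  # w4
]
print(tag_entropy(cluster))  # 2.0
```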
Mean Tag Entropy
$MTE = \frac{1}{N} \sum_{i=1}^{N} TE(C_i)$
$\text{Weighted } MTE = \frac{\sum_{i} |C_i| \, TE(C_i)}{\sum_{i} |C_i|}$
Caveat: Placing every word in a separate cluster gives an MTE and WMTE of 0
Baseline: Every word in a single cluster
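The two averages can be sketched as below, taking per-cluster tag entropies and cluster sizes as given. The numbers are invented for illustration, and the function names are hypothetical.

```python
def mean_tag_entropy(entropies):
    """MTE: plain average of TE(C_i) over the N clusters."""
    return sum(entropies) / len(entropies)

def weighted_mean_tag_entropy(entropies, sizes):
    """WMTE: average of TE(C_i) weighted by cluster size |C_i|."""
    total = sum(size * te for te, size in zip(entropies, sizes))
    return total / sum(sizes)

# Hypothetical clusters: TE values and sizes.
entropies = [2.0, 0.0, 1.0]
sizes = [4, 1, 5]
print(mean_tag_entropy(entropies))
print(weighted_mean_tag_entropy(entropies, sizes))

# The caveat: singleton clusters all have TE = 0, so both scores are 0.
print(mean_tag_entropy([0.0, 0.0, 0.0]))
```

Weighting by size keeps many small, pure clusters from hiding one large, mixed cluster.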
Tag Entropy vs. Corpus Size
m = 50
% Reduction in Tag Entropy

        1M      2M      5M      10M     17M
MTE     74.49   75.14   76.09   78.29   74.94
WMTE    17.46   18.68   24.23   27.56   30.60
The Bigger the Worse!
[Figure: Tag Entropy plotted against Cluster Size]
Clusters …
Big ones are bad ones: a mix of everything!
Medium sized clusters are good
http://banglaposclusters.googlepages.com/home
Rank Size Type
5 596 Proper nouns, titles and posts
6 352 Possessive case of nouns (common, proper, verbal) and pronouns
8 133 Nouns (common, verbal) forming compounds with “do” or “be”
11 44 Number-Classifier (e.g. 1-TA, ekaTA)
12 84 Adjectives
More Observations
Words are split into
First name vs. Surnames
Animate nouns-poss vs. Inanimate nouns-poss
Nouns-acc vs. Nouns-poss vs. Nouns-loc
Verb-finite vs. Verb-infinitive
Syntactic or semantic?
Nouns related to professions, months, days of week, stars, players etc.
Advantages
No labeled data required: A good solution to resource scarcity
No prior class information: Circumvents issues related to tag set definition
Computational definition of Class
Understanding the structure of language (Syntax) and its evolution
Danke für Ihre Aufmerksamkeit.
Dieses ist „vom Übersetzer übersetzt worden, der“ von Phasen Microsoft Beta ist.
Thank you for your attention
This has been translated by "Translator Beta" from Microsoft Live.
Related Work
Harris, 68: Distributional hypothesis for syntactic classes
Miller and Charles, 91: Function words as features
Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique
Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging
Dasgupta and Ng, 07: Bengali POS induction through morphological features
Medium and Low Frequency Words
Neighboring (window 4) co-occurrences ranked by log-likelihood thresholded by θ
Two words are connected iff they share at least 4 neighbors
Language English Finnish German
Nodes 52857 85627 137951
Edges 691241 702349 1493571
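The connection rule for medium and low frequency words (two words are linked iff they share at least 4 neighbors) can be sketched as below. The neighbor sets are taken as given: the log-likelihood ranking and the threshold θ that produce them are not implemented. The toy sets are invented, and using the shared-neighbor count as the edge weight is an assumption.

```python
from itertools import combinations

def shared_neighbor_edges(neighbors, min_shared=4):
    """neighbors: {word: set of significant co-occurring words}.

    Returns {(u, v): number of shared neighbors} for every pair of words
    that share at least `min_shared` neighbors.
    """
    edges = {}
    for u, v in combinations(sorted(neighbors), 2):
        shared = len(neighbors[u] & neighbors[v])
        if shared >= min_shared:
            edges[(u, v)] = shared
    return edges

# Hypothetical neighbor sets after thresholding.
neighbors = {
    "sky": {"the", "blue", "a", "in", "clear"},
    "sea": {"the", "blue", "a", "in", "deep"},
    "run": {"to", "a"},
}
print(shared_neighbor_edges(neighbors))
```

Comparing all pairs is quadratic in the vocabulary size. For graphs as large as those in the table, an inverted index from neighbor to words would be the more practical choice.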
Construction of Lexicon
Each word assigned a unique tag based on the word class it belongs to
Class 1: sky, color, blood, weight
Class 2: red, blue, light, heavy
Ambiguous words:
High and medium frequency words that formed singleton clusters
Possible tags of neighboring clusters
Training and Evaluation
Unsupervised training of trigram HMM using the clusters and lexicon
Evaluation:
Tag a text for which a gold standard is available
Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$
Final Results: English: 2.05 (619/345), Finnish: 3.22 (625/466), German: 1.79 (781/440)
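A minimal sketch of the evaluation measure, assuming the usual definition of conditional entropy estimated from token-level counts of (induced cluster, gold tag) pairs. The tiny sample is invented, and the function name is hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """pairs: iterable of (cluster, gold_tag), one per token.

    Returns H(T|C) in bits, estimated from relative frequencies.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    cluster_counts = Counter(c for c, _ in pairs)
    h = 0.0
    for (c, t), count in joint.items():
        # p(c, t) * log2 p(t | c)
        h -= (count / n) * math.log2(count / cluster_counts[c])
    return h

# Hypothetical tagged tokens: cluster C1 is pure, C331 is split over two tags.
pairs = [("C1", "At"), ("C1", "At"), ("C331", "JJ"), ("C331", "NN")]
h = conditional_entropy(pairs)
print(h, 2 ** h)  # entropy and the related perplexity
```

A perplexity of 1 would mean that each cluster determines the gold tag completely.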
Example
From the familiar to the exotic, the collection is a delight

Word:     From   the   familiar   to     the   exotic   the   collection   is   a    delight
Tag:      Prep   At    JJ         Prep   At    JJ       At    NN           V    At   NN
Cluster:  C200   C1    C331       C5     C1    C331     C1    C221         C3   C1   C220