
A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali

Monojit Choudhury, Microsoft Research India

[email protected]

Co-authors

Chris Biemann, University of Leipzig

Joydeep Nath, Animesh Mukherjee, Niloy Ganguly, Indian Institute of Technology Kharagpur

Language – A Complex System

Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse

Function: Communication through recursive syntax and compositional semantics

Dynamics: Evolution, language change

Computational Linguistics

Study of language using computers

Study of language-using computers

Natural Language Processing: speech recognition, machine translation, automatic summarization, spell checkers, information retrieval & extraction, …

Labeling of Text

Lexical Category (POS tags)

Syntactic Category (phrases, chunks)

Semantic Role (agent, theme, …)

Sense

Domain-dependent labeling (genes, proteins, …)

How to define the set of labels?

How to (learn to) predict them automatically?

Distributional Hypothesis

“A word is characterized by the company it keeps” – Firth, 1957

Syntax: function words (Harris, 1968)

Semantics: content words

Outline

Defining Context

Syntactic Network of Words

Complex Network: Theory & Applications

Chinese Whispers: Clustering the Network

Experiments

Topological Properties of the Networks

Evaluation

Future work

Feature Words

Estimate the unigram frequencies

Feature words: Most frequent m words
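The selection step above can be sketched in a few lines of Python. This is a minimal illustration, assuming the corpus is already tokenized into a list of strings; the function name `select_feature_words` is hypothetical and not taken from the slides.

```python
from collections import Counter


def select_feature_words(tokens, m):
    """Return the m most frequent tokens, most frequent first."""
    counts = Counter(tokens)  # unigram frequencies
    return [word for word, _ in counts.most_common(m)]


if __name__ == "__main__":
    corpus = "the cat and the dog and the bird".split()
    print(select_feature_words(corpus, 2))
```

Tokenization and case handling are left to the caller here; the slides do not say how the Bengali corpus was preprocessed.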

Feature Vector

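Example sentence: From the familiar to the exotic, the collection is a delight

Feature vector for the target word "familiar" (one row per context position, one column per feature word):

       fw1 (the)   fw2 (to)   …   fw199 (is)   fw200 (from)
p-2    0           0          …   0            1
p-1    1           0          …   0            0
p+1    0           1          …   0            0
p+2    1           0          …   0            0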
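The feature vector on this slide can be reproduced with a short sketch. It assumes a window of two positions on each side of the target word (p-2, p-1, p+1, p+2), lowercased tokens and no punctuation; the function name `context_vector` and the flat, concatenated layout of the four blocks are illustrative choices, not details from the slides.

```python
def context_vector(tokens, index, feature_words, offsets=(-2, -1, 1, 2)):
    """Concatenate one block of len(feature_words) counts per context offset."""
    column = {word: i for i, word in enumerate(feature_words)}
    m = len(feature_words)
    vector = [0] * (len(offsets) * m)
    for block, offset in enumerate(offsets):
        j = index + offset
        if 0 <= j < len(tokens):          # skip positions outside the sentence
            col = column.get(tokens[j])
            if col is not None:           # only feature words are counted
                vector[block * m + col] += 1
    return vector


if __name__ == "__main__":
    sentence = "from the familiar to the exotic the collection is a delight".split()
    features = ["the", "to", "is", "from"]  # stand-ins for fw1, fw2, fw199, fw200
    print(context_vector(sentence, sentence.index("familiar"), features))
```

A vector for a word type would presumably be obtained by summing these vectors over all occurrences of the word in the corpus; that aggregation step is an assumption and is not shown on the slide.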

Syntactic Network of Words

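[Figure: example word network with nodes light, color, red, blue, blood, sky, heavy, weight. Edge labels shown: 100, 20, 1, 1. The edge between red and blue is annotated with 1 - cos(red, blue).]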
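As a rough sketch of how such a network could be computed from feature vectors, the code below evaluates the quantity written on the slide, 1 - cos(u, v), for every pair of words. The function names and the dense dictionary-of-pairs representation are assumptions for illustration; the slides do not state how this value is thresholded or converted into the final edge weights.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_one_minus_cosine(vectors):
    """Map each unordered word pair (u, v), u < v, to 1 - cos(u, v)."""
    words = sorted(vectors)
    values = {}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            values[(u, v)] = 1.0 - cosine(vectors[u], vectors[v])
    return values


if __name__ == "__main__":
    toy = {"red": [1, 0], "blue": [1, 0], "sky": [0, 1]}
    print(pairwise_one_minus_cosine(toy))
```

Computing all pairs is quadratic in the number of words, which is manageable at the 5000-node scale mentioned in the experiments.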

The Chinese Whisper Algorithm

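[Figure: the same word network (light, color, red, blue, blood, sky, heavy, weight) with edge weights 0.9, 0.5, 0.9, 0.7, 0.8, -0.5, used to illustrate the clustering step.]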
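The following is a minimal sketch of the Chinese Whispers procedure as it is commonly described: every node starts in its own class, and nodes are then visited in random order, each adopting the class with the largest total edge weight among its neighbors. The function name, the `iterations` and `seed` parameters, and the random tie-breaking are choices made for this sketch; it is not the authors' implementation, and it does not treat negative weights specially.

```python
import random
from collections import defaultdict


def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph.

    edges maps (u, v) pairs to weights; returns a dict node -> class label.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    labels = {node: node for node in nodes}  # each node starts as its own class
    order = list(nodes)
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not neighbors[node]:
                continue  # isolated nodes keep their initial class
            scores = defaultdict(float)
            for other, weight in neighbors[node].items():
                scores[labels[other]] += weight
            best = max(scores.values())
            candidates = sorted(lab for lab, s in scores.items() if s == best)
            new_label = rng.choice(candidates)  # ties are broken at random
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # no label changed in a full pass
    return labels


if __name__ == "__main__":
    words = ["red", "blue", "sky", "blood", "light"]
    toy_edges = {("red", "blue"): 0.9, ("sky", "blood"): 0.8}
    print(chinese_whispers(words, toy_edges))
```

Because class labels only travel along edges, words in different connected components can never end up in the same class. On larger graphs the result can depend on the visiting order, so fixing the seed is useful for reproducibility.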

Experiments

Corpus: Anandabazaar Patrika (17M words)

We build networks Gn,m

n: corpus size – {1M, 2M, 5M, 10M, 17M}

m: number of feature words – {25, 50, 100, 200}

Number of nodes: 5000

Number of edges: ~150,000

Topological Properties: Cumulative Degree Distribution

CDD: $P_k$ is the probability that a randomly chosen node has degree ≥ k

[Plot: $P_k$ vs. $k$ for G17M,50]

$P_k \sim -\log(k) \;\Rightarrow\; p_k = -\,dP_k/dk \sim 1/k$: Zipfian distribution!
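For a concrete reading of the definition, the sketch below computes the cumulative degree distribution from a list of node degrees. The function name is hypothetical and the input is assumed to be a plain list of non-negative integers.

```python
def cumulative_degree_distribution(degrees):
    """Return {k: fraction of nodes with degree >= k} for k = 1 .. max degree."""
    n = len(degrees)
    if n == 0:
        return {}
    return {
        k: sum(1 for d in degrees if d >= k) / n
        for k in range(1, max(degrees) + 1)
    }


if __name__ == "__main__":
    print(cumulative_degree_distribution([1, 2, 2, 3]))
```

Plotting these values against k on a logarithmic k-axis is one way to check for the roughly linear decay in log(k) that the slide describes.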

Topological Properties: Clustering Coefficient

Measures transitivity of the network or equivalently the proportion of triangles

Very small for random graphs, high for social networks

Mean CC for G17M,50: 0.53

[Plot: CC vs. degree]
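The clustering coefficient can be illustrated with the standard unweighted definition: for a node with k neighbors, the fraction of the k(k-1)/2 neighbor pairs that are themselves connected. The sketch below assumes the graph is given as a dictionary from node to a set of neighbors; whether the slides use this unweighted form or a weighted variant is not stated.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are connected to each other."""
    nbrs = adjacency[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbor pairs to close into triangles
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def mean_clustering_coefficient(adjacency):
    """Average of the per-node clustering coefficients."""
    return sum(clustering_coefficient(adjacency, n) for n in adjacency) / len(adjacency)


if __name__ == "__main__":
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    print(mean_clustering_coefficient(triangle))
```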

Topological Properties: Cluster Size Distribution

[Plots: cluster size vs. rank, in two panels: variation with n (m = 50) and variation with m (n = 17M).]

Evaluation: Tag Entropy

A word w with possible tags {t1, t6, t9} has the binary tag vector

Tag_w = 1 0 0 0 0 1 0 0 1 0

Cluster C: {w1, w2, w3, w4}

         t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
w1       1   0   0   0   0   1   0   0   1   0
w2       0   0   1   0   0   1   0   0   1   0
w3       0   0   0   0   0   1   0   0   1   0
w4       1   0   1   0   0   1   0   0   1   0
TE(C)    1   0   1   0   0   0   0   0   0   0    = 2
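One reading that reproduces the value TE(C) = 2 in this example is to sum, over tag positions, the binary entropy of the fraction of words in the cluster that have that tag. The sketch below implements that reading; it is an assumption inferred from the example, since the slide does not spell out the formula, and the function name is hypothetical.

```python
import math


def tag_entropy(tag_vectors):
    """Sum over tag positions of the binary entropy of the share of words with that tag."""
    n = len(tag_vectors)
    total = 0.0
    for column in zip(*tag_vectors):      # one column per tag
        p = sum(column) / n
        if 0.0 < p < 1.0:                 # unanimous columns contribute 0
            total -= p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
    return total


if __name__ == "__main__":
    cluster = [
        [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    ]
    print(tag_entropy(cluster))
```

In the example, only t1 and t3 are split within the cluster (two of four words each), so each contributes one bit and all other positions contribute zero.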

Mean Tag Entropy

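$\mathrm{MTE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TE}(C_i)$, where $N$ is the number of clusters

$\mathrm{WMTE} = \frac{\sum_{i=1}^{N} |C_i| \, \mathrm{TE}(C_i)}{\sum_{i=1}^{N} |C_i|}$

Caveat: putting every word in a separate cluster gives MTE = WMTE = 0

Baseline: every word in a single cluster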
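The two averages can be sketched as follows, assuming each cluster is summarized by a (size, tag entropy) pair. The function names are hypothetical.

```python
def mean_tag_entropy(clusters):
    """Unweighted mean of the tag entropies; clusters is a list of (size, te) pairs."""
    return sum(te for _, te in clusters) / len(clusters)


def weighted_mean_tag_entropy(clusters):
    """Mean of the tag entropies weighted by cluster size."""
    total_size = sum(size for size, _ in clusters)
    return sum(size * te for size, te in clusters) / total_size


if __name__ == "__main__":
    toy = [(4, 2.0), (1, 0.0)]
    print(mean_tag_entropy(toy), weighted_mean_tag_entropy(toy))
```

With one four-word cluster of entropy 2 and one singleton of entropy 0, the unweighted mean is 1.0 while the weighted mean is 1.6, which shows how the weighting keeps many tiny, trivially pure clusters from dominating the score.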

Tag Entropy vs. Corpus Size

% reduction in tag entropy (m = 50)

Corpus size   1M      2M      5M      10M     17M
MTE           74.49   75.14   76.09   78.29   74.94
WMTE          17.46   18.68   24.23   27.56   30.60

The Bigger the worse!

[Plot: tag entropy vs. cluster size]

Clusters …

Big ones are bad ones: a mix of everything!

Medium sized clusters are good

http://banglaposclusters.googlepages.com/home

Rank   Size   Type
5      596    Proper nouns, titles and posts
6      352    Possessive case of nouns (common, proper, verbal) and pronouns
8      133    Nouns (common, verbal) forming compounds with “do” or “be”
11     44     Number-Classifier (e.g. 1-TA, ekaTA)
12     84     Adjectives

More Observations

Words are split into:

First names vs. surnames

Animate noun-poss vs. inanimate noun-poss

Nouns-acc vs. nouns-poss vs. nouns-loc

Verb-finite vs. verb-infinitive

Syntactic or semantic?

Nouns related to professions, months, days of week, stars, players etc.

Advantages

No labeled data required: a good solution to resource scarcity

No prior class information: Circumvents issues related to tag set definition

Computational definition of Class

Understanding the structure of language (syntax) and its evolution

Danke für Ihre Aufmerksamkeit.

Dieses ist „vom Übersetzer übersetzt worden, der“ von Phasen Microsoft Beta ist.

Thank you for your attention

This has been translated by "Translator Beta" from Microsoft Live.

Related Work

Harris, 68: Distributional hypothesis for syntactic classes

Miller and Charles, 91: Function words as features

Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique

Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging

Dasgupta and Ng, 07: Bengali POS induction through morphological features

Medium and Low Frequency Words

Neighboring (window 4) co-occurrences ranked by log-likelihood thresholded by θ

Two words are connected iff they share at least 4 neighbors

Language   English   Finnish   German
Nodes      52857     85627     137951
Edges      691241    702349    1493571
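The connection rule for medium and low frequency words (an edge iff two words share at least 4 neighbors) can be sketched as below. The sketch assumes the significant co-occurrence neighbors of each word have already been computed and thresholded; the log-likelihood ranking itself is not shown, and the function name and the use of the shared-neighbor count as an edge weight are illustrative assumptions.

```python
from itertools import combinations


def shared_neighbor_edges(neighbors, min_shared=4):
    """Connect two words iff their neighbor sets overlap in at least min_shared items.

    neighbors maps each word to a set of its significant co-occurrence neighbors.
    Returns {(u, v): number of shared neighbors} with u < v.
    """
    edges = {}
    for u, v in combinations(sorted(neighbors), 2):
        shared = len(neighbors[u] & neighbors[v])
        if shared >= min_shared:
            edges[(u, v)] = shared
    return edges


if __name__ == "__main__":
    toy = {
        "red": {"a", "b", "c", "d", "e"},
        "blue": {"a", "b", "c", "d"},
        "sky": {"a", "x", "y", "z"},
    }
    print(shared_neighbor_edges(toy))
```

Comparing all pairs is quadratic in the vocabulary size; for graphs of the size in the table, an inverted index from neighbor to words would likely be needed in practice.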

Construction of Lexicon

Each word is assigned a unique tag based on the word class it belongs to

Class 1: sky, color, blood, weight

Class 2: red, blue, light, heavy

Ambiguous words:

High and medium frequency words that formed singleton clusters

Possible tags of neighboring clusters

Training and Evaluation

Unsupervised training of trigram HMM using the clusters and lexicon

Evaluation:

Tag a text for which a gold standard is available

Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$

Final Results: English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440)
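The evaluation measure can be illustrated with a short sketch that estimates the conditional entropy H(T|C) of gold tags T given induced cluster labels C from aligned (tag, cluster) pairs, and the corresponding perplexity 2^H(T|C). The function names and the token-level maximum-likelihood estimate are assumptions for illustration.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """H(T|C) in bits, estimated from a sequence of (gold_tag, cluster) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)
    cluster_counts = Counter(cluster for _, cluster in pairs)
    entropy = 0.0
    for (_, cluster), count in joint.items():
        # p(t, c) * log2 p(t | c)
        entropy -= (count / total) * math.log2(count / cluster_counts[cluster])
    return entropy


def perplexity(pairs):
    """2 ** H(T|C); equals 1.0 when each cluster maps to exactly one gold tag."""
    return 2.0 ** conditional_entropy(pairs)


if __name__ == "__main__":
    tagged = [("At", "C1"), ("NN", "C221"), ("V", "C3"), ("At", "C1"), ("NN", "C220")]
    print(perplexity(tagged))
```

A perplexity of 1 means the cluster label fully determines the gold tag; higher values mean more gold tags are mixed within clusters.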

Example

From the familiar to the exotic, the collection is a delight

Word         Gold tag   Cluster
From         Prep       C200
the          At         C1
familiar     JJ         C331
to           Prep       C5
the          At         C1
exotic       JJ         C331
the          At         C1
collection   NN         C221
is           V          C3
a            At         C1
delight      NN         C220
