Assignment 4 - Clustering Languages

Verena Blaschke

July 04, 2018

Assignment 4

I: Feature extraction

II: K-means clustering

III: Principal component analysis

IV: Evaluation with gold-standard labels

V: Calculating distances

VI: Hierarchical clustering

I: Feature extraction

fin s i l m æ
fin k O r V A
fin n E n æ
fin s u u
...
cmn th U N t ù 1
cmn õ @ n n a I

• 80 languages × 272 IPA segments
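A minimal sketch of how such a languages × segments matrix could be built. It assumes each input line is a language code followed by whitespace-separated segments, and that the features are raw segment counts; the slides state neither, and the name `segment_counts` is hypothetical. The toy lines are copied from the sample above.

```python
from collections import Counter, defaultdict

import numpy as np


def segment_counts(lines):
    """Build a languages x segments count matrix from 'lang seg seg ...' lines."""
    counts = defaultdict(Counter)
    for line in lines:
        lang, *segments = line.split()
        counts[lang].update(segments)
    languages = sorted(counts)
    inventory = sorted({seg for c in counts.values() for seg in c})
    # A Counter returns 0 for segments a language never uses.
    features = np.array(
        [[counts[lang][seg] for seg in inventory] for lang in languages]
    )
    return languages, inventory, features


# Toy input: three of the sample lines (assumed format, see above).
toy = ["fin s i l m æ", "fin n E n æ", "cmn õ @ n n a I"]
languages, inventory, features = segment_counts(toy)
print(languages, features.shape)
```

On the full data set the same procedure would yield an 80 × 272 matrix, one row per language and one column per segment.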

II: K-means clustering

• k-means clustering for k in [2,70]
• silhouette coefficient
  • How close (=similar) is each data point to other points from its own cluster compared to other clusters?
  • [−1,+1], higher scores are better
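The sweep over k can be sketched as follows. This is an illustration, not the assignment's solution: the feature matrix is replaced by synthetic blobs from `make_blobs`, the range of k is shortened, and `n_init` and `random_state` are set only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the real feature matrix.
features, _ = make_blobs(n_samples=80, centers=4, n_features=10, random_state=0)

silhouette = {}
for k in range(2, 11):  # the slides sweep k over [2, 70]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    # Mean silhouette coefficient over all data points, in [-1, +1].
    silhouette[k] = silhouette_score(features, labels)

best_k = max(silhouette, key=silhouette.get)
print(best_k, silhouette[best_k])
```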

[Figure: Silhouette score by number of clusters. x-axis: number of clusters (0 to 70); y-axis: silhouette score (0.04 to 0.16).]

II: K-means clustering

• error function
  • sum of squared distances from the closest centroids: kmeans.inertia_
  • What is a good number of clusters? → elbow method
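A small sketch of the error curve used for the elbow method, again on synthetic stand-in data rather than the language features. It also recomputes the error by hand for one k to show what `inertia_` measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the real feature matrix.
features, _ = make_blobs(n_samples=80, centers=4, n_features=10, random_state=0)

errors = {}
for k in range(2, 11):  # the slides sweep k over [2, 70]
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    errors[k] = kmeans.inertia_

# By hand for k = 4: squared distance of every point to its assigned centroid.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
closest = kmeans.cluster_centers_[kmeans.labels_]
manual_error = ((features - closest) ** 2).sum()
print(manual_error, kmeans.inertia_)
```

Plotting `errors` against k and looking for the point where the curve flattens gives the elbow.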

[Figure: Sum of squared distances from the centroids by number of clusters. x-axis: number of clusters (0 to 70); y-axis: error (0.0 to 1.0, scale 1e8).]

III: Principal component analysis

• remove redundant features
• remove noise
• train machine learning models more quickly

[Figure: Feature vectors scaled down to 2 dimensions. Scatter plot with one point per language, each labelled with its language code.]
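A projection like the one in the figure can be obtained with a two-component PCA. The sketch below uses random Poisson counts as a hypothetical stand-in for the real 80 × 272 matrix, so the resulting coordinates carry no linguistic meaning.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical count data with the same shape as the feature matrix.
rng = np.random.default_rng(0)
features = rng.poisson(lam=20, size=(80, 272)).astype(float)

# Keep only the two components with the highest explained variance.
coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)
```

Each row of `coords` is one language; plotting the two columns against each other and annotating the points with the language codes gives a scatter plot of this kind.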

[Figure: Feature vectors scaled down to 2 dimensions. Same scatter plot of language codes, with a legend of language families: Turkic, Indo-European, Uralic, Nakh-Daghestanian, Dravidian.]

III: Principal component analysis

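from sklearn.decomposition import PCA

# Manual version: add components until 90% of the variance is explained.
# PCA() keeps min(n_samples, n_features) components and must be fitted first.
pca = PCA().fit(features)
d = 0
var_explained = 0
while var_explained < 0.9:
    var_explained += pca.explained_variance_ratio_[d]
    d += 1

featuresPCA = PCA(d).fit_transform(features)

# Shortcut: a float in (0, 1) selects the number of components automatically.
pca = PCA(0.9).fit(features)
print(pca.n_components_)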

[Figure: Variance explained per PCA component. x-axis: component (0 to 20); y-axis: variance explained (0.0 to 1.0). Legend: variance explained (cumulative), variance explained per component.]

IV: Evaluation with gold-standard labels

from sklearn.cluster import KMeans

# One cluster per gold-standard language family.
n_fam = len(set(family))
pred_all = KMeans(n_fam).fit_predict(features)
pred_pca = KMeans(n_fam).fit_predict(featuresPCA)

lang  all  pca  family
---------------------------------------------------
kan   2    4    Dravidian
tam   3    0    Dravidian
tel   4    0    Dravidian
mal   4    2    Dravidian
bul   0    1    Indo-European
ces   0    1    Indo-European
...

IV: Evaluation with gold-standard labels

• Homogeneity: Each cluster contains only data points of the same gold-standard class.
• Completeness: All members of a gold-standard class are in the same cluster.
• V-measure: Harmonic mean of homogeneity and completeness.
• [0,1], higher is better

      H       C       V
all   0.1707  0.1461  0.1575
PCA   0.1728  0.1572  0.1646
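The three scores are available in `sklearn.metrics`. As a toy illustration, the sketch below scores only the six rows of the cluster-assignment table shown earlier (column `all`), so its values are not comparable to the full-data scores above.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Six rows from the table: kan, tam, tel, mal, bul, ces.
family = ["Dravidian"] * 4 + ["Indo-European"] * 2
pred_all = [2, 3, 4, 4, 0, 0]

h = homogeneity_score(family, pred_all)
c = completeness_score(family, pred_all)
v = v_measure_score(family, pred_all)
print(h, c, v)
```

On this subset every cluster holds languages of a single family, so homogeneity is perfect, while the Dravidian languages are spread over three clusters, which lowers completeness.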

V: Calculating distances

     A    B    C    D    E
A       123  452   10  572
B            342  370  908
C                 127  754
D                       23
E
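Reading the example as the upper triangle of a symmetric distance matrix, its ten values written row by row are exactly SciPy's condensed distance format. The sketch below expands them to a square matrix and, as an assumption about the intended method, computes Euclidean distances with `pdist` on hypothetical toy features; the slides do not name the metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Upper triangle of the A..E example, row by row (condensed form).
condensed = np.array([123, 452, 10, 572, 342, 370, 908, 127, 754, 23], dtype=float)
square = squareform(condensed)  # 5 x 5, symmetric, zero diagonal
print(square)

# Pairwise distances for toy feature vectors (hypothetical data).
rng = np.random.default_rng(0)
toy_features = rng.poisson(lam=20, size=(5, 8)).astype(float)
dist = pdist(toy_features, metric="euclidean")  # n * (n - 1) / 2 values
print(dist.shape)
```

The condensed vector returned by `pdist` is the form that `scipy.cluster.hierarchy.linkage` accepts as input.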

VI: Hierarchical clustering

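import matplotlib.pyplot as plt
import scipy.cluster.hierarchy

# One dendrogram per linkage method, saved to its own file.
for m in ['single', 'complete', 'average']:
    fig, ax = plt.subplots()
    z = scipy.cluster.hierarchy.linkage(dist, method=m)
    scipy.cluster.hierarchy.dendrogram(z, labels=languages, ax=ax)
    fig.savefig('dendrogram-{}.pdf'.format(m))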

[Figure: Hierarchical clustering with single linking. Dendrogram with the language codes as leaf labels; axis ticks from 0 to 7000.]

[Figure: Hierarchical clustering with complete linking. Dendrogram with the language codes as leaf labels; axis ticks from 0 to 15000.]

[Figure: Hierarchical clustering with average linking. Dendrogram with the language codes as leaf labels; axis ticks from 0 to 10000.]