assignment 4 - clustering languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por...
TRANSCRIPT
![Page 1: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/1.jpg)
Assignment 4Clustering Languages
Verena Blaschke
July 04, 2018
![Page 2: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/2.jpg)
Assignment 4
I: Feature extraction
II: K-means clustering
III: Principal component analysis
IV: Evaluation with gold-standard labels
V: Calculating distances
VI: Hierarchical clustering
![Page 3: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/3.jpg)
I: Feature extraction
fin s i l m æfin k O r V Afin n E n æfin s u u...cmn th U N t ù 1cmn õ @ n n a I
Ï 80 languages × 272 IPA segments
![Page 4: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/4.jpg)
I: Feature extraction
fin s i l m æfin k O r V Afin n E n æfin s u u...cmn th U N t ù 1cmn õ @ n n a I
Ï 80 languages × 272 IPA segments
![Page 5: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/5.jpg)
II: K-means clustering
Ï k-means clustering for k in [2,70]Ï silhouette coefficient
Ï How close (=similar) is each data point to other points fromits own cluster compared to other clusters?
Ï [−1,+1], higher scores are better
![Page 6: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/6.jpg)
II: K-means clustering
Ï k-means clustering for k in [2,70]Ï silhouette coefficient
Ï How close (=similar) is each data point to other points fromits own cluster compared to other clusters?
Ï [−1,+1], higher scores are better
![Page 7: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/7.jpg)
0 10 20 30 40 50 60 70number of clusters
0.04
0.06
0.08
0.10
0.12
0.14
0.16si
lhouett
e s
core
Silhouette score by number of clusters.
![Page 8: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/8.jpg)
II: K-means clustering
Ï k-means clustering for k in [2,70]Ï silhouette coefficient
Ï How close (=similar) is each data point to other points fromits own cluster compared to other clusters?
Ï [−1,+1], higher scores are betterÏ error function
Ï sum of squared distances from the closest centroidskmeans.inertia_
Ï What is a good number of clusters? → elbow method
![Page 9: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/9.jpg)
II: K-means clustering
Ï k-means clustering for k in [2,70]Ï silhouette coefficient
Ï How close (=similar) is each data point to other points fromits own cluster compared to other clusters?
Ï [−1,+1], higher scores are betterÏ error function
Ï sum of squared distances from the closest centroidskmeans.inertia_
Ï What is a good number of clusters? → elbow method
![Page 10: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/10.jpg)
0 10 20 30 40 50 60 70number of clusters
0.0
0.2
0.4
0.6
0.8
1.0err
or
1e8Sum of squared distances from the centroids by number of clusters.
![Page 11: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/11.jpg)
III: Principal component analysis
Ï remove redundant features
Ï remove noiseÏ train machine learning models more quickly
![Page 12: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/12.jpg)
III: Principal component analysis
Ï remove redundant featuresÏ remove noiseÏ train machine learning models more quickly
![Page 13: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/13.jpg)
1500 1000 500 0 500 1000 1500 20001500
1000
500
0
500
1000
1500
smn
mhr
cym
hrv
sjd
azj pbu
slv
hun
tur
koi
chv
pesrus
bul
eng
tel
yrknio
smsliv
myvkan
vep
kpv
fin
kcalez
dan
ekk
smj
kmr
sme
ben
olo
mdf
sweavasahita
bel
uzn ell
mal
oss
hin
gle
pol
krlslk
sel
lat
che
pormns
nld
bre
catspa
rondeu
bak
lav
fra
ukr
ces
norudm
kaz
mrj
lbedarenf
ddo
isltam
sma
lit
hye
tat
Feature vectors scaled down to 2 dimensions.
![Page 14: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/14.jpg)
1500 1000 500 0 500 1000 1500 20001500
1000
500
0
500
1000
1500
tat
livekk
fra
swe
myv
tam
lbebelcesvep
kmr spacatrus gle
smssmn
deu
hrv
mrj
mdf
yrknio
olo
polhye
che
smjsme
sjd
dar
avaddo
fin
benell
lat
bre
kpv
selazj sah
uzn
dan
lez
isl
cym
oss
mhr
kaz
slv
mal
koi
lit
lav
bak
itakcatel
nor
tur
hun
kanengudm
sma
ron
hin
nld por
bul
krl
pbu
enf
chvmns
pes
slk
ukr
tat
livekk
fra
swe
myv
tam
lbebelcesvep
kmr spacatrus gle
smssmn
deu
hrv
mrj
mdf
yrknio
olo
polhye
che
smjsme
sjd
dar
avaddo
fin
benell
lat
bre
kpv
selazj sah
uzn
dan
lez
isl
cym
oss
mhr
kaz
slv
mal
koi
lit
lav
bak
itakcatel
nor
tur
hun
kanengudm
sma
ron
hin
nld por
bul
krl
pbu
enf
chvmns
pes
slk
ukr
Feature vectors scaled down to 2 dimensions.
TurkicIndo-EuropeanUralicNakh-DaghestanianDravidian
![Page 15: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/15.jpg)
III: Principal component analysis
pca = PCA(features.shape[1])d = 0var_explained = 0while var_explained < 0.9:
var_explained += pca.explained_variance_ratio_[d]d += 1
featuresPCA = PCA(d).fit_transform(features)
pca = PCA(0.9)print(pca.n_components_)
![Page 16: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/16.jpg)
III: Principal component analysis
pca = PCA(features.shape[1])d = 0var_explained = 0while var_explained < 0.9:
var_explained += pca.explained_variance_ratio_[d]d += 1
featuresPCA = PCA(d).fit_transform(features)
pca = PCA(0.9)print(pca.n_components_)
![Page 17: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/17.jpg)
0 5 10 15 20component
0.0
0.2
0.4
0.6
0.8
1.0vari
ance
expla
ined
Variance explained per PCA component.
variance explained (cumulative)variance explained per component
![Page 18: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/18.jpg)
IV: Evaluation with gold-standard labels
n_fam = len(set(family))pred_all = KMeans(n_fam).fit_predict(features)pred_pca = KMeans(n_fam).fit_predict(featuresPCA)
lang all pca family---------------------------------------------------kan 2 4 Dravidiantam 3 0 Dravidiantel 4 0 Dravidianmal 4 2 Dravidianbul 0 1 Indo-Europeances 0 1 Indo-European...
![Page 19: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/19.jpg)
IV: Evaluation with gold-standard labels
n_fam = len(set(family))pred_all = KMeans(n_fam).fit_predict(features)pred_pca = KMeans(n_fam).fit_predict(featuresPCA)
lang all pca family---------------------------------------------------kan 2 4 Dravidiantam 3 0 Dravidiantel 4 0 Dravidianmal 4 2 Dravidianbul 0 1 Indo-Europeances 0 1 Indo-European...
![Page 20: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/20.jpg)
IV: Evaluation with gold-standard labels
Ï Homogeneity: Each cluster contains data points of the samegold-standard class.
Ï Completeness: All members of a gold-standard class are inthe same cluster.
Ï V-measure: Harmonic mean of homogeneity andcompleteness.
Ï [0,1] higher is better
all H: 0.1707 C: 0.1461 V: 0.1575PCA H: 0.1728 C: 0.1572 V: 0.1646
![Page 21: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/21.jpg)
IV: Evaluation with gold-standard labels
Ï Homogeneity: Each cluster contains data points of the samegold-standard class.
Ï Completeness: All members of a gold-standard class are inthe same cluster.
Ï V-measure: Harmonic mean of homogeneity andcompleteness.
Ï [0,1] higher is better
all H: 0.1707 C: 0.1461 V: 0.1575PCA H: 0.1728 C: 0.1572 V: 0.1646
![Page 22: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/22.jpg)
IV: Evaluation with gold-standard labels
Ï Homogeneity: Each cluster contains data points of the samegold-standard class.
Ï Completeness: All members of a gold-standard class are inthe same cluster.
Ï V-measure: Harmonic mean of homogeneity andcompleteness.
Ï [0,1] higher is better
all H: 0.1707 C: 0.1461 V: 0.1575PCA H: 0.1728 C: 0.1572 V: 0.1646
![Page 23: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/23.jpg)
IV: Evaluation with gold-standard labels
Ï Homogeneity: Each cluster contains data points of the samegold-standard class.
Ï Completeness: All members of a gold-standard class are inthe same cluster.
Ï V-measure: Harmonic mean of homogeneity andcompleteness.
Ï [0,1] higher is better
all H: 0.1707 C: 0.1461 V: 0.1575PCA H: 0.1728 C: 0.1572 V: 0.1646
![Page 24: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/24.jpg)
V: Calculating distances
A B C D E123 452 10 572 A
342 370 908 B127 754 C
23 DE
![Page 25: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/25.jpg)
VI: Hierarchical clustering
for m in ['single', 'complete', 'average']:fig, ax = plt.subplots()z = scipy.cluster.hierarchy.linkage(dist, method=m)scipy.cluster.hierarchy.dendrogram(z, labels=languages)fig.savefig('dendrogram-{}.pdf'.format(method))
![Page 26: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/26.jpg)
pes
olo
mal
che
sjd
sel
nio
oss yrk
kpv
hin
spa
dar
ava
por
tel
ddo
hye
slk
bul
koi
ukr is
ltu
renf
sah
mdf
bre
hun
chv
bak
smn
vep
azj
sma
smj
nld
kmr
lbe
slv
bel
litta
mka
zm
rjeng
sme
lav
fin
krl
lez
tat
ron
cat
hrv
mhr
cym fra
lat
livben
uzn kca
kan
nor
sms
myv
gle
mns
udm
ekk ita ell
dan
rus
ces
swe
pol
deu
pbu
0
1000
2000
3000
4000
5000
6000
7000Hierarchical clustering with single linking
![Page 27: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/27.jpg)
che
nio
oss sjd
sel
bre
hun
smn
vep
kpv
nld
kmr
tur
sah
mdf
enf
chv
bak
pes
dan
rus
ces
olo
yrk lbe
slv
bel
azj
sma
smj
udm
ekk hrv
cym
myv
gle
mns
ita ell
swe
pol
deu
pbu
mal
tam
uzn kca
kan
fra
lat
ben
sms
nor
livm
hr
dar
ava
por
tel
koi
ukr is
lhye
slk
bul
ddo
spa
hin
sme
lav
fin
krl
lez
tat
ron
cat lit
kaz
mrj
eng
0
5000
10000
15000
Hierarchical clustering with complete linking
![Page 28: Assignment 4 - Clustering Languages · ita bel uzn ell mal oss hin gle pol krl slk sel lat che por mns nld bre cat spa deu ron bak lav fra ukr ces nor udm kaz mrj lbe enf dar ddo](https://reader036.vdocuments.us/reader036/viewer/2022071418/611576ff7fef7877a33a8582/html5/thumbnails/28.jpg)
mal
hrv
cym
myv
gle
ben
uzn kca
kan liv
sms
nor
mhr
fra
lat
olo
yrk
udm
ekk
dan
rus
ces
swe
pol
deu
pbu
mns
ita ell
koi
ukr is
ldar
ava
por
tel
hye
slk
bul
ddo
spa
tam
kaz
mrj
eng lit lez
tat
ron
cat
hin
sme
lav
fin
krl
pes
sjd
sel
che
nio
oss
smn
vep
kpv
nld
kmr
tur
enf
sah
mdf
bre
hun
chv
bak
lbe
slv
bel
azj
sma
smj0
2000
4000
6000
8000
10000
Hierarchical clustering with average linking