deconstructing data science -...

57
Deconstructing Data Science David Bamman, UC Berkeley Info 290 Lecture 5: Clustering overview Jan 31, 2016

Upload: others

Post on 23-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Deconstructing Data ScienceDavid Bamman, UC Berkeley

Info 290

Lecture 5: Clustering overview

Jan 31, 2016

Page 2: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Clustering

• Clustering (and unsupervised learning more generally) finds structure in data, using just X

X = a set of skyscrapers

Page 3: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Unsupervised Learning

Le et al. (2012), “Building High-level Features Using Large Scale Unsupervised Learning” (ICML)

Page 4: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Netflix

Amazon

Twitter

New York Times

Page 5: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Unsupervised Learning• Matrix completion (e.g., user recommendations on

Netflix, Amazon)

Ann Bob Chris David Erik

Star Wars 5 5 4 5 3

Bridget Jones 4 4 1

Rocky 3 5

Rambo ? 2 5

Page 6: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

task 𝓧

learn patterns that define architectural styles set of skyscrapers

learn patterns that define genre set of books

learn patterns that suggest “types” of customer behavior customer data

Page 7: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Topic models

Probabilistic graphical models

Networks

Deep learning

K-means clustering

Hierarchical clustering

Methods differ in the kind of structure learned

Page 8: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Hierarchical Clustering• Hierarchical order

among the elements being clustered

Page 9: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Shakespeare’s plays

Witmore (2009)http://winedarksea.org/?p=519

Dendrogram

Page 10: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Bottom-up clustering

Page 11: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Similarity

• What are you comparing?

• How do you quantify the similarity/difference of those things?

P(X )⇥ P(X ) ! R

Page 12: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Probability

the a dog cat runs to store

0.0

0.2

0.4

Page 13: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Unigram probability

the a of love sword poison hamlet romeo king capulet be woe him most

0.00

0.06

0.12

the a of love sword poison hamlet romeo king capulet be woe him most

0.00

0.06

0.12

Page 14: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Similarity

Euclidean =

vuutvocabX

i

�PHamleti

� PRomeoi

�2

Cosine similarity, Jensen-Shannon divergence…

Page 15: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Cluster similarity

Page 16: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Cluster similarity

• Single link: two most similar elements • Complete link: two least similar elements • Group average: average of all members

Page 17: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Flat Clustering• Partitions the data into a set of K clusters

A

B

C

Page 18: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Flat Clustering• Partitions the data into a set of K clusters

Page 19: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

K-means

Page 20: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

K-means

Page 21: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Representation

• This is a huge decision that impacts what you can learn

[x is a data point characterized by F real numbers, one for each feature]

x � RF

Page 22: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Voting behavior

Yes on abortion access 1

Yes on expanding gun

rights0

Yes on tax breaks 0

Yes on ACA 1

Yes on abolishing IRS 0

x � R5

Page 23: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

First letter of last name

Last name starts with < “A” 0

Last name starts with < “B” 0

Last name starts with < “C” 1

Last name starts with < “D” 1

… 1

Last name starts with < “Z” 1

x � R26

Page 24: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

task 𝓧

learn patterns that define architectural styles set of skyscrapers

learn patterns that define genre set of books

learn patterns that suggest “types” of customer behavior customer data

Representation

Page 25: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Evaluation

• Much more complex than supervised learning since there’s often no notion of “truth”

Page 26: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Internal criteria

• Elements within clusters should be more similar to each other

• Elements in different clusters should be less similar to each other

Page 27: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

External criteria

• How closely does your clustering reproduce another (“gold standard”) clustering?

Page 28: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

Learned clusters

Comparison clusters

Page 29: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Evaluation: Purity• Learned clusters

(as learned by our algorithm)

• External clusters(from some external source)

=1N

kmax

j|gk � cj|

G = {g1 . . . gk}

C = {c1 . . . cj}

Purity

Page 30: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

Learned (G)

External (C)

=1N

kmax

j|gk � cj|

Page 31: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

=1N

kmax

j|gk � cj|

Learned (G)

External (C)

Page 32: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

=1N

kmax

j|gk � cj|

Learned (G)

External (C)

Page 33: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

=1N

kmax

j|gk � cj|

Learned (G)

External (C)

Page 34: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

A B C

(1 + 1 + 2) / 7 = .57

Learned (G)

External (C)

Page 35: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Evaluation: Rand IndexEvery pair of data points is either in the same external cluster, or it’s not. = binary classification

Page 36: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

same cluster?

Rubio Paul 1

Rubio Cruz 1

Rubio Trump 0

Rubio Fiorina 0

Rubio Clinton 0

Rubio Sanders 0

Paul Cruz 1

Paul Trump 0

Rand Index

Page 37: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Rand Index

samecluster

differentcluster

samecluster

differentcluster

Predicted (ŷ)

True

(y)

21 decisionsN(N� 1)/2

Page 38: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Learned

External

samecluster

differentcluster

samecluster

differentcluster

Predicted (ŷ)Tr

ue (y

)

Page 39: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Rand Indexsamecluster

differentcluster

samecluster 1 4

differentcluster 4 12

Predicted (ŷ)

True

(y)

From the confusion matrix, we can calculate standard

measures from binary classification

The Rand Index = accuracy

(1 + 12) / 21 = .619

Page 40: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Example

Clustering characters into distinct types

Page 41: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

The Villain• Does (agent): kill,

hunt, severs, chokes

• Has done to them (patient): fights, defeats, refuses

• Is described as (attribute): evil, frustrated, lord

Page 42: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

• Is character in the movie “Star Wars”

• Science Fiction, Adventure, Space Opera, Fantasy, Family Film, Action

• Is played by David Prowse • Male • 42 years old in 1977

The Villain

Page 43: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Task

Data Source

42,306 movie plot summaries Wikipedia

15,099 English novels (1700-1899) HathiTrust

Learning character types from textual descriptions of characters.

Page 44: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Personas

attribute dark major henchman warrior sergeant

agent shoot aim overpower interrogate kill

Jason Bourne, Bourne Supremacy

• Male • Action • War film

Highest weighted features:

Page 45: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Personas

patient capture corner transport imprison trap

agent infiltrate deduce leap evade obtain

agent flee escape swim hide manage

Ginormica (Monsters vs. Aliens)

• Female • Action • Adventure

Highest weighted features:

Page 46: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Evaluation I: Names

• Gold clusters: characters with the same name (sequels, remakes)

• Noise: “street thug”

• 970 unique character names used twice in the data; n=2,666

Page 47: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Evaluation II: TV Tropes

• Gold clusters: manually clustered characters from www.tvtropes.com

• “The Surfer Dude” • “Arrogant Kung-Fu Guy” • “Hardboiled Detective” • “The Klutz” • “The Valley GIrl”

• 72 character tropes containing 501 characters

Page 48: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Purity: Names

0

17.5

35

52.5

70

25x25 25x50 25x100 50x25 50x50 50x100

Persona Regression Dirichlet Persona

Page 49: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Purity: TV Tropes

0

17.5

35

52.5

70

25x25 25x50 25x100 50x25 50x50 50x100

Persona Regression Dirichlet Persona

Page 50: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

task 𝓧

learn patterns that define architectural styles set of skyscrapers

learn patterns that define genre set of books

learn patterns that suggest “types” of customer behavior customer data

Evaluation

Page 51: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Digital Humanities

• Marche (2012), Literature Is not Data: Against Digital Humanities

• Underwood (2015), Seven ways humanists are using computers to understand text.

Page 52: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Text visualization

Page 53: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Characteristic vocabulary

Characteristic words by William Wordsworth (in comparison to other contemporary poets) [Underwood 2015]

Page 54: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Finding and organizing texts

• e.g., finding all examples of a complex literary form (Haiku).

• Supplement traditional searches: book catalogues, search engines.

Page 55: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Modeling literary forms

• What features of a text are predictive of Haiku?

Page 56: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Modeling social boundaries

Predicting reviewed texts [Underwood and Sellers (2015)]

Page 57: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/5_clustering.pdf · Lecture 5: Clustering overview Jan 31, 2016. Clustering • Clustering (and

Unsupervised modeling