I256: Applied Natural Language Processing
Marti Hearst, Nov 6, 2006


TRANSCRIPT

Page 1: I256:  Applied Natural Language Processing

I256: Applied Natural Language Processing

Marti Hearst
Nov 6, 2006

Page 2: I256:  Applied Natural Language Processing

Today

Text Clustering
Latent Semantic Indexing (LSA)

Page 3: I256:  Applied Natural Language Processing

Text Clustering

Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others

Page 4: I256:  Applied Natural Language Processing

Text Clustering

Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw

[Scatter plot of documents in a two-dimensional term space: axes Term 1 and Term 2]

Page 5: I256:  Applied Natural Language Processing

Text Clustering

Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw

[The same Term 1 / Term 2 scatter plot, with the documents grouped into clusters]

Page 6: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Clustering Applications

Find semantically related words by combining similarity evidence from multiple indicators
Try to find overall trends or patterns in text collections

Page 7: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

"Training" in Clustering

Clustering is an unsupervised learning method
For each data set, a totally fresh solution is constructed
Therefore, there is no training
However, we often use some data for which we have additional information on how it should be partitioned, in order to evaluate the performance of the clustering method

Page 8: I256:  Applied Natural Language Processing

Pair-wise Document Similarity

[Term-weight table: documents A–D described by weights on the terms nova, galaxy, heat, h'wood, film, role, diet, fur; e.g., A has weights 1, 3, 1 on nova, galaxy, heat and B has weights 5, 2 on nova, galaxy, while C and D are weighted on the remaining terms]

How to compute document similarity?

Page 9: I256:  Applied Natural Language Processing

Pair-wise Document Similarity (no normalization for simplicity)

[Same term-weight table for documents A–D as on the previous slide]

$$D_1 = (w_{11}, w_{12}, \ldots, w_{1t}) \qquad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$$

$$\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i}$$

Worked out on the example table:
sim(A,B) = (1 × 5) + (3 × 2) = 11
sim(A,C) = 0, sim(A,D) = 0, sim(B,C) = 0, sim(B,D) = 0
sim(C,D) = (2 × 4) + (1 × 1) = 9
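To make the unnormalized measure concrete, here is a minimal Python sketch (not from the slides) of the dot-product similarity over sparse term-weight dictionaries; the function name is illustrative, and the weights for A and B follow the worked example above.

```python
# Minimal sketch: unnormalized pair-wise similarity as a dot product
# over sparse term-weight dictionaries.

def sim(d1, d2):
    """Sum of weight products for the terms the two documents share."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

# Weights follow the worked example above.
A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
print(sim(A, B))  # (1 * 5) + (3 * 2) = 11
```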

Page 10: I256:  Applied Natural Language Processing

Pair-wise Document Similarity (cosine normalization)

$$D_1 = (w_{11}, w_{12}, \ldots, w_{1t}) \qquad D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$$

$$\mathrm{sim}_{\text{unnormalized}}(D_1, D_2) = \sum_{i=1}^{t} w_{1i}\, w_{2i}$$

$$\mathrm{sim}_{\text{cosine}}(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i}\, w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2}\; \sqrt{\sum_{i=1}^{t} w_{2i}^2}}$$
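A small sketch of the cosine-normalized version under the same sparse term-weight representation (names and example weights are illustrative, matching the earlier example):

```python
# Minimal sketch: cosine-normalized similarity -- divide the dot product
# by the Euclidean norms of the two term-weight vectors.
import math

def dot(d1, d2):
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

def cosine_sim(d1, d2):
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot(d1, d2) / (norm1 * norm2)

A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
print(cosine_sim(A, B))  # 11 / (sqrt(11) * sqrt(29)) ~= 0.62
```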

Page 11: I256:  Applied Natural Language Processing

Document/Document Matrix

$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}, \qquad d_{ij} = \text{similarity of } D_i \text{ to } D_j$$

Page 12: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Hierarchical clustering methods

Agglomerative or bottom-up:
Start with each sample in its own cluster
Merge the two closest clusters
Repeat until one cluster is left

Divisive or top-down:
Start with all elements in one cluster
Partition one of the current clusters in two
Repeat until all samples are in singleton clusters

Page 13: I256:  Applied Natural Language Processing

Agglomerative Clustering

[Dendrogram over the items A B C D E F G H I]

Page 14: I256:  Applied Natural Language Processing

Agglomerative Clustering

[Dendrogram over the items A B C D E F G H I]

Page 15: I256:  Applied Natural Language Processing

Agglomerative Clustering

[Dendrogram over the items A B C D E F G H I]

Page 16: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Merging Nodes

Each node is a combination of the documents combined below it
We represent the merged nodes as a vector of term weights
This vector is referred to as the cluster centroid

Page 17: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Merging criteria

We need to extend the distance measure from samples to sets of samples.

The complete linkage method: $d(A,B) = \max_{x \in A,\, y \in B} d(x,y)$

The single linkage method: $d(A,B) = \min_{x \in A,\, y \in B} d(x,y)$

The average linkage method: $d(A,B) = \dfrac{1}{|A|\,|B|} \sum_{x \in A,\, y \in B} d(x,y)$
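Below is a sketch of the bottom-up (agglomerative) procedure from slide 12 with the three linkage criteria above as interchangeable functions; the data, distance function, and all names are illustrative, not taken from the slides.

```python
# Sketch of bottom-up (agglomerative) clustering with pluggable linkage.
# `samples` is a list of points; `d` is a pair-wise distance function.

def single_link(A, B, d):
    return min(d(x, y) for x in A for y in B)

def complete_link(A, B, d):
    return max(d(x, y) for x in A for y in B)

def average_link(A, B, d):
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

def agglomerative(samples, d, linkage, k=1):
    clusters = [[s] for s in samples]            # start: singleton clusters
    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], d),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters

# Usage: cluster 1-D points into two groups with single linkage.
pts = [1.0, 1.2, 1.1, 8.0, 8.3]
print(agglomerative(pts, lambda a, b: abs(a - b), single_link, k=2))
```

Passing `complete_link` or `average_link` instead makes the same loop implement the other merging criteria.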

Page 18: I256:  Applied Natural Language Processing

Single-link merging criteria

Merge closest pair of clusters:
Single-link: clusters are close if any of their points are
dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B

[Diagram: each word type is a single-point cluster; the closest clusters are merged step by step]

Page 19: I256:  Applied Natural Language Processing

Bottom-Up Clustering – Single-Link

Fast, but tends to get long, stringy, meandering clusters ...

Page 20: I256:  Applied Natural Language Processing

Bottom-Up Clustering – Complete-Link

Again, merge closest pair of clusters:
Complete-link: clusters are close only if all of their points are
dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B

[Diagram: distance between clusters]

Page 21: I256:  Applied Natural Language Processing

Bottom-Up Clustering – Complete-Link

[Diagram: distance between clusters]

Slow to find closest pair – need quadratically many distances

Page 22: I256:  Applied Natural Language Processing

K-Means Clustering

1. Decide on a pair-wise similarity measure
2. Find K centers using agglomerative clustering
   (take a small sample; group bottom-up until K groups are found)
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary
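A minimal sketch of steps 3–4 (assign each vector to the nearest center, then repeat); it recomputes each center as its cluster mean and simply takes the initial centers as given, rather than deriving them from an agglomerative pass on a sample as step 2 describes. All names and data are illustrative.

```python
# Sketch of the assign-and-repeat loop of K-means on dense vectors.
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(vectors, centers, iterations=10):
    for _ in range(iterations):
        # step 3: assign each vector to its nearest center
        clusters = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)), key=lambda i: dist(v, centers[i]))
            clusters[nearest].append(v)
        # recompute each center as the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return clusters, centers

docs = [(0.0, 1.0), (0.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
print(kmeans(docs, centers=[[0.0, 0.0], [5.0, 5.0]]))
```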

Page 23: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

k-Medians

Similar to k-means, but instead of calculating the means across features, it selects as $c_i$ the sample in cluster $C_i$ that minimizes $\sum_{x \in C_i} d(x, c_i)$ (the median)

Advantages:
Does not require feature vectors
Distance between samples is always available
Statistics with medians are more robust than statistics with means
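A sketch of the center-selection rule above: the new center is the cluster member with the smallest summed distance to the rest, so only pair-wise distances are needed (sample data and names are illustrative).

```python
# Sketch: pick the cluster member (the "median" element) that minimizes
# the summed distance to all other members.

def medoid(cluster, d):
    return min(cluster, key=lambda c: sum(d(x, c) for x in cluster))

# Usage with a simple absolute-difference distance on 1-D samples.
print(medoid([1.0, 2.0, 2.5, 3.0, 9.0], d=lambda a, b: abs(a - b)))  # 2.5
```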

Page 24: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Choosing k

In both hierarchical and k-means/medians clustering, we need to be told where to stop, i.e., how many clusters to form
This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)
It would be nice if we could find an optimal k from the data
We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters (see the sketch below)
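One hedged way to automate that search, assuming scikit-learn is available (the slides do not prescribe a particular tool or score): run k-means for several values of k and compare a separation score such as the silhouette coefficient.

```python
# Sketch: try several k and report a separation score for each.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[0.0, 1.0], [0.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 0.1], [8.8, 0.3]]

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher = better-separated clusters
```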

Page 25: I256:  Applied Natural Language Processing

Scatter/Gather: Clustering a Large Text Collection

Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95

Cluster sets of documents into general “themes”, like a table of contents

Display the contents of the clusters by showing topical terms and typical titles

User chooses subsets of the clusters and re-clusters the documents within

Resulting new groups have different “themes”

Page 26: I256:  Applied Natural Language Processing

S/G Example: query on "star" (encyclopedia text)

14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous

Clustering and re-clustering is entirely automated

Page 27: I256:  Applied Natural Language Processing

Page 28: I256: Applied Natural Language Processing

Page 29: I256: Applied Natural Language Processing

Page 30: I256: Applied Natural Language Processing

Clustering Retrieval Results

Tends to place similar docs together
So it can be used as a step in relevance ranking
But not great for showing to users

Exception: good for showing what to throw out!

Page 31: I256:  Applied Natural Language Processing

Another use of clustering

Use clustering to map the entire huge multidimensional document space into a huge number of small clusters
"Project" these onto a 2D graphical representation
Looks neat, but doesn't work well as an information retrieval interface

Page 32: I256:  Applied Natural Language Processing

Clustering Multi-Dimensional Document Space
(image from Wise et al. 95)

Page 33: I256:  Applied Natural Language Processing

How to evaluate clusters?

In practice, it's hard to do
Different algorithms' results look good and bad in different ways
It's difficult to distinguish their outcomes

In theory, define an evaluation function
Typically choose something easy to measure (e.g., the sum of the average distance in each class)

Page 34: I256:  Applied Natural Language Processing

Slide by Inderjit S. Dhillon

Two Types of Document Clustering

Grouping together of "similar" objects

Hard Clustering -- Each object belongs to a single cluster

Soft Clustering -- Each object is probabilistically assigned to clusters

Page 35: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Soft clustering

A variation of many clustering methods
Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
So a sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6

Page 36: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Application: Clustering of adjectives

Cluster adjectives based on the nouns they modify
Multiple syntactic clues for modification
The similarity measure is Kendall's τ, a robust measure of similarity
Clustering is done via a hill-climbing method that minimizes the combined average dissimilarity

Predicting the semantic orientation of adjectives,

V Hatzivassiloglou, KR McKeown, EACL 1997

Page 37: I256:  Applied Natural Language Processing

Slide by Vasileios Hatzivassiloglou

Clustering of nouns

Work by Pereira, Tishby, and Lee
Dissimilarity is KL divergence
Asymmetric relationship: nouns are clustered; verbs which have the nouns as objects serve as indicators
Soft, hierarchical clustering

Page 38: I256:  Applied Natural Language Processing


Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

Page 39: I256:  Applied Natural Language Processing


Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

Page 40: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

Latent Semantic Analysis

Mathematical/statistical technique for extracting and representing the similarity of meaning of words
Represents word and passage meaning as high-dimensional vectors in the semantic space
Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage meaning
Its success depends on:
Sufficient scale and sampling of the data it is given
Choosing the right number of dimensions to extract

Page 41: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

LSA Characteristics

Why is reducing dimensionality beneficial?
Some words with similar occurrence patterns are projected onto the same dimension
Closely mimics human judgments of meaning similarity

Page 42: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

Sample Applications of LSA: Essay Grading

LSA is trained on a large sample of text from the same domain as the topic of the essay
Each essay is compared to a large set of essays scored by experts, and a subset of the most similar is identified by LSA
The target essay is assigned a score consisting of a weighted combination of the scores for the comparison essays

Page 43: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

Sample Applications of LSA

Prediction of differences in comprehensibility of texts
By using conceptual similarity measures between successive sentences, LSA has predicted comprehension test results with students

Evaluate and give advice to students as they write and revise summaries of texts they have read

Assess psychiatric status
By representing the semantic content of answers to psychiatric interview questions

Page 44: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

Sample Applications of LSA: Improving Information Retrieval

Use LSA to match users' queries with documents that have the desired conceptual meaning
Not used in practice – it doesn't help much when you have large corpora to match against, but it may be helpful for a few difficult queries and for term expansion

Page 45: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

LSA intuitions

Implements the idea that the meaning of a passage is the sum of the meanings of its words:
meaning of word1 + meaning of word2 + ... + meaning of wordn = meaning of passage

This "bag of words" function treats a passage as an unordered set of word tokens whose meanings are additive
By creating an equation of this kind for every passage of language that a learner observes, we get a large system of linear equations

Page 46: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

LSA Intuitions

However:
Too few equations to specify the values of the variables
Different values for the same variable (natural, since meanings are vague or multiple)

Instead of finding absolute values for the meanings, they are represented in a richer form (vectors)

Use of SVD (reduces the linear system into multidimensional vectors)

Page 47: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis: A trick from Information Retrieval

Each document in the corpus is a length-k vector
– Or each paragraph, or whatever

[Diagram: a single document as a count vector (0, 3, 3, 1, 0, 7, ..., 1, 0) indexed by the vocabulary (aardvark, abacus, abandoned, abbot, abduct, above, ..., zygote, zymurgy)]

Page 48: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis: A trick from Information Retrieval

Each document in the corpus is a length-k vector
Plot all documents in the corpus

[Two plots: the true plot in k dimensions and a reduced-dimensionality plot]

Page 49: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

Reduced plot is a perspective drawing of the true plot
It projects the true plot onto a few axes – a best choice of axes that shows the most variation in the data
Found by linear algebra: "Singular Value Decomposition" (SVD)

[Two plots: the true plot in k dimensions and the reduced-dimensionality plot]

Page 50: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

SVD plot allows the best possible reconstruction of the true plot
(i.e., can recover 3-D coordinates with minimal distortion)
Ignores variation in the axes that it didn't pick out
Hope that variation's just noise and we want to ignore it

[Two plots: the true plot over axes word 1, word 2, word 3, and the reduced-dimensionality plot over axes theme A and theme B]

Page 51: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

SVD finds a small number of theme vectors
Approximates each doc as a linear combination of themes
Coordinates in the reduced plot = linear coefficients
How much of theme A is in this document? How much of theme B?
Each theme is a collection of words that tend to appear together

[Two plots: the true plot in k dimensions and the reduced-dimensionality plot over theme A and theme B]

Page 52: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

Another perspective (similar to neural networks):

[Diagram: documents 1-7 connected to terms 1-9 through a matrix of strengths (how strong is each term in each document?)]

Each connection has a weight given by the matrix.

Page 53: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

Which documents is term 5 strong in?
Docs 2, 5, 6 light up strongest.

[Diagram: the documents 1-7 / terms 1-9 network]

Page 54: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

Which documents are terms 5 and 8 strong in?
This answers a query consisting of terms 5 and 8!
Really just matrix multiplication: term vector (query) × strength matrix = doc vector

[Diagram: the documents 1-7 / terms 1-9 network]

Page 55: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

Conversely, what terms are strong in document 5?
This gives doc 5's coordinates!

[Diagram: the documents 1-7 / terms 1-9 network]

Page 56: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

SVD approximates by a smaller 3-layer network
Forces sparse data through a bottleneck, smoothing it

[Diagram: the full documents 1-7 / terms 1-9 network next to a 3-layer network with a small "themes" layer in between]

Page 57: I256:  Applied Natural Language Processing

Slide by Jason Eisner

Latent Semantic Analysis

I.e., smooth sparse data by matrix approximation: M ≈ A B
A encodes the camera angle, B gives each doc's new coordinates

[Diagram: the terms-by-documents matrix M factored into A and B, drawn as the 3-layer documents/themes/terms network]

Page 58: I256:  Applied Natural Language Processing

Slide by Kostas Kleisouris

How LSA works

Takes as input a corpus of natural language
The corpus is parsed into meaningful passages (such as paragraphs)
A matrix is formed with passages as rows and words as columns; cells contain the number of times that a given word is used in a given passage
The cell values are transformed into a measure of the information about the passage identity that they carry
SVD is applied to represent the words and passages as vectors in a high-dimensional semantic space

Page 59: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

Represent text as a matrix (words × documents)

            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1

A[i,j] = number of occurrences of word i in document j

Page 60: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

SVD: A = T S D'

[Diagram: the n × m matrix A factored as T (n × min(n,m)) × S (min(n,m) × min(n,m)) × D' (min(n,m) × m)]

Reduce dimensionality to k and compute A1:

[Diagram: A1 (n × m) = T1 (n × k) × S1 (k × k) × D1' (k × m)]

A1 is the best least-squares approximation of A by a matrix of rank k
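Here is a sketch of the decomposition and the rank-k approximation in NumPy, using the cosmonaut/astronaut term-document matrix from the earlier slide. SVD is only determined up to the sign of each dimension, so the printed factors may differ in sign from the T and D' matrices on the next slides.

```python
# Sketch: SVD of the term-document matrix from the slides, and its
# rank-2 (k = 2) least-squares approximation, using NumPy.
import numpy as np

# rows: cosmonaut, astronaut, moon, car, truck; columns: d1..d6
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)   # A = T S D'
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]        # best rank-k approximation of A

print(np.round(T[:, :k], 2))    # compare with the Matrix T slide (signs may flip)
print(np.round(Dt[:k, :], 2))   # compare with the Matrix D' slide (signs may flip)
```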

Page 61: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

Matrix T

             dim1   dim2   dim3   dim4   dim5
cosmonaut   -0.44  -0.30   0.57   0.58   0.25
astronaut   -0.13  -0.33  -0.59   0.00   0.73
moon        -0.48  -0.51  -0.37   0.00  -0.61
car         -0.70   0.35   0.15  -0.58   0.16
truck       -0.26   0.65  -0.41   0.58  -0.09

T is the term matrix. Rows of T correspond to rows of the original matrix A.
Dim2 directly reflects the different co-occurrence patterns.

Page 62: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

Matrix D'

               d1     d2     d3     d4     d5     d6
Dimension1   -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
Dimension2   -0.29  -0.53  -0.19   0.63   0.22   0.41
Dimension3    0.28  -0.75   0.45  -0.20   0.12  -0.33
Dimension4   -0.00   0.00   0.58   0.00  -0.58   0.58
Dimension5   -0.53   0.29   0.63   0.19   0.41  -0.22

D is the document matrix. Columns of D' (rows of D) correspond to columns of the original matrix A.
Dim2 directly reflects the different co-occurrence patterns.

Page 63: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

Reevaluating document similarities

B = S1 × D1'
Matrix B is a dimensionality reduction of the original matrix A
Compute the document correlations B' × B:

        d1     d2     d3     d4     d5    d6
d1    1
d2    0.78   1
d3    0.40   0.88   1
d4    0.47  -0.18  -0.62   1
d5    0.74   0.16  -0.32   0.94   1
d6    0.10  -0.54  -0.87   0.93   0.74   1
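A sketch of the same computation in NumPy; "correlation" is taken here as the cosine between the reduced (k = 2) document vectors, which is one reasonable reading of B' × B with normalization.

```python
# Sketch: reduced document representations B = S1 D1' (k = 2) and their
# pair-wise cosines, to compare with the table above.
import numpy as np

A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
B = np.diag(s[:k]) @ Dt[:k, :]          # each column is a document in k dimensions

B_unit = B / np.linalg.norm(B, axis=0)  # normalize each document column to unit length
print(np.round(B_unit.T @ B_unit, 2))   # document-by-document correlation matrix
```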

Page 64: I256:  Applied Natural Language Processing

Slide by Schone, Jurafsky, and Stenchikova

Unfolding new documents

Given a new document, how do we determine which documents it is similar to?

A = T S D'
T' A = T' T S D'
T' A = S D'   (since T' T = I)

q is a new vector in the space of A
q in the reduced space = T' × q
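A sketch of this fold-in step: map a hypothetical new document's term vector q into the reduced space with T' q and compare it to the reduced document vectors (the query vector here is made up for illustration).

```python
# Sketch: fold a new document into the reduced space and score it
# against the existing documents.
import numpy as np

A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2

q = np.array([1, 0, 1, 0, 0], dtype=float)   # hypothetical new doc: "cosmonaut", "moon"
q_k = T[:, :k].T @ q                          # coordinates of q in the k-dimensional space

docs = np.diag(s[:k]) @ Dt[:k, :]             # reduced document vectors (columns)
cosines = (q_k @ docs) / (np.linalg.norm(q_k) * np.linalg.norm(docs, axis=0))
print(np.round(cosines, 2))                   # similarity of q to d1..d6
```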

Page 65: I256:  Applied Natural Language Processing

Next Time

Several takes on blog analysis
Sentiment classification