
Page 1: Latent Semantic Analysis, Probabilistic Topic Models & Associative Memory

Page 2:

The Psychological Problem

How do we learn semantic structure? From covariation between words and the contexts they appear in (e.g., LSA).

How do we represent semantic structure? As semantic spaces (e.g., LSA) or as probabilistic topics.

Page 3:

Latent Semantic Analysis (Landauer & Dumais, 1997)

word-document counts → SVD → high-dimensional semantic space

[Figure: words such as RIVER, STREAM, MONEY, and BANK plotted as points in the reduced space]

Each word is a single point in semantic space. Similarity is measured by the cosine of the angle between word vectors.
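A minimal sketch of this pipeline, assuming a toy word-document count matrix and k = 2 latent dimensions (the actual LSA work used large corpora and roughly 300 dimensions):

```python
# Toy LSA: SVD of a word-document count matrix, then cosine similarity
# between word vectors. Matrix, words, and k = 2 are illustrative
# assumptions, not the slide's actual data.
import numpy as np

words = ["river", "stream", "money", "bank"]
# Rows = words, columns = documents (hypothetical counts).
X = np.array([
    [3.0, 2.0, 0.0, 0.0],   # river
    [2.0, 3.0, 0.0, 0.0],   # stream
    [0.0, 0.0, 3.0, 2.0],   # money
    [1.0, 1.0, 2.0, 3.0],   # bank
])

# Truncated SVD: keep the top-k singular dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
W = U[:, :k] * s[:k]        # each word is one point in the k-dim space

def cosine(a, b):
    """Similarity = cosine of the angle between two word vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, j in [(0, 1), (0, 3), (2, 3)]:
    print(f"cos({words[i]}, {words[j]}) = {cosine(W[i], W[j]):.3f}")
```

With counts like these, RIVER and STREAM should land close together while MONEY stays distant; BANK, which co-occurs with both groups, falls in between.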

Page 4:

Critical Assumptions of Semantic Spaces (e.g. LSA)

Psychological distance should obey three axioms:

Minimality: $d(a,b) \ge d(a,a) = d(b,b) = 0$

Symmetry: $d(a,b) = d(b,a)$

Triangle inequality: $d(a,b) + d(b,c) \ge d(a,c)$

Page 5:

For conceptual relations, violations of the distance axioms are often found.

Similarities can be asymmetric:

"North Korea" is more similar to "China" than vice versa

"Pomegranate" is more similar to "Apple" than vice versa

Violations of the triangle inequality also occur: word pairs AB and BC can both be judged similar while AC is judged very dissimilar, even though Euclidean distance requires $AC \le AB + BC$.

Page 6:

The triangle inequality in semantic spaces might not always hold.

[Figure: word vectors w1 = SOCCER, w2 = PLAY, w3 = THEATER]

Euclidean distance requires $AC \le AB + BC$; for cosine similarity the analogous bound is:

$$\cos(w_1, w_3) \ge \cos(w_1, w_2)\cos(w_2, w_3) - \sin(w_1, w_2)\sin(w_2, w_3)$$
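A short derivation of this bound, assuming all angles lie in $[0, \pi]$:

```latex
% \theta_{ij} is the angle between word vectors w_i and w_j.
% Angles between vectors satisfy the triangle inequality:
\theta_{13} \le \theta_{12} + \theta_{23}
% Cosine is decreasing on [0, \pi], so (while the sum stays in [0, \pi])
% applying it reverses the inequality; the angle-sum formula then gives:
\cos(w_1, w_3) \ge \cos(\theta_{12} + \theta_{23})
              = \cos\theta_{12}\cos\theta_{23} - \sin\theta_{12}\sin\theta_{23}
```

So if PLAY is close to both SOCCER and THEATER, both angles on the right are small and the bound forces cos(SOCCER, THEATER) to be large: the space cannot keep the two senses of PLAY apart.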

Page 7:

Nearest neighbor problem (Tversky & Hutchinson, 1986)

• In similarity data, "Fruit" is the nearest neighbor of 18 out of 20 fruit words

• In a 2D spatial solution, "Fruit" can be the nearest neighbor of at most 5 items

• High-dimensional solutions might solve this, but these are less appealing

Page 8:

Probabilistic Topic Models

A probabilistic version of LSA: no spatial constraints.

Originated in statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003).

Extracts topics from large collections of text.

Topics are interpretable, unlike the arbitrary dimensions of LSA.

Page 9:

DATA: corpus of text (word counts for each document)

Topic Model: find parameters that "reconstruct" the data

The model is generative.

Page 10:

Probabilistic Topic Models

Each document is a probability distribution over topics (distribution over topics = gist)

Each topic is a probability distribution over words

Page 11:

Document generation as a probabilistic process

[Figure: graphical model in which each document's topic mixture generates a topic, and the topic generates a word, for every word slot]

1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
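A runnable sketch of these three steps; the vocabulary, T = 2 topics, document length, and symmetric Dirichlet priors are toy assumptions, not the slide's actual settings:

```python
# Generative process for one document: mixture -> topic -> word.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "loan", "bank", "river", "stream"]
T, V, doc_len = 2, len(vocab), 20
alpha, beta = 1.0, 1.0          # Dirichlet hyperparameters (toy values)

# Mixture components: each topic is a distribution over words.
phi = rng.dirichlet(beta * np.ones(V), size=T)        # shape (T, V)

def generate_document():
    theta = rng.dirichlet(alpha * np.ones(T))  # 1. choose a topic mixture
    tokens = []
    for _ in range(doc_len):
        z = rng.choice(T, p=theta)             # 2. sample a topic for the slot
        w = rng.choice(V, p=phi[z])            # 3. sample a word from the topic
        tokens.append(f"{vocab[w]}{z + 1}")    # tag the word with its topic
    return theta, tokens

theta, tokens = generate_document()
print("mixture weights:", np.round(theta, 2))
print(" ".join(tokens))
```

Inference, on the following slides, runs this process in reverse: it recovers the topics, the mixture weights, and the per-token topic tags from the untagged words alone.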

Page 12:

Example

Mixture components (topics):

TOPIC 1: a bag of words dominated by money, loan, and bank
TOPIC 2: a bag of words dominated by river, stream, and bank

DOCUMENT 2: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 river2 loan1 bank2 stream2 bank2 money1 loan1 river2 stream2 bank2 stream2 bank2 money1 river2 stream2 loan1 bank2 river2 bank2 money1 bank1 stream2 river2 bank2 stream2 bank2 money1

DOCUMENT 1: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 bank1 money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 bank1 money1 stream2

(Superscripts mark the topic that generated each token.)

Mixture weights: DOCUMENT 1 = (.8, .2) over (TOPIC 1, TOPIC 2); DOCUMENT 2 = (.3, .7)

Bayesian approach: use priors. Mixture weights ~ Dirichlet(α); mixture components ~ Dirichlet(β)

Page 13:

Inverting ("fitting") the model

Given only the documents, the per-token topic assignments, the mixture components, and the mixture weights are all unknown:

DOCUMENT 2: river? stream? bank? stream? bank? money? loan? river? stream? loan? bank? river? bank? bank? stream? river? loan? bank? stream? bank? money? loan? river? stream? bank? stream? bank? money? river? stream? loan? bank? river? bank? money? bank? stream? river? bank? stream? bank? money?

DOCUMENT 1: money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? money? stream? bank? money? bank? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? bank? money? stream?

Mixture components: TOPIC 1 = ?, TOPIC 2 = ?
Mixture weights: ?
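One standard way to invert this family of models is collapsed Gibbs sampling over the per-token topic assignments (Griffiths & Steyvers, 2004). A minimal sketch on a toy two-document corpus, with assumed hyperparameters and iteration count:

```python
# Collapsed Gibbs sampling sketch: resample each token's topic from
# P(z_i = t | rest) proportional to P(w_i | t) * P(t | d).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "loan", "bank", "river", "stream"]
docs = [
    "money bank loan bank money bank loan money bank stream".split(),
    "river stream bank stream river bank stream river bank loan".split(),
]
docs = [[vocab.index(w) for w in d] for d in docs]
T, V, alpha, beta = 2, len(vocab), 1.0, 0.01

# Count tables plus random initial topic assignments.
nwt = np.zeros((V, T))                    # word-topic counts
ndt = np.zeros((len(docs), T))            # document-topic counts
z = [[rng.integers(T) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        nwt[w, z[d][i]] += 1
        ndt[d, z[d][i]] += 1

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                   # remove the current assignment
            nwt[w, t] -= 1; ndt[d, t] -= 1
            p = (nwt[w] + beta) / (nwt.sum(axis=0) + V * beta) * (ndt[d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            z[d][i] = t                   # record the new assignment
            nwt[w, t] += 1; ndt[d, t] += 1

print("P(w|t):\n", np.round((nwt + beta) / (nwt.sum(axis=0) + V * beta), 2))
print("P(t|d):\n", np.round((ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha), 2))
```

After burn-in, the count tables give point estimates of the mixture components P(w|t) and the mixture weights P(t|d); on this toy corpus the two topics separate into a money/loan/bank topic and a river/stream/bank topic.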

Page 14:

Application to corpus data

TASA corpus: a representative sample of text from first grade through college.

26,000+ word types (stop words removed); 37,000+ documents; 6,000,000+ word tokens.

Page 15:

Example: topics from an educational corpus (TASA)

37K docs, 26K word types, 1700 topics; six example topics (top words each):

PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS

PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED

TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT

JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL

HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION

STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

Page 16:

Polysemy

(The same six topics as on the previous slide.) The word PLAY has high probability in more than one topic: the theater topic (PLAY PLAYS STAGE AUDIENCE THEATER ...) and the sports topic (TEAM GAME BASKETBALL PLAYERS PLAYER PLAY ...). Because each sense lives in a different topic, the model captures polysemy directly.

Page 17:

Three documents with the word "play" (superscript numbers mark topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...

He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix Beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077 ...

Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 Jim296. Don180 comes040 into the house038. Don180 and Jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166 ...

Page 18:

No Problem of Triangle Inequality

[Figure: SOCCER and FIELD share TOPIC 1; MAGNETIC and FIELD share TOPIC 2]

Topic structure easily explains violations of the triangle inequality: two words can each be strongly related to a third word through different topics without being related to each other.

Page 19:

Applications

Page 20:

Enron email data: 500,000 emails, 5,000 authors, 1999-2002

Page 21:

Enron topics

[Figure: topic proportions over time (2000-2003) for two Enron authors, PERSON1 and PERSON2]

Example topics (top words each):

TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES

GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT

ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE

FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED

POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC

STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU

TIMELINE: May 22, 2000: start of the California energy crisis

Page 22:

Applying Model to Psychological Data

Page 23:

Network of Word Associations

[Figure: association network linking BASEBALL, BAT, BALL, GAME, PLAY, STAGE, and THEATER]

(Association norms from Nelson et al., 1998)

Page 24:

Explaining structure with topics

[Figure: the same association network; topic 1 spans BASEBALL, BAT, BALL, GAME, and PLAY, while topic 2 spans PLAY, STAGE, and THEATER]

Page 25:

Modeling Word Association

Word association modeled as prediction: given that a single word is observed, what other words might occur next?

Under a single-topic assumption:

$$P(w_{n+1} \mid \mathbf{w}) = \sum_z P(w_{n+1} \mid z)\, P(z \mid \mathbf{w})$$

Here the cue $\mathbf{w}$ is the single observed word and $w_{n+1}$ is the response.
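A sketch of this prediction rule in code; the two-topic phi matrix, the vocabulary, and the uniform prior over topics are hand-made assumptions for illustration:

```python
# Word association as prediction:
# P(w_next | cue) = sum_z P(w_next | z) * P(z | cue),
# with P(z | cue) obtained by Bayes' rule under a uniform P(z).
import numpy as np

vocab = ["play", "soccer", "theater", "stage", "ball"]
# phi[z, w] = P(w | z); two hand-set toy topics (sports, drama).
phi = np.array([
    [0.30, 0.35, 0.00, 0.00, 0.35],   # topic 1: sports
    [0.30, 0.00, 0.40, 0.30, 0.00],   # topic 2: drama
])

def associate(cue):
    c = vocab.index(cue)
    p_z = phi[:, c] / phi[:, c].sum()   # P(z | cue), uniform prior on z
    p_next = p_z @ phi                  # sum over topics
    return sorted(zip(vocab, p_next), key=lambda x: -x[1])

for word, p in associate("play"):
    print(f"{word:8s} {p:.3f}")
```

Because "play" is probable under both topics, its predicted associates mix the sports sense (soccer, ball) and the drama sense (theater, stage), just as human associations do.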

Page 26:

Observed associates for the cue "play"

Word     P(word)   Word       P(word)   Word         Cosine   Word
FUN      .141      BALL       .041      KICKBALL     .558     GAME      42
BALL     .134      GAME       .039      VOLLEYBALL   .519     BALL      33
GAME     .074      CHILDREN   .019      GAMES        .492     CHILDREN  30
WORK     .067      ROLE       .014      COSTUMES     .478     SCHOOL    27
GROUND   .060      GAMES      .014      DRAMA        .469     ROLE      25
MATE     .027      MUSIC      .009      ROLE         .465     WANT      24
CHILD    .020      BASEBALL   .009      PLAYWRIGHT   .464     GAMES     23
ENJOY    .020      HIT        .008      FUN          .454     MOTHER    23
WIN      .020      FUN        .008      ACTOR        .448     THINGS    21
ACTOR    .013      TEAM       .008      REHEARSALS   .445     MUSIC     21
FIGHT    .013      IMPORTANT  .006      GAME         .445     HELP      20
HORSE    .013      BAT        .006      ACTORS       .439     FUN       19
KID      .013      RUN        .006      CHECKERS     .431     READ      18
MUSIC    .013      STAGE      .005      MOLIERE      .429     DON       18

HUMANS (first column pair)

Page 27:

Model predictions

(Same table as on the previous page: the first column pair is HUMANS, the second is TOPICS, T = 500.) The first human associate, FUN, has RANK 9 among the topic model's predictions.

Page 28:

Median rank of first associate

[Figure: median rank of the first associate (y-axis, roughly 5-40) for the best LSA cosine, the best LSA inner product, and topic models with 300, 500, 700, 900, 1100, 1300, 1500, and 1700 topics]

Page 29:

Recall: example study list

STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy

FALSE RECALL: “Sleep” 61%

Page 30:

Recall as a reconstructive process

Reconstruct the study list based on the stored "gist". The gist can be represented by a distribution over topics.

Under a single-topic assumption:

$$P(w_{n+1} \mid \mathbf{w}) = \sum_z P(w_{n+1} \mid z)\, P(z \mid \mathbf{w})$$

Here $\mathbf{w}$ is the study list and $w_{n+1}$ is the retrieved word.
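The same rule, now conditioning on a whole list: under the single-topic assumption the gist is $P(z \mid \mathbf{w}) \propto P(z) \prod_i P(w_i \mid z)$. A toy sketch (hand-made phi, abbreviated study list) in which the unstudied word "sleep" comes out on top, mirroring false recall:

```python
# Recall as reconstruction: infer the gist P(z | study list), then
# score every vocabulary word by P(w_next | study list).
import numpy as np

vocab = ["bed", "rest", "dream", "sleep", "soccer", "ball"]
phi = np.array([
    [0.25, 0.20, 0.15, 0.38, 0.01, 0.01],   # toy "sleep" topic
    [0.02, 0.06, 0.02, 0.00, 0.45, 0.45],   # toy "sports" topic
])

study_list = ["bed", "rest", "dream"]
idx = [vocab.index(w) for w in study_list]

p_z = np.prod(phi[:, idx], axis=1)   # likelihood of the list per topic
p_z /= p_z.sum()                     # P(z | study list), uniform prior on z
p_next = p_z @ phi                   # P(w_next | study list)

for w, p in sorted(zip(vocab, p_next), key=lambda x: -x[1]):
    tag = "(extra-list)" if w not in study_list else ""
    print(f"{w:8s} {p:.3f} {tag}")
```

The gist concentrates on the "sleep" topic, so the unstudied word "sleep" receives the highest predicted probability of all, which is exactly the false-recall pattern on the previous slide.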

Page 31:

Predictions for the "Sleep" list

[Figure: predicted $P(w_{n+1} \mid \mathbf{w})$ per word, on a scale from 0 to 0.2. STUDY LIST: BED, REST, TIRED, AWAKE, WAKE, NAP, DREAM, YAWN, DROWSY, BLANKET, SNORE, SLUMBER, PEACE, DOZE. EXTRA LIST (top 8): SLEEP, NIGHT, ASLEEP, MORNING, HOURS, SLEEPY, EYES, AWAKENED]