
Applying Semantic Analyses to Content-based Recommendation and Document Clustering
Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute

Upload: kamal

Post on 25-Feb-2016


DESCRIPTION

Applying Semantic Analyses to Content-based Recommendation and Document Clustering. Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute. Bio: Graduate Student @ Rensselaer Polytechnic Institute; Research Assistant @ Tetherless World Constellation. (PowerPoint presentation.)

TRANSCRIPT

Page 1: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

Applying Semantic Analyses to Content-based Recommendation and Document Clustering

Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute

Page 2: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

2

Bio

• Graduate Student @ Rensselaer Polytechnic Institute
• Research Assistant @ Tetherless World Constellation
• Student Fellow @ Federation of Earth Science Information Partners
• Research Advisor: Peter Fox
• Research Focus: Semantic eScience
• Contact: [email protected]

Page 3: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

3

Outline
• Background
• Semantic Analysis
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – Latent Dirichlet Allocation
• Recommendation Experiment
  – Recommendation Systems
  – Experiment Setup
  – Results
• Clustering Experiment
  – Problem
  – K-Means
  – Results
• Conclusions


Page 4: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

4

Background

• Billions of documents on the Web
• Semi-structured data from Web 2.0 (e.g., tags, microformats)
• Most knowledge remains in unstructured text
• Many natural language techniques for:
  – Ontology extraction
  – Topic extraction
  – Named entity recognition/disambiguation
• Some techniques are better than others for various information retrieval tasks…


Page 5: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

5

Probase

• Developed at Microsoft Research Asia
• Probabilistic knowledge base built from the Bing index and query logs (and other sources)
• Text mining patterns
  – Namely, Hearst patterns: “… artists such as Picasso”
    • Evidence for hypernym(artists, Picasso)
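The Hearst-pattern evidence can be sketched with a toy extractor. This is a hypothetical single-pattern regex for illustration only; the real Probase pipeline mines many patterns at Web scale and aggregates frequencies into probabilities.

```python
import re

# One Hearst pattern ("X such as Y1, Y2, ...") as a minimal sketch.
HEARST_SUCH_AS = re.compile(
    r"(?P<hypernym>\w[\w ]*?) such as (?P<hyponyms>\w[\w, ]*)"
)

def extract_isa_pairs(sentence):
    """Return (hyponym, hypernym) evidence pairs found in one sentence."""
    pairs = []
    m = HEARST_SUCH_AS.search(sentence)
    if m:
        # keep the head noun of the hypernym phrase, e.g. "artists"
        hypernym = m.group("hypernym").split()[-1]
        for hyponym in m.group("hyponyms").split(","):
            pairs.append((hyponym.strip(), hypernym))
    return pairs

print(extract_isa_pairs("famous artists such as Picasso, Matisse"))
```

Each extracted pair is one piece of evidence for a hypernym relation such as hypernym(artists, Picasso); aggregated over a Web-scale corpus, these counts become the probabilistic is-a edges in the knowledge base.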


Page 6: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

6

Probase

[Diagram slide; no recoverable text.]

Page 7: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

7

Probase
• Very capable at conceptualizing groups of entities:
  – “China; India; United States” yields “country”
  – “China; India; Brazil; Russia” yields “emerging market”
• Differentiates attributes and entities
  – “birthday” -> “person” as attribute
  – “birthday” -> “occasion” as entity
• Applications
  – Clustering Tweets from Concepts [Song et al., 2011]
  – Understanding Web Tables
  – Query Expansion (Topic Search)


Page 8: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

8

Research Questions
• What’s the best way of extracting concepts from text?
  – Compare techniques for semantic analysis
• How are extracted concepts useful?
  – Generate data about where semantic analysis techniques are applicable
• Are user ratings affected by the concepts in media items such as movies?
  – Test semantic analysis techniques in recommender systems
• How useful is Web-scale domain knowledge in narrower domains for information retrieval?
  – Identify the need for domain-specific knowledge


Page 9: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

9

Semantic Analysis
• Generating meaning (concepts) from text
• Specifically, get prevalent hypernyms
  – E.g., “… Apple, IBM, and Microsoft …”
  – “technology companies”
• Semantic analysis using external knowledge
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – WordNet Synsets
• Semantic analysis using latent features
  – Latent Dirichlet Allocation
  – Latent Semantic Analysis


Page 10: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

10

Probase Conceptualization

[Diagram: for each document in the corpus, plain text is split into terms (t1, t2, t3, t4, …); Probase maps the terms to candidate concepts (c1, c2, c3, c4, …) via Naïve Bayes or summation; inverse document frequency and filtering then yield the final document concepts.]

Page 11: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

11

Probase Conceptualization

• “Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen) who impresses the other toys and Andy, who starts to like Buzz more than Woody. Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…”


Text Source: Internet Movie Database (IMDb)

Page 12: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

12

Sample Features for “Toy Story” (Probase)

• dvd encryptions 0.050 “RC”
• duty free item 0.044 “toys”
• generic word 0.043 “they, travel, it, …”
• satellite mission 0.032 “reconnaissance mission”
• creator-owned work 0.020 “Woody”
• amazing song 0.013 “fury”
• doubtful word 0.013 “overcome”
• ill-fated tool 0.013 “Buzz”
• lovable “toy story” character 0.011 “Buzz Lightyear, Woody, …”
• pleased star 0.010 “Woody”
• trail builder 0.010 “Woody”


Page 13: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

13

Explicit Semantic Analysis

Image Source: Gabrilovich et al., 2007


Page 14: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

14

Sample Features for “Toy Story” (ESA)

• #REDIRECT [[Buzz!]] 0.034
• #REDIRECT [[The Buzz]] 0.028
• #REDIRECT [[Buzz (comics)]] 0.027
• #REDIRECT [[Buzz cut]] 0.027
• #REDIRECT [[Buzz (DC Thomson)]] 0.024
• #REDIRECT [[Buzz Out Loud]] 0.024
• #REDIRECT [[The Daily Buzz]] 0.023
• #REDIRECT [[Buzz Aldrin]] 0.022
• #REDIRECT [[Buzz cut]] 0.022
• #REDIRECT [[Buzzing Tree Frog]] 0.022


Page 15: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

15

Latent Dirichlet Allocation

• Blei et al., 2003
• Unsupervised learning method
• “Generates” documents from Dirichlet distributions over words and topics
• Topic distributions over documents can be inferred from the corpus

Image Source: Wikipedia


Page 16: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

16

Recommendation Systems

• Collaborative Filtering
  – “Customers who purchased X also purchased Y.”
• Content-based
  – “Because you enjoyed ‘GoldenEye’, you may want to watch ‘Mission: Impossible’.”
• Hybrid
  – Most modern systems take a hybrid approach.


Page 17: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

17

Content-based Recommendation

• In the GoldenEye/Mission: Impossible example…
  – Structured item content
    • Genre – Action/Adventure/Thriller
    • Tags – Action, Espionage, Adventure
  – Unstructured item content
    • Plot synopses – “helicopter, agent, infiltrate, CIA, …”
    • Concepts? – “aircraft, intelligence agency, …”


Page 18: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

18

Recommendation Systems

[Diagram: three approach families (collaborative filtering approaches, structured content-based approaches, unstructured content-based approaches); the semantic analysis approaches are tested in the unstructured content-based setting.]


Page 19: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

19

Experiment

[Diagram: movie synopses from IMDb feed feature generation; the generated features and movie ratings from MovieLens go into the Matchbox recommendation platform, which is evaluated by mean absolute error (MAE).]


Page 20: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

20

Matchbox

Source: Matchbox API Documentation


Page 21: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

21

Experimental Data
• Data: MovieLens dataset [HetRec ’11]
  – 855,598 ratings
  – 10,197 movies
  – 2,113 users
• Movie synopses from IMDb (http://www.imdb.com)
  – Collected synopses for 2,633 movies
  – With 435,043 ratings
  – From 2,113 users
• Ratings data:
  – Scored by half points from 0.5 to 5
• Choose different numbers of movies (200; 1,000; all)
• Train on 90% of ratings, test on remaining 10%


Page 22: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

22

Experimental Data

• Controls
  – Baseline 1: Only features are user IDs and movie IDs
  – Baseline 2: User IDs, movie IDs, movie genre
  – Baseline 3: User IDs, movie IDs, movie tags
• Feature Sets
  – Term Frequency – Inverse Document Frequency (TF-IDF)
  – Latent Dirichlet Allocation
  – Explicit Semantic Analysis
  – Probase Conceptualization


Page 23: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

23

Experimental Setup

• 4 scenarios (training: white, testing: black)

[Diagram: four users × movies grids showing which user/movie combinations are held out for testing in each scenario.]


Page 24: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

24

Results

[Chart: MAE values between 0.56 and 0.595 plotted over an x-axis from 1 to 10 for Baseline #1, Baseline #2, Baseline #3, TFIDF Normalized, Probase Sum, and ESA.]


Page 25: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

25

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.672293      0.71654    0.802044
Baseline 2         0.641556      0.683297   0.752745
Baseline 3         0.655613      0.68994    0.764369
TF-IDF             0.674764      0.706914   0.815245
Probase            0.670694      0.715456   0.797196
ESA                0.670182      0.714967   0.796787
LDA (unfinished)                 0.711307   0.790362


• Testing set contains users and movies not seen in the training set
• Recommendations based on item features alone
• Small amounts of structured data (e.g., genre) are the most influential in this scenario

Page 26: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

26

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.580087      0.564226   0.577349
Baseline 2         0.576183      0.563028   0.576673
Baseline 3         0.575398      0.563378   0.572297
TF-IDF             0.579906      0.575932   0.588288
Probase            0.578889      0.563669   0.578089
ESA                0.579798      0.564334   0.577638
LDA (unfinished)                 0.566639   0.579633


• Testing set contains users not seen in the training set
• Lots of collaborative data available (explains comparable performance across all feature sets)
• Given extensive collaborative data, item features are marginally beneficial (in Matchbox)

Page 27: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

27

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.672843      0.687586   0.832491
Baseline 2         0.639683      0.651141   0.81416
Baseline 3         0.652071      0.66492    0.745593
TF-IDF             0.672362      0.665116   0.844305
Probase            0.670159      0.686235   0.823972
ESA                0.670451      0.683594   0.817306
LDA (unfinished)                 0.684689   0.852056


• Testing set contains movies not seen in the training set
• Recommendations based on item features and extensive information on the users’ “rating model”
• Small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)

Page 28: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

28

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.560163      0.564673   0.568706
Baseline 2         0.556011      0.556456   0.567598
Baseline 3         0.550761      0.561643   0.56445
TF-IDF             0.551909      0.558942   0.588288
Probase            0.556414      0.558113   0.567332
ESA                0.556517      0.55706    0.568174
LDA (unfinished)                 0.558105   0.568927


• Testing set contains users and movies seen in the training set
• Recommendations again are primarily collaborative
• Given a large corpus of rating data for users and items, item features are only marginally beneficial

Page 29: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

29

Results

Experiment         1          2          3          4
Baseline 1         0.672293   0.580087   0.672843   0.560163
Baseline 2         0.641556   0.576183   0.639683   0.556011
Baseline 3         0.655613   0.575398   0.652071   0.550761
TF-IDF             0.674764   0.579906   0.672362   0.551909
Probase            0.670694   0.578889   0.670159   0.556414
ESA                0.670182   0.579798   0.670451   0.556517


Page 30: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

30

Document Clustering

• Divide a corpus into a specified number of groups

• Useful for information retrieval
  – Automatically generated topics for search results
  – Recommendations for similar items/pages
  – Visualization of the search space


Page 31: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

31

K-Means

1. Start with initial clusters
2. Compute the means of the clusters
3. Compare the cosine distance of each item to the means
4. Assign items to clusters based on minimum distance
5. Repeat from step 2 until convergence
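The K-Means loop can be sketched in a few lines of Python. This is a minimal illustration: the initial means here are simply the first k items, whereas the clustering experiment uses random initial assignments.

```python
import math

def cos_dist(u, v):
    """1 - cosine similarity; 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def kmeans(vectors, k, iters=100):
    # Step 1: initial clusters, seeded with the first k items.
    means = [list(v) for v in vectors[:k]]
    assign = None
    for _ in range(iters):
        # Steps 3-4: assign each item to the closest mean by cosine distance.
        new_assign = [min(range(k), key=lambda c: cos_dist(v, means[c]))
                      for v in vectors]
        if new_assign == assign:  # Step 5: stop once assignments converge.
            break
        assign = new_assign
        # Step 2: recompute the mean of each cluster.
        dims = len(vectors[0])
        means = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for vec, c in zip(vectors, assign):
            counts[c] += 1
            for d in range(dims):
                means[c][d] += vec[d]
        for c in range(k):
            if counts[c]:
                means[c] = [x / counts[c] for x in means[c]]
    return assign

print(kmeans([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], k=2))
```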


Page 32: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

32

Experimental Setup

1. Generate features for datasets
2. Randomly assign initial clusters
3. Run K-Means
4. Compute purity and ARI
5. Repeat steps 2-4 20 times for mean and standard deviation
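The two metrics in step 4 can be computed directly from predicted cluster labels and true class labels. This sketch uses the standard definitions of purity and the Adjusted Rand Index (ARI), not code from the experiment itself.

```python
from collections import Counter
from math import comb

def purity(pred, true):
    """Fraction of items in the majority class of their cluster."""
    total = 0
    for cluster in set(pred):
        labels = [t for p, t in zip(pred, true) if p == cluster]
        total += Counter(labels).most_common(1)[0][1]
    return total / len(pred)

def ari(pred, true):
    """Adjusted Rand Index from the pair-counting contingency table."""
    n = len(pred)
    pairs = Counter(zip(pred, true))   # contingency cell counts
    a = Counter(pred)                  # cluster sizes
    b = Counter(true)                  # class sizes
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_comb - expected) / (max_index - expected)

# A perfect clustering scores 1.0 on both metrics.
assert purity([0, 0, 1, 1], ["x", "x", "y", "y"]) == 1.0
```

Unlike purity, ARI corrects for chance agreement, which matters when comparing feature sets that produce different cluster-size distributions.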


Page 33: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

33

Experimental Data

• 20 Newsgroups (mini)
• 2,000 messages from Usenet newsgroups
• 100 messages per topic
• Filter messages for body text
• Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

From sci.electronics …

“A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a tv onto which you wound your own primary windings...”


Page 34: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

34

Results

Feature Set        Purity          ARI score
TF-IDF             0.379 ± 0.027   0.199 ± 0.023
Probase Only       0.265 ± 0.013   0.101 ± 0.010
Probase + TF-IDF   0.414 ± 0.034   0.241 ± 0.029
ESA Only           0.204 ± 0.010   0.040 ± 0.004
ESA + TF-IDF       0.389 ± 0.036   0.211 ± 0.032
LDA Only           N/A             N/A
LDA + TF-IDF       N/A             N/A

Page 35: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

35

Results Comparison

• Song et al. Tweets Clustering
  – Experiment #2: Subtle Cluster Distinctions
  – Used Tweets about NA, Asia, Africa, and Europe
  – Comparable performance for ESA and Probase Conceptualization
• Hotho et al. WordNet Clustering
  – Used Reuters dataset and Bisecting K-Means
  – Found best results for combined TF-IDF and feature sets
  – Overall improvement from WordNet features was comparable to Probase features (O[+10%])


Page 36: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

36

Conclusions

• Semantic analysis features are marginally beneficial in recommendation

• Structured data from a limited vocabulary works best for recommending “new items”

• Explicit and latent semantic analysis are comparable in recommendation

• Knowledge bases generated at Web-scale may be too noisy for narrow domain tasks

• Confirmed the efficacy of semantic analysis in document clustering tasks


Page 37: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

37

Future Directions

• Noise Reduction
  – Tune the recommender platform for “concepts”
  – Further explore the parameter space for feature generators
  – Hybrid Conceptualization / Named Entity Disambiguation?
• Domain-specific knowledge sources
  – Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])


Page 38: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

38

Further Reading
• Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011]
• Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009]
• Hybrid Recommender Using WordNet “Bag of Synsets” [Degemmis et al., 2007]
• Hybrid Recommender Using LDA [Jin et al., 2005]
• Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005]
• WordNet Improves Text Document Clustering [Hotho et al., 2003]


Page 39: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

39

Acknowledgements

• David Stern, Ulrich Paquet, Jurgen Van Gael• Haixun Wang, Yangqiu Song, Zhongyuan Wang• Special thanks to Evelyne Viegas!• Microsoft Research Connections

Page 40: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

40

References
• [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611.
• [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993-1022.
• [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011.
• [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120.
• [HetRec ’11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
• [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction, Vol. 17, Issue 3, 217-255.

Page 41: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

41

References
• [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617.
• [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396.
• [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'05), 1606-1611.
• [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544.
• [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval, Vol. 13, Issue 2, 101-131.

Page 42: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

42

Questions?

• Thanks for attending

Page 43: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

43

Appendix

A. Matchbox Details
B. Implementation Details
C. Probase Conceptualization Details
D. Explicit Semantic Analysis Details
E. Learnings from Probase

Page 44: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

44

(Appendix A) Matchbox

• [Stern et al., 2009]
• MSR Cambridge recommendation platform
• Implements a hybrid recommender using Infer.NET
  – Uses a combination of expectation propagation (EP) and variational message passing
• Reduces user, item, and context features to a low-dimensional trait space


Page 45: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

45

(Appendix A) Matchbox Setup

• Matchbox settings
  – Use 20 trait dimensions (determined experimentally)
  – 10 iterations of the EP algorithm
  – Trained on approx. 90% of ratings
  – Updated model with 75% of ratings per user (in the remaining 10%)
  – MAE computed for the remaining 25% per user


Page 46: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

46

(Appendix B) Implementation

• ESA: https://github.com/faraday/wikiprep-esa
• LDA: Infer.NET
• Probase: Probase Package v. 0.18
• TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
• Matchbox: http://codebox/matchbox


Page 47: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

47

(Appendix C) Probase Conceptualization

1. Identify all Probase terms in the text
2. Use a noisy-or model to combine:
  – Concepts from t_l as attribute (z_l = 1)
  – Concepts from t_l as entity/concept (z_l = 0)


Page 48: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

48

(Appendix C) Probase Conceptualization

3. Weight terms based on occurrence
  a. Naïve Bayes (similar to Song et al., 2011)
    • Compute P(c|t) for individual terms and use a Naïve Bayes model to derive concepts
    • Penalizes false positives, does not reward true positives
    • Generates very small probabilities for large numbers of terms
  b. Weighted Sum (similar to Gabrilovich et al., 2007)
    • Compute P(c|t) for individual terms and compute the sum over the document for each concept
    • Rewards true positives, does not penalize false positives (accurate concepts and inaccurate concepts, resp.)
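The two weighting schemes can be contrasted in a small sketch. The P(c|t) values, the prior P(c), and the smoothing constant are made-up stand-ins for real Probase lookups.

```python
from collections import defaultdict

# Hypothetical P(concept | term) scores in place of Probase lookups.
P_C_GIVEN_T = {
    "apple":     {"company": 0.6, "fruit": 0.4},
    "microsoft": {"company": 0.9},
}
P_C = {"company": 0.1, "fruit": 0.05}  # hypothetical concept priors

def score_sum(terms):
    """Weighted sum: rewards every supporting term, ignores absent evidence."""
    scores = defaultdict(float)
    for t in terms:
        for c, p in P_C_GIVEN_T.get(t, {}).items():
            scores[c] += p
    return dict(scores)

def score_naive_bayes(terms):
    """P(c|T) proportional to prod_l P(c|t_l) / P(c)^(L-1):
    a concept unsupported by any one term is heavily penalized."""
    scores = {}
    for c, prior in P_C.items():
        prob = 1.0
        for t in terms:
            # tiny floor stands in for smoothing of absent evidence
            prob *= P_C_GIVEN_T.get(t, {}).get(c, 1e-9)
        scores[c] = prob / prior ** (len(terms) - 1)
    return scores

s = score_sum(["apple", "microsoft"])
nb = score_naive_bayes(["apple", "microsoft"])
assert s["company"] > s["fruit"] and nb["company"] > nb["fruit"]
```

Both schemes favor "company" here, but the Naïve Bayes score for "fruit" collapses because "microsoft" offers no supporting evidence, whereas the sum merely fails to add to it.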


Page 49: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

49

(Appendix C) Probase Conceptualization

4. Penalize frequent concepts
  – Stop words (concepts) are domain-independent
  – For films, many domain-specific stop concepts
    • E.g., “movie”, “character”, “actor”, etc.
  – Inverse document frequency on concepts penalizes those that are too frequent
  – But it also rewards those that are too infrequent (appearing in only one document)
  – Solution: filter for minimum and maximum occurrence


Page 50: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

50

(Appendix C) Probase Conceptualization

• Using Summation (similar to Wikipedia ESA)
• Using Naïve Bayes from the Song et al. approach
  – P(c|T) = P(T|c)·P(c)/P(T) ∝ [∏_l P(c|t_l)] / P(c)^(L−1)
• Inverse document frequency for concepts
  – IDF(c_k) = log(# of documents / document frequency of c_k)
  – Minimum occurrence = 2
  – Maximum occurrence = 0.5 * # of documents
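The IDF-with-cutoffs step can be sketched as a filter over per-document concept dicts: drop concepts appearing in fewer than 2 documents or in more than half of them, then weight the survivors by inverse document frequency. The helper name and data layout are illustrative, not from the actual implementation.

```python
from math import log

def filter_and_weight(doc_concepts):
    """doc_concepts: list of {concept: weight} dicts, one per document.
    Returns the same list with stop concepts removed and IDF weighting applied."""
    n_docs = len(doc_concepts)
    # document frequency of each concept
    df = {}
    for doc in doc_concepts:
        for c in doc:
            df[c] = df.get(c, 0) + 1
    # keep concepts with minimum occurrence 2 and maximum 0.5 * n_docs
    keep = {c for c, f in df.items() if 2 <= f <= 0.5 * n_docs}
    return [{c: w * log(n_docs / df[c]) for c, w in doc.items() if c in keep}
            for doc in doc_concepts]
```

For a film corpus, this is what removes ubiquitous concepts like "movie" or "character" (too frequent) as well as one-off noise (too infrequent) before the features reach the recommender or the clusterer.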


Page 51: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

51

(Appendix D) Explicit Semantic Analysis

• Gabrilovich et al., 2007
• Builds an inverted index of Wikipedia content
• Input text converted to a weight vector of concepts based on TF-IDF
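The inverted-index idea can be shown with a toy sketch: words map to (Wikipedia article, TF-IDF weight) pairs, and an input text becomes a concept vector by summing the weights of its words. The two index entries below are invented for illustration; real ESA indexes the full Wikipedia.

```python
from collections import defaultdict

# Hypothetical inverted index: word -> [(Wikipedia concept, TF-IDF weight), ...]
INVERTED_INDEX = {
    "agent":      [("Espionage", 0.8), ("Chemistry", 0.1)],
    "helicopter": [("Aircraft", 0.9)],
}

def esa_vector(text):
    """Map input text to a weighted vector of Wikipedia concepts."""
    concepts = defaultdict(float)
    for word in text.lower().split():
        for article, weight in INVERTED_INDEX.get(word, []):
            concepts[article] += weight
    return dict(concepts)

print(esa_vector("the agent boarded a helicopter"))
```

Two texts are then compared by the cosine of their concept vectors, which is how ESA computes semantic relatedness between documents that share few surface words.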


Page 52: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

52

(Appendix E) Learnings from Probase

• Conceptualization works wonders for small numbers of entities

• Would be extremely useful in a large-scale QA environment with many semantic analysis and ML algorithms (e.g., Watson)

• A noisy source of knowledge is best suited to noise-tolerant IR applications

• Still being developed and improving!
  – Working on recognizing verbs