
Applying Semantic Analyses to Content-based Recommendation and Document Clustering
Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute

Upload: kamal

Post on 25-Feb-2016


DESCRIPTION

Applying Semantic Analyses to Content-based Recommendation and Document Clustering. Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute. Bio: Graduate Student @ Rensselaer Polytechnic Institute; Research Assistant @ Tetherless World Constellation. (PowerPoint presentation.)

TRANSCRIPT

Page 1: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

Applying Semantic Analyses to Content-based Recommendation and Document Clustering

Eric Rozell, MRC Intern, Rensselaer Polytechnic Institute

Page 2: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

2

Bio

• Graduate Student @ Rensselaer Polytechnic Institute
• Research Assistant @ Tetherless World Constellation
• Student Fellow @ Federation of Earth Science Information Partners
• Research Advisor: Peter Fox
• Research Focus: Semantic eScience
• Contact: [email protected]

Page 3: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

3

Outline
• Background
• Semantic Analysis
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – Latent Dirichlet Allocation
• Recommendation Experiment
  – Recommendation Systems
  – Experiment Setup
  – Results
• Clustering Experiment
  – Problem
  – K-Means
  – Results
• Conclusions


Page 4: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

4

Background

• Billions of documents on the Web
• Semi-structured data from Web 2.0 (e.g., tags, microformats)
• Most knowledge remains in unstructured text
• Many natural language techniques for:
  – Ontology extraction
  – Topic extraction
  – Named entity recognition/disambiguation
• Some techniques are better than others for various information retrieval tasks…


Page 5: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

5

Probase

• Developed at Microsoft Research Asia
• Probabilistic knowledge base built from the Bing index and query logs (and other sources)
• Text mining patterns
  – Namely, Hearst patterns: “… artists such as Picasso”
    • Evidence for hypernym(artists, Picasso)
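The Hearst-pattern evidence can be sketched with a toy extractor. This is a hypothetical single-pattern regex for illustration only; the real Probase pipeline mines many patterns at Web scale and aggregates frequencies into probabilities.

```python
import re

# One Hearst pattern ("X such as Y1, Y2, ...") as a minimal sketch.
HEARST_SUCH_AS = re.compile(
    r"(?P<hypernym>\w[\w ]*?) such as (?P<hyponyms>\w[\w, ]*)"
)

def extract_isa_pairs(sentence):
    """Return (hyponym, hypernym) evidence pairs found in one sentence."""
    pairs = []
    m = HEARST_SUCH_AS.search(sentence)
    if m:
        # keep the head noun of the hypernym phrase, e.g. "artists"
        hypernym = m.group("hypernym").split()[-1]
        for hyponym in m.group("hyponyms").split(","):
            pairs.append((hyponym.strip(), hypernym))
    return pairs

print(extract_isa_pairs("famous artists such as Picasso, Matisse"))
```

Each extracted pair is one piece of evidence for a hypernym relation such as hypernym(artists, Picasso); aggregated over a Web-scale corpus, these counts become the probabilistic is-a edges in the knowledge base.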


Page 6: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

6

Probase

[Diagram slide; no recoverable text.]

Page 7: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

7

Probase
• Very capable at conceptualizing groups of entities:
  – “China; India; United States” yields “country”
  – “China; India; Brazil; Russia” yields “emerging market”
• Differentiates attributes and entities
  – “birthday” -> “person” as attribute
  – “birthday” -> “occasion” as entity
• Applications
  – Clustering Tweets from Concepts [Song et al., 2011]
  – Understanding Web Tables
  – Query Expansion (Topic Search)


Page 8: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

8

Research Questions
• What’s the best way of extracting concepts from text?
  – Compare techniques for semantic analysis
• How are extracted concepts useful?
  – Generate data about where semantic analysis techniques are applicable
• Are user ratings affected by the concepts in media items such as movies?
  – Test semantic analysis techniques in recommender systems
• How useful is Web-scale domain knowledge in narrower domains for information retrieval?
  – Identify the need for domain-specific knowledge


Page 9: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

9

Semantic Analysis
• Generating meaning (concepts) from text
• Specifically, get prevalent hypernyms
  – E.g., “… Apple, IBM, and Microsoft …”
  – “technology companies”
• Semantic analysis using external knowledge
  – Probase Conceptualization
  – Explicit Semantic Analysis
  – WordNet Synsets
• Semantic analysis using latent features
  – Latent Dirichlet Allocation
  – Latent Semantic Analysis


Page 10: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

10

Probase Conceptualization

[Diagram: for each document in the corpus, plain text is split into terms (t1, t2, t3, t4, …); Probase maps the terms to candidate concepts (c1, c2, c3, c4, …) via Naïve Bayes or summation; inverse document frequency and filtering then yield the final document concepts.]

Page 11: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

11

Probase Conceptualization

• “Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen) who impresses the other toys and Andy, who starts to like Buzz more than Woody. Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…”


Text Source: Internet Movie Database (IMDb)

Page 12: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

12

Sample Features for “Toy Story” (Probase)

• dvd encryptions 0.050 “RC”
• duty free item 0.044 “toys”
• generic word 0.043 “they, travel, it, …”
• satellite mission 0.032 “reconnaissance mission”
• creator-owned work 0.020 “Woody”
• amazing song 0.013 “fury”
• doubtful word 0.013 “overcome”
• ill-fated tool 0.013 “Buzz”
• lovable “toy story” character 0.011 “Buzz Lightyear, Woody, …”
• pleased star 0.010 “Woody”
• trail builder 0.010 “Woody”


Page 13: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

13

Explicit Semantic Analysis

Image Source: Gabrilovich et al., 2007


Page 14: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

14

Sample Features for “Toy Story” (ESA)

• #REDIRECT [[Buzz!]] 0.034
• #REDIRECT [[The Buzz]] 0.028
• #REDIRECT [[Buzz (comics)]] 0.027
• #REDIRECT [[Buzz cut]] 0.027
• #REDIRECT [[Buzz (DC Thomson)]] 0.024
• #REDIRECT [[Buzz Out Loud]] 0.024
• #REDIRECT [[The Daily Buzz]] 0.023
• #REDIRECT [[Buzz Aldrin]] 0.022
• #REDIRECT [[Buzz cut]] 0.022
• #REDIRECT [[Buzzing Tree Frog]] 0.022


Page 15: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

15

Latent Dirichlet Allocation

• Blei et al., 2003
• Unsupervised learning method
• “Generates” documents from Dirichlet distributions over words and topics
• Topic distributions over documents can be inferred from the corpus

Image Source: Wikipedia


Page 16: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

16

Recommendation Systems

• Collaborative Filtering
  – “Customers who purchased X also purchased Y.”
• Content-based
  – “Because you enjoyed ‘GoldenEye’, you may want to watch ‘Mission: Impossible’.”
• Hybrid
  – Most modern systems take a hybrid approach.


Page 17: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

17

Content-based Recommendation

• In the GoldenEye/Mission: Impossible example…
  – Structured item content
    • Genre – Action/Adventure/Thriller
    • Tags – Action, Espionage, Adventure
  – Unstructured item content
    • Plot synopses – “helicopter, agent, infiltrate, CIA, …”
    • Concepts? – “aircraft, intelligence agency, …”


Page 18: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

18

Recommendation Systems

[Diagram: three approach families (collaborative filtering approaches, structured content-based approaches, unstructured content-based approaches); the semantic analysis approaches are tested in the unstructured content-based setting.]


Page 19: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

19

Experiment

[Diagram: movie synopses from IMDb feed feature generation; the generated features and movie ratings from MovieLens go into the Matchbox recommendation platform, which is evaluated by mean absolute error (MAE).]


Page 20: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

20

Matchbox

Source: Matchbox API Documentation


Page 21: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

21

Experimental Data
• Data: MovieLens dataset [HetRec ’11]
  – 855,598 ratings
  – 10,197 movies
  – 2,113 users
• Movie synopses from IMDb (http://www.imdb.com)
  – Collected synopses for 2,633 movies
  – With 435,043 ratings
  – From 2,113 users
• Ratings data:
  – Scored by half points from 0.5 to 5
• Choose different numbers of movies (200; 1,000; all)
• Train on 90% of ratings, test on remaining 10%


Page 22: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

22

Experimental Data

• Controls
  – Baseline 1: Only features are user IDs and movie IDs
  – Baseline 2: User IDs, movie IDs, movie genre
  – Baseline 3: User IDs, movie IDs, movie tags
• Feature Sets
  – Term Frequency – Inverse Document Frequency (TF-IDF)
  – Latent Dirichlet Allocation
  – Explicit Semantic Analysis
  – Probase Conceptualization


Page 23: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

23

Experimental Setup

• 4 scenarios (training: white, testing: black)

[Diagram: four users × movies grids showing which user/movie combinations are held out for testing in each scenario.]


Page 24: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

24

Results

[Chart: MAE values between 0.56 and 0.595 plotted over an x-axis from 1 to 10 for Baseline #1, Baseline #2, Baseline #3, TFIDF Normalized, Probase Sum, and ESA.]


Page 25: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

25

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.672293      0.71654    0.802044
Baseline 2         0.641556      0.683297   0.752745
Baseline 3         0.655613      0.68994    0.764369
TF-IDF             0.674764      0.706914   0.815245
Probase            0.670694      0.715456   0.797196
ESA                0.670182      0.714967   0.796787
LDA (unfinished)                 0.711307   0.790362


• Testing set contains users and movies not seen in the training set
• Recommendations based on item features alone
• Small amounts of structured data (e.g., genre) are the most influential in this scenario

Page 26: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

26

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.580087      0.564226   0.577349
Baseline 2         0.576183      0.563028   0.576673
Baseline 3         0.575398      0.563378   0.572297
TF-IDF             0.579906      0.575932   0.588288
Probase            0.578889      0.563669   0.578089
ESA                0.579798      0.564334   0.577638
LDA (unfinished)                 0.566639   0.579633


• Testing set contains users not seen in the training set
• Lots of collaborative data available (explains comparable performance across all feature sets)
• Given extensive collaborative data, item features are marginally beneficial (in Matchbox)

Page 27: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

27

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.672843      0.687586   0.832491
Baseline 2         0.639683      0.651141   0.81416
Baseline 3         0.652071      0.66492    0.745593
TF-IDF             0.672362      0.665116   0.844305
Probase            0.670159      0.686235   0.823972
ESA                0.670451      0.683594   0.817306
LDA (unfinished)                 0.684689   0.852056


• Testing set contains movies not seen in the training set
• Recommendations based on item features and extensive information on the users’ “rating model”
• Small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)

Page 28: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

28

Results

# of Movies        All (2,633)   1,000      200
Baseline 1         0.560163      0.564673   0.568706
Baseline 2         0.556011      0.556456   0.567598
Baseline 3         0.550761      0.561643   0.56445
TF-IDF             0.551909      0.558942   0.588288
Probase            0.556414      0.558113   0.567332
ESA                0.556517      0.55706    0.568174
LDA (unfinished)                 0.558105   0.568927


• Testing set contains users and movies seen in the training set
• Recommendations again are primarily collaborative
• Given a large corpus of rating data for users and items, item features are only marginally beneficial

Page 29: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

29

Results

Experiment         1          2          3          4
Baseline 1         0.672293   0.580087   0.672843   0.560163
Baseline 2         0.641556   0.576183   0.639683   0.556011
Baseline 3         0.655613   0.575398   0.652071   0.550761
TF-IDF             0.674764   0.579906   0.672362   0.551909
Probase            0.670694   0.578889   0.670159   0.556414
ESA                0.670182   0.579798   0.670451   0.556517


Page 30: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

30

Document Clustering

• Divide a corpus into a specified number of groups

• Useful for information retrieval
  – Automatically generated topics for search results
  – Recommendations for similar items/pages
  – Visualization of the search space


Page 31: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

31

K-Means

1. Start with initial clusters
2. Compute the means of the clusters
3. Compare the cosine distance of each item to the means
4. Assign items to clusters based on minimum distance
5. Repeat from step 2 until convergence
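The K-Means loop can be sketched in a few lines of Python. This is a minimal illustration: the initial means here are simply the first k items, whereas the clustering experiment uses random initial assignments.

```python
import math

def cos_dist(u, v):
    """1 - cosine similarity; 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def kmeans(vectors, k, iters=100):
    # Step 1: initial clusters, seeded with the first k items.
    means = [list(v) for v in vectors[:k]]
    assign = None
    for _ in range(iters):
        # Steps 3-4: assign each item to the closest mean by cosine distance.
        new_assign = [min(range(k), key=lambda c: cos_dist(v, means[c]))
                      for v in vectors]
        if new_assign == assign:  # Step 5: stop once assignments converge.
            break
        assign = new_assign
        # Step 2: recompute the mean of each cluster.
        dims = len(vectors[0])
        means = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for vec, c in zip(vectors, assign):
            counts[c] += 1
            for d in range(dims):
                means[c][d] += vec[d]
        for c in range(k):
            if counts[c]:
                means[c] = [x / counts[c] for x in means[c]]
    return assign

print(kmeans([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], k=2))
```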


Page 32: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

32

Experimental Setup

1. Generate features for datasets
2. Randomly assign initial clusters
3. Run K-Means
4. Compute purity and ARI
5. Repeat steps 2-4 20 times for mean and standard deviation
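The two metrics in step 4 can be computed directly from predicted cluster labels and true class labels. This sketch uses the standard definitions of purity and the Adjusted Rand Index (ARI), not code from the experiment itself.

```python
from collections import Counter
from math import comb

def purity(pred, true):
    """Fraction of items in the majority class of their cluster."""
    total = 0
    for cluster in set(pred):
        labels = [t for p, t in zip(pred, true) if p == cluster]
        total += Counter(labels).most_common(1)[0][1]
    return total / len(pred)

def ari(pred, true):
    """Adjusted Rand Index from the pair-counting contingency table."""
    n = len(pred)
    pairs = Counter(zip(pred, true))   # contingency cell counts
    a = Counter(pred)                  # cluster sizes
    b = Counter(true)                  # class sizes
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_comb - expected) / (max_index - expected)

# A perfect clustering scores 1.0 on both metrics.
assert purity([0, 0, 1, 1], ["x", "x", "y", "y"]) == 1.0
```

Unlike purity, ARI corrects for chance agreement, which matters when comparing feature sets that produce different cluster-size distributions.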


Page 33: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

33

Experimental Data

• 20 Newsgroups (mini)
• 2,000 messages from Usenet newsgroups
• 100 messages per topic
• Filter messages for body text
• Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

From sci.electronics …

“A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a tv onto which you wound your own primary windings...”


Page 34: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

34

Results

Feature Set        Purity          ARI score
TF-IDF             0.379 ± 0.027   0.199 ± 0.023
Probase Only       0.265 ± 0.013   0.101 ± 0.010
Probase + TF-IDF   0.414 ± 0.034   0.241 ± 0.029
ESA Only           0.204 ± 0.010   0.040 ± 0.004
ESA + TF-IDF       0.389 ± 0.036   0.211 ± 0.032
LDA Only           N/A             N/A
LDA + TF-IDF       N/A             N/A

Page 35: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

35

Results Comparison

• Song et al. Tweets Clustering
  – Experiment #2: Subtle Cluster Distinctions
  – Used Tweets about NA, Asia, Africa, and Europe
  – Comparable performance for ESA and Probase Conceptualization
• Hotho et al. WordNet Clustering
  – Used Reuters dataset and Bisecting K-Means
  – Found best results for combined TF-IDF and feature sets
  – Overall improvement from WordNet features was comparable to Probase features (O[+10%])


Page 36: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

36

Conclusions

• Semantic analysis features are marginally beneficial in recommendation

• Structured data from a limited vocabulary works best for recommending “new items”

• Explicit and latent semantic analysis are comparable in recommendation

• Knowledge bases generated at Web-scale may be too noisy for narrow domain tasks

• Confirmed the efficacy of semantic analysis in document clustering tasks


Page 37: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

37

Future Directions

• Noise Reduction
  – Tune the recommender platform for “concepts”
  – Further explore the parameter space for feature generators
  – Hybrid Conceptualization / Named Entity Disambiguation?
• Domain-specific knowledge sources
  – Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])


Page 38: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

38

Further Reading
• Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011]
• Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009]
• Hybrid Recommender Using WordNet “Bag of Synsets” [Degemmis et al., 2007]
• Hybrid Recommender Using LDA [Jin et al., 2005]
• Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005]
• WordNet Improves Text Document Clustering [Hotho et al., 2003]


Page 39: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

39

Acknowledgements

• David Stern, Ulrich Paquet, Jurgen Van Gael• Haixun Wang, Yangqiu Song, Zhongyuan Wang• Special thanks to Evelyne Viegas!• Microsoft Research Connections

Page 40: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

40

References
• [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611.
• [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993-1022.
• [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In IJCAI 2011.
• [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120.
• [HetRec ’11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA.
• [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction, Vol. 17, Issue 3, 217-255.

Page 41: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

41

References
• [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617.
• [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396.
• [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'05), 1606-1611.
• [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544.
• [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval, Vol. 13, Issue 2, 101-131.

Page 42: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

42

Questions?

• Thanks for attending

Page 43: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

43

Appendix

A. Matchbox Details
B. Implementation Details
C. Probase Conceptualization Details
D. Explicit Semantic Analysis Details
E. Learnings from Probase

Page 44: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

44

(Appendix A) Matchbox

• [Stern et al., 2009]
• MSR Cambridge recommendation platform
• Implements a hybrid recommender using Infer.NET
  – Uses a combination of expectation propagation (EP) and variational message passing
• Reduces user, item, and context features to a low-dimensional trait space


Page 45: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

45

(Appendix A) Matchbox Setup

• Matchbox settings
  – Use 20 trait dimensions (determined experimentally)
  – 10 iterations of the EP algorithm
  – Trained on approx. 90% of ratings
  – Updated model with 75% of ratings per user (in the remaining 10%)
  – MAE computed for the remaining 25% per user


Page 46: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

46

(Appendix B) Implementation

• ESA: https://github.com/faraday/wikiprep-esa
• LDA: Infer.NET
• Probase: Probase Package v. 0.18
• TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx
• Matchbox: http://codebox/matchbox


Page 47: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

47

(Appendix C) Probase Conceptualization

1. Identify all Probase terms in the text
2. Use a noisy-or model to combine:
  – Concepts from t_l as attribute (z_l = 1)
  – Concepts from t_l as entity/concept (z_l = 0)


Page 48: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

48

(Appendix C) Probase Conceptualization

3. Weight terms based on occurrence
  a. Naïve Bayes (similar to Song et al., 2011)
    • Compute P(c|t) for individual terms and use a Naïve Bayes model to derive concepts
    • Penalizes false positives, does not reward true positives
    • Generates very small probabilities for large numbers of terms
  b. Weighted Sum (similar to Gabrilovich et al., 2007)
    • Compute P(c|t) for individual terms and compute the sum over the document for each concept
    • Rewards true positives, does not penalize false positives (accurate concepts and inaccurate concepts, resp.)
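The two weighting schemes can be contrasted in a small sketch. The P(c|t) values, the prior P(c), and the smoothing constant are made-up stand-ins for real Probase lookups.

```python
from collections import defaultdict

# Hypothetical P(concept | term) scores in place of Probase lookups.
P_C_GIVEN_T = {
    "apple":     {"company": 0.6, "fruit": 0.4},
    "microsoft": {"company": 0.9},
}
P_C = {"company": 0.1, "fruit": 0.05}  # hypothetical concept priors

def score_sum(terms):
    """Weighted sum: rewards every supporting term, ignores absent evidence."""
    scores = defaultdict(float)
    for t in terms:
        for c, p in P_C_GIVEN_T.get(t, {}).items():
            scores[c] += p
    return dict(scores)

def score_naive_bayes(terms):
    """P(c|T) proportional to prod_l P(c|t_l) / P(c)^(L-1):
    a concept unsupported by any one term is heavily penalized."""
    scores = {}
    for c, prior in P_C.items():
        prob = 1.0
        for t in terms:
            # tiny floor stands in for smoothing of absent evidence
            prob *= P_C_GIVEN_T.get(t, {}).get(c, 1e-9)
        scores[c] = prob / prior ** (len(terms) - 1)
    return scores

s = score_sum(["apple", "microsoft"])
nb = score_naive_bayes(["apple", "microsoft"])
assert s["company"] > s["fruit"] and nb["company"] > nb["fruit"]
```

Both schemes favor "company" here, but the Naïve Bayes score for "fruit" collapses because "microsoft" offers no supporting evidence, whereas the sum merely fails to add to it.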


Page 49: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

49

(Appendix C) Probase Conceptualization

4. Penalize frequent concepts
  – Stop words (concepts) are domain-independent
  – For films, many domain-specific stop concepts
    • E.g., “movie”, “character”, “actor”, etc.
  – Inverse document frequency on concepts penalizes those that are too frequent
  – But it also rewards those that are too infrequent (appearing in only one document)
  – Solution: filter for minimum and maximum occurrence


Page 50: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

50

(Appendix C) Probase Conceptualization

• Using Summation (similar to Wikipedia ESA)
• Using Naïve Bayes from the Song et al. approach
  – P(c|T) = P(T|c)·P(c)/P(T) ∝ [∏_l P(c|t_l)] / P(c)^(L−1)
• Inverse document frequency for concepts
  – IDF(c_k) = log(# of documents / document frequency of c_k)
  – Minimum occurrence = 2
  – Maximum occurrence = 0.5 * # of documents
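The IDF-with-cutoffs step can be sketched as a filter over per-document concept dicts: drop concepts appearing in fewer than 2 documents or in more than half of them, then weight the survivors by inverse document frequency. The helper name and data layout are illustrative, not from the actual implementation.

```python
from math import log

def filter_and_weight(doc_concepts):
    """doc_concepts: list of {concept: weight} dicts, one per document.
    Returns the same list with stop concepts removed and IDF weighting applied."""
    n_docs = len(doc_concepts)
    # document frequency of each concept
    df = {}
    for doc in doc_concepts:
        for c in doc:
            df[c] = df.get(c, 0) + 1
    # keep concepts with minimum occurrence 2 and maximum 0.5 * n_docs
    keep = {c for c, f in df.items() if 2 <= f <= 0.5 * n_docs}
    return [{c: w * log(n_docs / df[c]) for c, w in doc.items() if c in keep}
            for doc in doc_concepts]
```

For a film corpus, this is what removes ubiquitous concepts like "movie" or "character" (too frequent) as well as one-off noise (too infrequent) before the features reach the recommender or the clusterer.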


Page 51: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

51

(Appendix D) Explicit Semantic Analysis

• Gabrilovich et al., 2007
• Builds an inverted index of Wikipedia content
• Input text converted to a weight vector of concepts based on TF-IDF
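The inverted-index idea can be shown with a toy sketch: words map to (Wikipedia article, TF-IDF weight) pairs, and an input text becomes a concept vector by summing the weights of its words. The two index entries below are invented for illustration; real ESA indexes the full Wikipedia.

```python
from collections import defaultdict

# Hypothetical inverted index: word -> [(Wikipedia concept, TF-IDF weight), ...]
INVERTED_INDEX = {
    "agent":      [("Espionage", 0.8), ("Chemistry", 0.1)],
    "helicopter": [("Aircraft", 0.9)],
}

def esa_vector(text):
    """Map input text to a weighted vector of Wikipedia concepts."""
    concepts = defaultdict(float)
    for word in text.lower().split():
        for article, weight in INVERTED_INDEX.get(word, []):
            concepts[article] += weight
    return dict(concepts)

print(esa_vector("the agent boarded a helicopter"))
```

Two texts are then compared by the cosine of their concept vectors, which is how ESA computes semantic relatedness between documents that share few surface words.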


Page 52: Applying Semantic Analyses to Content-based Recommendation and Document Clustering

52

(Appendix E) Learnings from Probase

• Conceptualization works wonders for small numbers of entities

• Would be extremely useful in a large-scale QA environment with many semantic analysis and ML algorithms (e.g., Watson)

• A noisy source of knowledge is best suited to noise-tolerant IR applications

• Still being developed and improving!
  – Working on recognizing verbs