Post on 12-Jan-2016

Dr. Susan Gauch

When is a rock not a rock?

Conceptual Approaches to Personalized Search and Recommendations

Nov. 8, 2011

TResNet

Outline
• Background
• Motivation
• Collecting User Information
• Building Conceptual Profiles
• Using User Profiles in Search
  – Misearch
• Using User Profiles in Recommender Systems
  – MyCiteSeerx
• Issues with User Profiles

Background
• Information retrieval (IR) studies the indexing and retrieval of textual documents
• Searching for pages on the World Wide Web is the most recent “killer app”
• Concerned with retrieving relevant documents to a query
• Concerned with retrieving from large sets of documents efficiently

Web Search System

[Figure: Web search system. Labels: Web Spider, Document corpus, Query String, IR System, Ranked Documents (1. Page1, 2. Page2, 3. Page3, …)]

The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, $w_{ij}$.
• Both documents and queries are expressed as t-dimensional vectors:

$$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$$

Graphic Representation

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q plotted as vectors in the three-dimensional space with axes T1, T2, T3]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle?

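Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2} \cdot \sum_{i=1}^{t} w_{iq}^{2}}}$$

[Figure: vectors D1, D2 and Q plotted against axes t1, t2, t3, with the angle between Q and each document marked]

D1 is a 6 times better match than D2 using cosine similarity.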
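As a worked check of the cosine measure, the following minimal Python sketch applies it to the D1, D2 and Q vectors from the Graphic Representation slide. The function name `cos_sim` and the zero-vector handling are assumptions made for this illustration, not part of the talk.

```python
import math

def cos_sim(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    # Assumption: treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0

# Vectors from the Graphic Representation slide, ordered (T1, T2, T3).
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

sim_d1 = cos_sim(D1, Q)  # 10 / sqrt(38 * 4), about 0.81
sim_d2 = cos_sim(D2, Q)  # 2 / sqrt(59 * 4), about 0.13
print(round(sim_d1, 2), round(sim_d2, 2), round(sim_d1 / sim_d2, 1))
```

The ratio of the two scores is a little over 6, which agrees with the slide's statement that D1 is roughly a 6 times better match than D2.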

Motivation
• Search engines contain very large collections
  – Google reports over 1 trillion web pages
• Receive very short queries
  – 68% are 3 words long or less
• Users examine few results
  – rarely go beyond first page
  – rarely examine more than 1 result
  – Exacerbated by small mobile screens

Ambiguity
• How to return precise results with ambiguous queries?
• Return results based on simple keyword matches
• No consideration of differing meanings
• If the query is “salsa”, is it…

Dealing with Ambiguity

• Expand user queries using a thesaurus
  – “An Expert System for Searching in Full-Text,” Susan Gauch, 1990
  – Basically, make query vectors longer so they are more likely to match documents
• Represent documents and queries using high-level concepts instead of keywords
  – “Conceptual Search with KeyConcept,” Susan Gauch, 2010
  – Basically, reduce the dimensions of the vectors to provide a conceptual match
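The two strategies can be contrasted with a small Python sketch. The thesaurus, the keyword-to-concept map, and the function names below are hypothetical toy data invented for this illustration; they are not taken from the cited systems.

```python
# Hypothetical toy resources; real systems would use a full thesaurus
# and a trained classifier rather than hand-written dictionaries.
THESAURUS = {"car": ["automobile", "auto"], "film": ["movie"]}
CONCEPT_OF = {
    "car": "Autos", "automobile": "Autos", "auto": "Autos",
    "film": "Movies", "movie": "Movies",
}

def expand_query(terms):
    """Thesaurus expansion: the query gains terms (a longer vector)."""
    expanded = list(terms)
    for term in terms:
        for synonym in THESAURUS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

def to_concepts(terms):
    """Conceptual representation: many keywords collapse onto few concepts."""
    counts = {}
    for term in terms:
        concept = CONCEPT_OF.get(term)
        if concept is not None:
            counts[concept] = counts.get(concept, 0) + 1
    return counts

expanded = expand_query(["car", "film"])
concepts = to_concepts(expanded)
print(expanded)   # five keyword dimensions
print(concepts)   # two concept dimensions
```

Expansion grows the two-term query to five terms, while the conceptual view maps those same five terms onto only two dimensions.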

Ontologies

• A structured set of concepts
• Where do ontologies come from?

Semantic Web

• Manually build ontologies
• Experts manually tag data items
• Very “intelligent” but not scalable

IR Community

• Use implicit ontologies
  – Wikipedia
  – Open Directory Project
• Develop automated techniques to tag items
• Not as “intelligent” but much more scalable

Need for Personalization

• All users get identical results for identical queries

• No distinction between veterinarian and child for query “beagle puppy”

• Need for personalized results based on background and current context

• How to pick the best 10 (or 1!) results for _you_?

How to Personalize

• Build a user profile that represents user interests
  – Collect information
  – Construct user profile
  – Use user profile for personalized interactions

Collecting User Information

• Explicit user information
  – Users fill in site-specific surveys
  – Users are too lazy or busy
  – Data may be deliberately or accidentally inaccurate
  – Information becomes out of date
• Implicit user information
  – Software collects information about user activity as they perform regular activities
  – Information is
    • indirect
    • noisy
  – Various approaches used by well-known applications

Implicit Sources

• Browsing histories
  – User connects to Internet via a proxy
  – User periodically shares history
  – Pros:
    • captures browsing activity at multiple sites
  – Cons:
    • captures history from only one computer

[Screenshot: My Browsing History, used to autofill URLs]

Implicit Sources

• Desktop toolbar
  – User must install desktop toolbar
  – Communication between toolbar and site
  – Pros:
    • interactions tracked across multiple sites
    • access to desktop windows, file system
  – Cons:
    • user must install software
    • fine line between toolbar and spyware

[Screenshot: Google’s Toolbar, used to personalize search]

Implicit Sources

• User account
  – user activity is tracked via cookies/session variables
  – best if user signs in to retain same profile across multiple machines
  – Pros:
    • users tracked across all interactions
  – Cons:
    • only works at one site
    • users must create an account

[Screenshot: Amazon’s login, used for recommendations]

Our Approach

• Personalization based on implicit data
• Represent profile using weighted conceptual taxonomy
• Use profile for personalization in many different ways
  – OBIWAN: Web browsing
  – Misearch: Web search
  – MyCiteSeerx: recommender system

Building a Conceptual Profile

• Need an ontology for the domain
• Need a collection of text that represents the user’s interests
• Need classification technique
  – train classifier with training data
  – classify user texts w.r.t. ontology/taxonomy/concept hierarchy/thesaurus/knowledge base
  – accumulate weights
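The train, classify, accumulate steps above can be sketched as follows. This is a minimal illustration, not the classifier used in the talk: it assumes a simple nearest-centroid classifier over raw term counts, assigns each text to its single best concept, and normalizes the final weights to sum to 1. The concept names, training texts and user texts are invented.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (no stemming or stop-word removal)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(training_docs):
    """Step 1: one summed term vector (centroid) per concept."""
    return {concept: sum((vectorize(d) for d in docs), Counter())
            for concept, docs in training_docs.items()}

def build_profile(user_texts, centroids):
    """Steps 2 and 3: classify each user text, then accumulate weights."""
    profile = Counter()
    for text in user_texts:
        vec = vectorize(text)
        concept, score = max(((c, cosine(vec, cen)) for c, cen in centroids.items()),
                             key=lambda pair: pair[1])
        if score > 0:
            profile[concept] += score
    total = sum(profile.values())
    # Assumption: normalize so that the weights sum to 1.
    return {c: w / total for c, w in profile.items()} if total else {}

# Hypothetical training data for two concepts of a toy hierarchy.
TRAINING = {
    "Arts/Music": ["guitar chords and jazz piano", "concert band music"],
    "Games": ["chess openings and board games", "video games console"],
}
USER_TEXTS = ["jazz guitar lessons", "piano concert tickets", "chess strategy"]

centroids = train(TRAINING)
profile = build_profile(USER_TEXTS, centroids)
print(profile)
```

With this toy data two of the three user texts fall under Arts/Music, so that concept ends up with the larger weight in the profile.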

[Figure: classifier training and use. Labels: concept hierarchy (Root; Arts, Games; Music, Design, Comics), each concept with training documents Doc 1 … Doc n; Traditional Indexer; Concept Database; New documents; Classifier Results]

Building the User Profile

User Profile Representation

[Figure: weighted concept tree under Root, with node weights Entertainment 0.01, Homemaking 0.04, Cooking 0.49, Lessons 0.3, Videos 0.1]

MiSearch

• User search histories
  – information available to search engine itself
  – collect the user’s queries, clicked on search results
  – no software installed
• Users create accounts
  – login
  – just track userid in a cookie during the session
  – Similar to Amazon, eBay, etc.

Personalizing Search Results

• Submit query to Internet search engine (e.g., Google)

• Categorize each result into same concept hierarchy to create result profiles
  – top 3 levels of ODP, ~3,000 categories

• Calculate similarity between result profile and user profile
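One way to picture the last step is the re-ranking sketch below. It assumes profiles are dictionaries from concept name to weight, uses cosine similarity as the comparison, and breaks ties by the engine's original order; the talk does not specify these details. The result titles and category weights are invented, loosely following the “canon book” example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight dictionaries."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
           math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rerank(results, user_profile):
    """results: list of (title, result_profile) in the engine's original order."""
    scored = [(cosine(profile, user_profile), -rank, title)
              for rank, (title, profile) in enumerate(results)]
    # Highest similarity first; ties keep the original engine order.
    scored.sort(reverse=True)
    return [title for _, _, title in scored]

# Hypothetical classified results for the query "canon book".
RESULTS = [
    ("Canon EOS camera handbook", {"Arts/Photography": 0.9, "Shopping": 0.1}),
    ("The Western Canon (book)", {"Arts/Literature": 1.0}),
    ("Canon printer deals", {"Shopping": 1.0}),
]
classics_user = {"Arts/Literature": 0.7, "Society/History": 0.3}
photo_user = {"Arts/Photography": 0.8, "Computers": 0.2}

for_classics = rerank(RESULTS, classics_user)
for_photo = rerank(RESULTS, photo_user)
print(for_classics[0])
print(for_photo[0])
```

The same result list is ordered differently for the two users: the literary result moves to the top for the classics profile, and the camera handbook stays on top for the photography profile.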

Ambiguous: “canon book”

User Profile (Classics)

User Profile (Photography)

MyCiteSeerx

• Categorize contents of CiteSeerx with respect to ACM CCS topic hierarchy

• Users create an account
• Capture their queries and clicked-on documents
• Build a conceptual profile
• Compare user concepts to document concepts to create recommendations
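A minimal sketch of the final comparison step, under stated assumptions: documents and users are both represented as concept-weight dictionaries, the match score is a plain weighted overlap (dot product), and documents the user has already clicked are skipped. The document identifiers and weights are invented, and the category codes are used only as illustrative labels.

```python
import heapq

def recommend(user_profile, doc_profiles, seen, k=2):
    """Return the k unseen documents whose concepts best match the user's."""
    def score(doc_id):
        concepts = doc_profiles[doc_id]
        return sum(w * concepts.get(c, 0.0) for c, w in user_profile.items())

    candidates = [d for d in doc_profiles if d not in seen]
    return heapq.nlargest(k, candidates, key=score)

# Hypothetical document profiles keyed by topic-hierarchy category codes.
DOCS = {
    "paper-a": {"H.3.3": 0.9, "I.2.6": 0.1},
    "paper-b": {"H.5.1": 1.0},
    "paper-c": {"H.3.3": 0.5, "H.5.1": 0.5},
    "paper-d": {"I.2.6": 1.0},
}
ir_user = {"H.3.3": 0.8, "I.2.6": 0.2}
multimedia_user = {"H.5.1": 1.0}

ir_recs = recommend(ir_user, DOCS, seen={"paper-a"})
mm_recs = recommend(multimedia_user, DOCS, seen=set())
print(ir_recs)
print(mm_recs)
```

Two users with different profiles receive different recommendation lists from the same collection, which is the behaviour the two example users on the following slides illustrate.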

User interested in IR

Their recommendations

User interested in multimedia

Their recommendations

Recent Work

• Bridge gap between Semantic Web and Information Retrieval
  – Semi-automatically build domain-specific ontologies
    • Do text mining from domain-specific literature collection

Conclusions

• Information on which to base user profiles can be collected via interactions with a specific site

• Conceptual profiles can be used to improve search (misearch)

• Conceptual profiles can be used to provide conceptual recommendations for the CiteSeerx collection

• Creates issues for profile sharing and user privacy

• Leads to work on how to reuse/expand/build ontologies for narrow domains
