Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research
Search and recommendation are about matching.
Queries, Documents, Websites, Users
Term-space matching is not always a good idea.
Granularity, Sparsity, Efficiency
Can we build representations beyond term vectors?
Topic Category, Reading Level, Sentiment, Style
What would be their implications for search and recommendation?
Queries, Documents, Websites, Users
Topic Category, Reading Level, Sentiment, Style
In a Nutshell
WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  In order to characterize and compare entities
WHAT WE FOUND:
  Profile matching predicts users' content preferences
  Profiles can indicate when not to personalize
  Profile features can predict expert content
Building Reading Level and Topic Profiles
Predicting Reading Level and Topic for a URL
  Reading Level Classifier: based on a language model and other sources
  Topic Classifier: trained using the URLs in each Open Directory Project (ODP) category
  Profile: distribution over reading level, topic, or reading level and topic (RLT), e.g. P(R|d1), P(T|d1)
Entity Profile Built from Related URLs
Entities and related URLs
  Websites: content vs. user-viewed URLs
  Users: URLs visited during search sessions
  Queries: top-10 retrieved URLs
Example: site profile made from URLs visited during search sessions
[Figure: the per-URL predictions P(R|d1), P(T|d1) of each related URL are combined into the site profile P(R,T|s)]
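The aggregation from per-URL predictions to an entity profile can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how the per-URL distributions are combined, so the sketch assumes that reading level and topic are independent given a URL and that every related URL is weighted equally. The function names and the toy numbers are hypothetical.

```python
from collections import defaultdict

def url_joint(p_r, p_t):
    """Joint P(R,T|d) for one URL.

    Assumption (not stated in the slides): R and T are independent given d,
    so the joint is the product of the two classifier outputs.
    """
    return {(r, t): pr * pt for r, pr in p_r.items() for t, pt in p_t.items()}

def entity_profile(url_predictions):
    """Average per-URL joints into P(R,T|e), weighting each URL equally (assumed)."""
    profile = defaultdict(float)
    for p_r, p_t in url_predictions:
        for key, p in url_joint(p_r, p_t).items():
            profile[key] += p / len(url_predictions)
    return dict(profile)

def marginal(profile, axis):
    """Marginalize a joint profile: axis=0 gives P(R|e), axis=1 gives P(T|e)."""
    out = defaultdict(float)
    for key, p in profile.items():
        out[key[axis]] += p
    return dict(out)

# Toy site with two visited URLs: (P(R|d), P(T|d)) for each. Made-up values.
site_urls = [
    ({5: 0.8, 11: 0.2}, {"Sports": 0.9, "News": 0.1}),
    ({5: 0.4, 11: 0.6}, {"Sports": 0.5, "News": 0.5}),
]
site_profile = entity_profile(site_urls)
reading_level_profile = marginal(site_profile, 0)
topic_profile = marginal(site_profile, 1)
```

The same function applies to users (URLs visited during search sessions) and queries (top-10 retrieved URLs); only the list of related URLs changes.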
Entity Profile Built with Related Entities
Entity and related entities
  User: websites visited
  Website: surfacing queries
  Query: issuing users
Example: site profile made from the profiles of its visitors
[Figure: users issue queries, queries surface websites, and users visit websites; the visitors' profiles P(R,T|u) are combined into the site profile P(R,T|s)]
Characterizing and Comparing Profiles
Characterizing an individual entity
  Mean: expectation
  Variance: entropy
Characterizing a group of entities
  Build a group centroid from its members
  Variance: divergence among members
Comparing entities and groups
  Difference in mean
  Divergence in profile (distribution)
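The statistics named on this slide can be sketched in a few lines. The slides do not give formulas, so the choices below are assumptions made for illustration: entropy in bits, KL divergence with a small floor on the second argument, a centroid that is the unweighted mean of member profiles, and group divergence measured as the mean KL divergence from each member to the centroid. All function names are hypothetical.

```python
import math

def expectation(p_r):
    """Mean reading level E[R] of a reading-level profile {level: probability}."""
    return sum(level * p for level, p in p_r.items())

def entropy(profile):
    """Entropy in bits; a lower value means a more focused profile."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps (assumed) to avoid division by zero."""
    keys = set(p) | set(q)
    return sum(
        p[k] * math.log2(p[k] / max(q.get(k, 0.0), eps))
        for k in keys
        if p.get(k, 0.0) > 0
    )

def centroid(profiles):
    """Group centroid: unweighted mean of the member profiles (assumed)."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}

def group_divergence(profiles):
    """Mean KL divergence from each member to the group centroid (assumed definition)."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)

# Toy profiles with made-up values.
mean_level = expectation({5: 0.5, 11: 0.5})
focus = entropy({"Sports": 0.5, "News": 0.5})
spread = group_divergence([{"Sports": 1.0}, {"News": 1.0}])
```

With the unweighted centroid, the group divergence above coincides with the generalized Jensen-Shannon divergence of the members, which keeps it bounded even when members share no support.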
Characterizing Web Content, User Interests, and Search Behavior
Data Set
Session log data
  2,281,150 URL visits (1,218,433 SERP clicks)
  Collected from 8,841 users
Profiles of entities
  4,715 websites with 25+ clicked URLs
  7,613 users with 25+ URL visits
  141,325 unique queries
Each topic has a different reading level distribution
Reading Level Distribution for Top ODP Categories

Category        R1    R2    R3    R4    R5    R6    R7    R8    R9    R10   R11   R12   E[R|T]
Reference       0.00  0.00  0.00  0.02  0.17  0.10  0.15  0.04  0.02  0.03  0.20  0.27  8.80
Health          0.00  0.00  0.00  0.03  0.18  0.08  0.13  0.04  0.04  0.10  0.27  0.11  8.53
Science         0.00  0.00  0.00  0.06  0.23  0.09  0.07  0.02  0.01  0.08  0.27  0.17  8.44
Computers       0.00  0.00  0.00  0.06  0.24  0.19  0.03  0.01  0.01  0.02  0.32  0.12  8.11
Business        0.00  0.00  0.00  0.05  0.22  0.16  0.09  0.03  0.02  0.04  0.26  0.12  8.08
Society         0.00  0.00  0.00  0.02  0.23  0.07  0.35  0.03  0.01  0.01  0.22  0.06  7.62
Adult           0.00  0.00  0.00  0.05  0.28  0.26  0.14  0.05  0.02  0.01  0.13  0.06  6.98
Kids and Teens  0.00  0.00  0.02  0.23  0.26  0.13  0.09  0.02  0.01  0.02  0.15  0.08  6.60
Games           0.00  0.00  0.00  0.19  0.36  0.10  0.11  0.02  0.02  0.03  0.12  0.03  6.39
Recreation      0.00  0.00  0.00  0.11  0.44  0.19  0.08  0.02  0.02  0.02  0.09  0.02  6.18
Arts            0.00  0.00  0.00  0.08  0.40  0.27  0.10  0.05  0.01  0.01  0.06  0.02  6.18
Home            0.00  0.00  0.02  0.19  0.41  0.14  0.04  0.03  0.01  0.03  0.09  0.04  6.08
News            0.00  0.00  0.00  0.04  0.41  0.33  0.14  0.02  0.02  0.01  0.03  0.01  5.99
Shopping        0.00  0.00  0.01  0.22  0.29  0.24  0.09  0.03  0.01  0.02  0.07  0.02  5.98
Sports          0.00  0.00  0.00  0.09  0.56  0.11  0.10  0.03  0.03  0.02  0.06  0.02  5.94
Topic and reading level characterize websites in each category
Profile matching predicts users' preferences over search results
Metric
  % of users' preferences predicted by profile matching, for each clicked website over the skipped website above it
Results
  By degree of focus in the user profile: H(R,T|u)
  By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)

User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused      5,960   59.23%      60.79%      65.27%
             147,195   52.25%      54.20%      54.41%
↓ Diverse    197,733   52.75%      53.36%      53.63%
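The metric above can be read as a pairwise test: for every clicked website and the skipped website ranked above it, profile matching counts as correct when the user's profile is closer to the clicked site than to the skipped one. The sketch below illustrates that reading under stated assumptions: the divergence is taken as KL(user || site) in bits with a small floor on the site probabilities, and the names `preference_accuracy`, `pairs`, and the toy profiles are hypothetical, not taken from the slides.

```python
import math

def kl(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps (assumed) to avoid division by zero."""
    return sum(
        pv * math.log2(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0
    )

def preference_accuracy(pairs, user_profiles, site_profiles):
    """Fraction of (user, clicked, skipped) triples where the clicked site
    has the smaller divergence from the user's profile."""
    correct = 0
    for user, clicked, skipped in pairs:
        u = user_profiles[user]
        if kl(u, site_profiles[clicked]) < kl(u, site_profiles[skipped]):
            correct += 1
    return correct / len(pairs)

# Toy topic profiles with made-up values.
user_profiles = {"u1": {"Sports": 0.9, "News": 0.1}}
site_profiles = {
    "site_a": {"Sports": 0.8, "News": 0.2},
    "site_b": {"Sports": 0.1, "News": 0.9},
}
# Each triple: (user, clicked site, skipped site ranked above it).
pairs = [("u1", "site_a", "site_b"), ("u1", "site_b", "site_a")]
accuracy = preference_accuracy(pairs, user_profiles, site_profiles)
```

Passing reading-level profiles, topic profiles, or joint RLT profiles to the same function would give the three columns of the table; splitting users by the entropy of their profile would give the rows.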
Users' Deviation from Their Own Profiles
Stretch reading: session-level reading level >> long-term reading level
Casual reading: session-level reading level << long-term reading level

URL Title Words for Stretch Reading      URL Title Words for Casual Reading
Title word             Log ratio         Title word       Log ratio
tests                  2.22              best             -0.42
test                   1.99              football         -0.45
sample                 1.94              store            -0.46
digital                1.88              great (deals)    -0.47
(tuition) options      1.87              items            -0.52
(financial) aid        1.87              new              -0.53
(medication) effects   1.84              sale             -0.61
education              1.77              games            -0.65
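The slides do not define the log ratio, so the following is only one plausible reading: the natural log of a word's smoothed relative frequency in titles of stretch-reading URLs divided by its smoothed relative frequency in titles of casual-reading URLs. Under that assumption, positive values mark stretch-reading words and negative values mark casual-reading words. The function name, the add-one smoothing, and the toy titles are all hypothetical.

```python
import math
from collections import Counter

def word_log_ratios(stretch_titles, casual_titles, smoothing=1.0):
    """log P(word | stretch titles) - log P(word | casual titles),
    with additive smoothing over the shared vocabulary (assumed definition)."""
    stretch = Counter(w for title in stretch_titles for w in title.lower().split())
    casual = Counter(w for title in casual_titles for w in title.lower().split())
    vocab = set(stretch) | set(casual)
    stretch_total = sum(stretch.values()) + smoothing * len(vocab)
    casual_total = sum(casual.values()) + smoothing * len(vocab)
    return {
        w: math.log((stretch[w] + smoothing) / stretch_total)
        - math.log((casual[w] + smoothing) / casual_total)
        for w in vocab
    }

# Toy titles, invented for illustration only.
stretch_titles = ["practice tests sample", "financial aid options"]
casual_titles = ["football games sale", "best store sale"]
ratios = word_log_ratios(stretch_titles, casual_titles)
# Words sorted from most stretch-like to most casual-like.
ranked = sorted(ratios, key=ratios.get, reverse=True)
```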
Comparing Expert vs. Non-expert URLs
Expert vs. non-expert URLs taken from [White’09]

Predicting Expert vs. Novice Websites
Results
  Baseline (predict most likely class): 65.8%
  Classifier accuracy: 82.2%
Features

Feature         Correl. with Expertness   Description
E[R|Q_s]        +0.34                     Expectation of surfacing queries' RL
E[R|U_s]        +0.44                     Expectation of visitors' RL
Div_RLT(U,s)    -0.56                     Distance of visitors' RLT profile from the site's
Div_T(U,s)      -0.55                     Distance of visitors' Topic profile from the site's
Thank you for your attention!
WHAT WE DID:
  Build profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  To characterize and compare entities
WHAT WE FOUND:
  Profile matching predicts users' content preferences
  Profiles can indicate when not to personalize
  Profile features can predict expert content
More at: @jin4ir / cs.umass.edu/~jykim
Optional Slides
Website reading level vs. visitor diversity
Breakdown per topic reveals a stronger relationship
Correlation between Site vs. Visitor Profiles

                         Visitor Profile Diversity
Website Reading Level    Div_R(U|s)   Div_T(U|s)   Div_RT(U|s)
E[R|s]                   0.052        0.081        0.095
[Figure: per-topic breakdown for Computers, Reference, News, Arts, Recreation, Science, Health, Sports, Society, Business, Adult, Games, Home, Shopping and Kids_and_Teens; axis from -0.3 to 0.4]
Query / User Reading Level against P(Topic)
User profile shows different trends in Computers