Lecture 6: EITN01 Web Intelligence and Information Retrieval
Anders Ardö
EIT – Electrical and Information Technology, Lund University
February 26, 2013
A. Ardö, EIT Lecture 6: EITN01 Web Intelligence and Information Retrieval February 26, 2013 1 / 46
Outline
1 Reiteration
2 Recommender systems
3 Indexing, searching
4 Example IR systems
Previous lecture
- Web Search
- Metasearch engines
- Web crawling
- Browsing vs search
Web Search
- Challenges
  - Distributed, dynamic data
  - Large volume
  - Unstructured, heterogeneous data
- Size, coverage
- General vs focused
- Special functions, user interface
- Ranking
- Limited overlap between search engines
Search Engine - Basic structure
[Figure: basic search engine structure. A web robot fetches web pages from the Web over HTTP and stores them in a database. A web browser sends a query over HTTP to an interface (CGI script), which consults the database and returns the answer. Design concerns: size, efficiency, response time.]
- Software crawling the web (much like a human clicking on links)
- Collect all found web pages into a database (IR system)
- Offer a web interface to that database
Google
- Started in the late 1990s
- Estimated 450,000 low-cost commodity servers (2006)
- 1 trillion links to web pages (July 2008)
- “over 8 billion web pages”
- Estimate: 40 billion pages?
- Goal is to index all the world’s data
- Google Flu Trends
Metasearch engines
- Simultaneously search several individual search engines
- Query translation
- Result merging
  - Simple merge
  - Duplicate detection
  - tf-idf/similarity ranking
  - Position based
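Result merging with duplicate detection and position-based ranking could be sketched as follows. The slide does not say how positions are scored; summed reciprocal rank is an assumption made here for illustration, as are the function names and the example URLs.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Crude duplicate detection: ignore scheme, host case, and trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def merge_results(result_lists):
    """Merge ranked URL lists from several engines by summed reciprocal rank."""
    scores = defaultdict(float)
    first_seen = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            first_seen.setdefault(key, url)   # keep the first spelling of a duplicate
            scores[key] += 1.0 / rank         # assumed position-based score
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    return [first_seen[k] for k in ranked]


# Hypothetical result lists from two engines.
engine_a = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
engine_b = ["https://EXAMPLE.org/b/", "http://example.org/d"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

A page returned by both engines accumulates score from each list, so it tends to rise above pages that only one engine found.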
Web Robot - Basic architecture
Spider, Crawler, Robot, agent, ...
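[Figure: basic web robot architecture. Seed URLs initialize the frontier (list of unvisited pages). The robot gets a URL from the frontier, fetches the web page from the Web, analyzes it, saves the page to the database (repository of visited pages), and feeds the extracted links (URLs) back into the frontier.]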
Focused Crawling
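[Figure: focused crawler architecture. The basic web robot loop (seed URLs, frontier of unvisited pages, get URL, fetch web page, analyze, save to the repository of visited pages, links back to the frontier) extended with a URL focus filter and a focus filter: pages within the focus are saved, pages not in focus are not.]
Focus:
- Domain
- Project
- Country
- Region
- Topic
- Subject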
Basic Algorithm
Add good start pages (seeds) to frontier
LOOP:
    Choose a page among links
    Page OK?
        Save page
        Add all links to frontier
    Go to LOOP

Save (database(s)):
- All relevant pages (search engine database)
- All analyzed pages (seen pages)
- All new links (frontier)
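The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real crawler: the "web" is a hypothetical in-memory dictionary, `fetch` stands in for an HTTP request, and `page_ok` is an assumed keyword test standing in for the "Page OK?" decision.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("information retrieval course",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("cooking recipes",
                          ["http://a.example/", "http://d.example/"]),
    "http://c.example/": ("retrieval of web pages", ["http://d.example/"]),
    "http://d.example/": ("inverted files for retrieval", []),
}


def fetch(url):
    """Stand-in for an HTTP fetch; returns (text, links) or None."""
    return FAKE_WEB.get(url)


def page_ok(text):
    """Stand-in for the 'Page OK?' test (assumed keyword check)."""
    return "retrieval" in text


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # all new links (unvisited pages)
    seen = set(seeds)         # all pages already queued or analyzed
    saved = {}                # all relevant pages (search engine database)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()          # choose a page among links
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        if not page_ok(text):             # page not OK: neither saved nor expanded
            continue
        saved[url] = text                 # save page
        for link in links:                # add all links to frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved


saved_pages = crawl(["http://a.example/"])
print(list(saved_pages))
```

The `seen` set is what keeps the crawler from fetching the same page twice when pages link back to each other.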
Browsing
- No idea how to formulate a query
- Willing to invest some time
- Structure: flat vs hierarchy
  - Manual vs automatic classification
  - Lack of standard classification/terminology
- Everything vs quality assessed
  - Precision - NOT recall
Browsing vs search
Search
- LOTS of data
- Unstructured
- Unrelated items clutter results

Browsing
- Small amounts of data
- Hierarchically structured
- Quality assessed
Lecture 6 agenda
- Chapter 9 in “Modern Information Retrieval”
- G. Adomavicius, A. Tuzhilin: “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”; Sections 1 - 2
1 Reiteration
2 Recommender systems
3 Indexing, searching
4 Example IR systems
Recommender systems
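[Figure: recommender system overview. Media content (text, image, audio, video) and the user (profiles, preferences, usage history) are each mapped to a representation. The recommender system combines the media representation, the user representation, and the context to produce recommendations.]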
Recommender systems
Make machines understand
- media
  - annotation - metadata
- context
  - ?
- user
  - usage history
  - profiles
Recommender systems
- Content based filtering
  - based on items similar to what the user has liked in the past
- Collaborative filtering
  - based on opinions of other users (user/item matrix)
  - (user-user similarity, item-item similarity)
  - find like-minded users (neighborhood)
  - predictions for unseen items
- Hybrid systems
Recommender systems
[Figure: a sparse USERS × ITEMS matrix of known ratings (entries r) is the input to a recommendation algorithm (collaborative, content-based, ...), which outputs predicted ratings (entries p) and, from them, recommendations.]
Content based filtering
- Try to predict a rating based on my own ratings
- Represent items as a set of features: $item_j = (w_{1j}, \ldots, w_{kj})$
- Users rank items → user profile in feature space: $user_c = (w_{c1}, \ldots, w_{ck})$
- Vector space! (feature/item matrix, tf-idf, similarity (cosine, Pearson), ...)
- User profile used as query
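Using the user profile as a query in feature space might look like the sketch below. How the profile is built is not specified on the slide; a rating-weighted sum of item feature vectors is assumed here, and the items, features, and ratings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_profile(items, ratings):
    """Assumed profile: rating-weighted sum of the rated items' feature vectors."""
    k = len(next(iter(items.values())))
    profile = [0.0] * k
    for item_id, rating in ratings.items():
        for i, w in enumerate(items[item_id]):
            profile[i] += rating * w
    return profile


def recommend(items, ratings):
    """Rank unseen items by similarity to the user profile (profile as query)."""
    profile = build_profile(items, ratings)
    unseen = [i for i in items if i not in ratings]
    return sorted(unseen, key=lambda i: cosine(profile, items[i]), reverse=True)


# Hypothetical items described by three features (action, comedy, romance).
items = {
    "film1": [1.0, 0.0, 0.0],
    "film2": [0.0, 1.0, 0.0],
    "film3": [0.9, 0.1, 0.0],
    "film4": [0.0, 0.2, 0.8],
}
ratings = {"film1": 5, "film2": 1}
ranking = recommend(items, ratings)
print(ranking)
```

Because the user rated the action film highly, the mostly-action `film3` ranks ahead of the mostly-romance `film4`.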
Collaborative filtering
- Try to predict rating based on other users' ratings
- Memory based
  - Make rating based on entire collection
  - Ex user-user: $rating_{c,s} = k \sum_{c' \in C} sim(c, c') \cdot rating_{c',s}$
    - User $c$, Item $s$
    - $C$: set of users most similar to $c$
    - $k$: normalizing factor (usually $1 / \sum_{c' \in C} |sim(c, c')|$)
  - Ex item-item: $rating_{c,s} = k \sum_{s' \in S} sim(s, s') \cdot rating_{c,s'}$
    - User $c$, Item $s$
    - $S$: set of items most similar to $s$
    - $k$: normalizing factor (usually $1 / \sum_{s' \in S} |sim(s, s')|$)
- Model based
  - Try to learn a model to be used for predicting ratings
  - Ex: Probabilistic model, Machine learning, ...
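The memory-based user-user formula can be sketched as below. The users, items, and ratings are hypothetical, cosine similarity over rating vectors is one assumed choice of sim, and the neighborhood size is an arbitrary parameter.

```python
import math


def cosine_sim(ra, rb):
    """Cosine similarity of two users' rating dicts; unrated items count as 0."""
    dot = sum(r * rb[i] for i, r in ra.items() if i in rb)
    na = math.sqrt(sum(r * r for r in ra.values()))
    nb = math.sqrt(sum(r * r for r in rb.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(ratings, c, s, n_neighbors=2):
    """rating(c, s) = k * sum over neighbors c' of sim(c, c') * rating(c', s)."""
    # Candidate neighbors: other users who have rated item s.
    sims = [(cosine_sim(ratings[c], ratings[o]), o)
            for o in ratings if o != c and s in ratings[o]]
    neighbors = sorted(sims, reverse=True)[:n_neighbors]   # set C
    norm = sum(abs(sim) for sim, _ in neighbors)           # 1 / k
    if norm == 0:
        return None
    return sum(sim * ratings[o][s] for sim, o in neighbors) / norm


# Hypothetical ratings (1-5).
ratings = {
    "ann": {"i1": 5, "i2": 4},
    "bob": {"i1": 5, "i2": 4, "i3": 2},
    "cat": {"i1": 1, "i2": 2, "i3": 5},
}
p = predict(ratings, "ann", "i3")
print(round(p, 2))
```

Since ann's ratings agree far more with bob's than with cat's, the prediction for `i3` lands closer to bob's rating of 2 than to cat's rating of 5.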
Collaborative filtering – Item-Item – I
The Movie – Users matrix: users' ratings (1-5) of movies

Movie | Ratings by users 1-12 (unrated cells omitted)
m1    | 1 3 5 5 4
m2    | 5 4 4 2 1 3
m3    | 2 4 1 2 3 4 3 5
m4    | 2 4 5 4 2
m5    | 4 3 4 2 2 5
m6    | 1 3 3 2 4
Collaborative filtering – Item-Item – II
Estimate User 5's rating of movie m1?

Movie | Ratings by users 1-12 (unrated cells omitted; ?? = User 5)
m1    | 1 3 ?? 5 5 4
m2    | 5 4 4 2 1 3
m3    | 2 4 1 2 3 4 3 5
m4    | 2 4 5 4 2
m5    | 4 3 4 2 2 5
m6    | 1 3 3 2 4
Collaborative filtering – Item-Item – III
Estimate User 5's rating of movie m1?
Neighbor selection – movies most similar to m1 → m3, m6, m5

Movie | Ratings by users 1-12 (unrated cells omitted; ?? = User 5) | sim(m1,mx)
m1    | 1 3 ?? 5 5 4                                               | 1.0
m2    | 5 4 4 2 1 3                                                | 0.26
m3    | 2 4 1 2 3 4 3 5                                            | 0.52
m4    | 2 4 5 4 2                                                  | 0.28
m5    | 4 3 4 2 2 5                                                | 0.40
m6    | 1 3 3 2 4                                                  | 0.48
Collaborative filtering – Item-Item – IV
Estimate User 5's rating of movie m1?
Neighbor selection – movies most similar to m1 → m3, m6, m5
Predict rating $r_{m1,5}$ as

$$r_{m1,5} = \frac{sim(m1,m3) \cdot r_{m3,5} + sim(m1,m6) \cdot r_{m6,5} + sim(m1,m5) \cdot r_{m5,5}}{sim(m1,m3) + sim(m1,m6) + sim(m1,m5)}$$

$$r_{m1,5} = \frac{0.52 \cdot 2 + 0.48 \cdot 3 + 0.40 \cdot 4}{0.52 + 0.48 + 0.40} = 2.9$$

Movie | Ratings by users 1-12 (unrated cells omitted; 2.9 = prediction for User 5) | sim(m1,mx)
m1    | 1 3 2.9 5 5 4                                                              | 1.0
m2    | 5 4 4 2 1 3                                                                | 0.26
m3    | 2 4 1 2 3 4 3 5                                                            | 0.52
m4    | 2 4 5 4 2                                                                  | 0.28
m5    | 4 3 4 2 2 5                                                                | 0.40
m6    | 1 3 3 2 4                                                                  | 0.48
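The arithmetic of this prediction can be checked with a short sketch. The similarities (0.52, 0.48, 0.40) and User 5's ratings of m3, m6, and m5 (2, 3, 4) are taken from the slide; the function name is only illustrative.

```python
def weighted_average(neighbors):
    """neighbors: (similarity, rating) pairs for the selected similar items."""
    norm = sum(abs(sim) for sim, _ in neighbors)
    return sum(sim * rating for sim, rating in neighbors) / norm


# (sim(m1, mx), User 5's rating of mx) for mx = m3, m6, m5.
neighbors = [(0.52, 2), (0.48, 3), (0.40, 4)]
prediction = weighted_average(neighbors)
print(round(prediction, 1))  # 2.9
```

The numerator is 1.04 + 1.44 + 1.60 = 4.08 and the denominator is 1.40, giving about 2.91.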
Hybrid systems
- Content based filtering + Collaborative filtering
  - Combining separate recommenders
  - Adding content based characteristics to collaborative filtering
  - Adding collaborative characteristics to content based filtering
Examples
Amazon, Course Recommender
Outline
1 Reiteration
2 Recommender systems
3 Indexing, searching
4 Example IR systems
Introduction
- Sequential search
  - Small databases
  - Volatile data
- Indexes
  - Large databases
  - Semi-static data
  - Inverted files
How to represent indexed documents?
[Figure: document indexing pipeline. Documents → assign document IDs (text with document numbers and *field numbers) → break into words (lexical analysis; words) → stoplist (non-stoplist words) → stemming* (stemmed words) → term weighting* (terms with weights) → index / database. * indicates an optional operation.]
Inverted files
- Principal data structure
  - Effective
  - Allows fast searching
  - Substantial storage overhead
    - Speed more important than storage
- For each term
  - List of document IDs
  - (Term frequency in each document)
  - (Position in document)
- Used for
  - Boolean searches
  - Vector space ranking
  - Proximity, phrases
Inverted files
docs t1 t2 t3
D1 1 0 1
D2 1 0 0
D3 0 1 1
D4 1 0 0
D5 1 1 1
D6 1 1 0
D7 0 1 0
D8 0 1 0
D9 0 0 1
D10 0 1 1
Terms D1 D2 D3 D4 D5 D6 D7 …
t1 1 1 0 1 1 1 0
t2 0 0 1 0 1 1 1
t3 1 0 1 0 1 0 0
(From J. W. Schneider: “Informetrics & Scientometrics”)
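The step from the document-term table to the term-document table is a transposition, which can be sketched as below. The 0/1 values are copied from the table above; the function and variable names are illustrative only.

```python
# Document-term incidence matrix from the slide: rows D1..D10, columns t1..t3.
TERMS = ["t1", "t2", "t3"]
DOC_TERM = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}


def invert(doc_term, terms):
    """Turn 'document -> term flags' into 'term -> list of documents'."""
    inverted = {t: [] for t in terms}
    for doc, row in doc_term.items():
        for term, present in zip(terms, row):
            if present:
                inverted[term].append(doc)
    return inverted


inverted_file = invert(DOC_TERM, TERMS)
for term, docs in inverted_file.items():
    print(term, docs)
```

Storing only the documents that contain each term, rather than a full row of zeros and ones, is what makes the inverted file compact for sparse data.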
Inverted files
(From R. Baeza-Yates, B. Ribeiro-Neto: “Modern Information Retrieval”, 2nd Ed, 2010)
Creation of inverted files
- For each term in the dictionary
  - store IDs of documents containing that word
- Lexical analysis ⇒ terms
- Save terms with document ID
- Sort alphabetically ⇒ dictionary
- (Calculate tf and idf)
- Create posting list (list of document IDs per term)
Example
Document A: Now is the time for men to come to the aid of their country.

Document B: It was a dark night in the country. The time was past midnight.

Dictionary (terms in order of occurrence):
Term      DocID
time      A
men       A
aid       A
country   A
dark      B
night     B
country   B
time      B
midnight  B

Dictionary (sorted alphabetically):
Term      DocID
aid       A
country   A
country   B
dark      B
men       A
midnight  B
night     B
time      A
time      B
Example cont’d
Inverted file:
Dictionary        Postings
Term      Docs    ID ID ...
aid       1       A
country   2       A B
dark      1       B
men       1       A
midnight  1       B
night     1       B
time      2       A B
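The creation steps (lexical analysis, save terms with document ID, sort, create posting lists) can be sketched for Documents A and B as follows. The stoplist is an assumption: it simply contains the words that the slide's example leaves out of the dictionary. No stemming or weighting is done.

```python
import re
from collections import defaultdict

# Assumed stoplist: the words the example above drops from the dictionary.
STOPLIST = {"now", "is", "the", "for", "to", "come", "of", "their",
            "it", "was", "a", "in", "past"}

DOCS = {
    "A": "Now is the time for men to come to the aid of their country.",
    "B": "It was a dark night in the country. The time was past midnight.",
}


def lexical_analysis(text):
    """Break text into lowercase words and drop stoplist words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPLIST]


def build_inverted_file(docs):
    # Save terms with document ID, then sort alphabetically.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in lexical_analysis(text)]
    pairs.sort()
    # Create posting lists: one list of document IDs per term.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if doc_id not in postings[term]:
            postings[term].append(doc_id)
    return dict(postings)


inverted = build_inverted_file(DOCS)
for term, ids in inverted.items():
    print(f"{term:<10}{len(ids)}  {' '.join(ids)}")
```

The printed rows correspond to the Term, Docs, and ID columns of the inverted file in the example.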
Example cont’d
Query: time AND dark
- time ⇒ posting list P1 = {A, B}
- dark ⇒ posting list P2 = {B}
- P1 ∩ P2 = {A, B} ∩ {B} = {B}
- Result: Document B
- (Do ranking)
Example cont’d
Query: time OR dark
- time ⇒ posting list P1 = {A, B}
- dark ⇒ posting list P2 = {B}
- P1 ∪ P2 = {A, B} ∪ {B} = {A, B}
- Result: Documents A, B
- (Do ranking)
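The AND and OR queries of this example can be evaluated on posting lists as in the sketch below. Keeping posting lists sorted and intersecting them with a linear merge is a common approach, assumed here rather than taken from the slides; the posting lists themselves come from the example.

```python
def intersect(p1, p2):
    """AND: linear merge of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: all document IDs from either list, sorted, without duplicates."""
    return sorted(set(p1) | set(p2))


# Posting lists from the example.
POSTINGS = {"time": ["A", "B"], "dark": ["B"]}

and_result = intersect(POSTINGS["time"], POSTINGS["dark"])
or_result = union(POSTINGS["time"], POSTINGS["dark"])
print(and_result, or_result)
```

The merge never looks at an entry twice, so the cost of an AND query grows with the combined length of the two posting lists.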
Phrase search
D0 = "it is what it is", D1 = "what is it", D2 = "it is a banana"

Inverted index (document IDs):
"a": [2]
"banana": [2]
"is": [0], [1], [2]
"it": [0], [1], [2]
"what": [0], [1]

Q: “what is it”? ([0], [1]) ∩ ([0], [1], [2]) ∩ ([0], [1], [2]) = ([0], [1])

As a phrase? Inverted index with word positions (document ID, [positions]):
"a": [2, [2]]
"banana": [2, [3]]
"is": [0, [1,4]], [1, [1]], [2, [1]]
"it": [0, [0,3]], [1, [2]], [2, [0]]
"what": [0, [2]], [1, [0]]
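A phrase query over the positional index from the phrase search example (D0, D1, D2) might be evaluated as in this sketch. The index content is copied from that slide; storing it as nested dictionaries and the function name are illustrative choices.

```python
# Positional index from the slide: term -> {document ID: [positions]}.
INDEX = {
    "a": {2: [2]},
    "banana": {2: [3]},
    "is": {0: [1, 4], 1: [1], 2: [1]},
    "it": {0: [0, 3], 1: [2], 2: [0]},
    "what": {0: [2], 1: [0]},
}


def phrase_search(index, phrase):
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents contain every term (the plain AND query).
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])
    hits = []
    for doc in sorted(docs):
        # Phrase match: term number k must occur at position start + k.
        for start in index[terms[0]][doc]:
            if all(start + k in index[t][doc] for k, t in enumerate(terms)):
                hits.append(doc)
                break
    return hits


print(phrase_search(INDEX, "what is it"))  # [1]
```

The AND query returns D0 and D1, but only D1 contains the three words in consecutive positions: in D0, "what" is at position 2 while "is" occurs at 1 and 4, not at 3.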
Outline
1 Reiteration
2 Recommender systems
3 Indexing, searching
4 Example IR systems
Zebra
- IndexData: http://www.indexdata.dk/zebra/
- high-performance, general-purpose structured text indexing and retrieval engine
- free, GPL license
- index records in XML, SGML, MARC, e-mail archives, ...
- combination of Boolean searching and relevance ranking (tf-idf)
- supports SRU/CQL, Z39.50, ZOOM
Zebra - XML indexing
Zebra features
- supports large databases
  - tens of millions of records
  - tens of gigabytes of data
- regular expression queries
- fuzzy queries (spelling correction)
- index scans
- faceted browsing
- sorting
Lucene/Solr
- Apache: http://lucene.apache.org/
- Lucene: high-performance, full-featured text search engine library
- Solr: enterprise search server based on Lucene
- free, open source
- index records via XML over HTTP
- query via HTTP GET, XML results
Lucene/Solr features
- scalability - efficient replication
- highlighted context snippets
- faceted searching
- spelling suggestions for user queries
- ’More Like This’ suggestions for given document
- q=video&fl=name,id,score
- q=video&sort=inStock asc, score desc
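The two query strings above are sent as HTTP GET parameters, so characters such as commas and spaces need URL encoding. The sketch below only builds the request URLs and sends nothing; the base URL (host, port, and the core name `demo`) is a hypothetical placeholder that depends on the installation.

```python
from urllib.parse import urlencode

# Hypothetical base URL; host, port, and core name depend on the installation.
SOLR_SELECT = "http://localhost:8983/solr/demo/select"


def solr_url(**params):
    """Build a GET URL with properly encoded query parameters."""
    return SOLR_SELECT + "?" + urlencode(params)


# q: query terms, fl: fields to return, sort: sort order (from the slide).
url1 = solr_url(q="video", fl="name,id,score")
url2 = solr_url(q="video", sort="inStock asc, score desc")
print(url1)
print(url2)
```

Either URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, against a running server.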