Quality of a search engine
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Reading 8
Is it good?
How fast does it index? Number of documents/hour (for an average document size)
How fast does it search? Latency as a function of index size
Expressiveness of the query language
Measures for a search engine
All of the preceding criteria are measurable
The key measure: user happiness…useless answers won’t make a user happy
Happiness: elusive to measure
The commonest approach measures the relevance of search results. How do we measure it?
Measuring relevance requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system
Standard benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant
On the Web everything is more complicated, since we cannot mark the entire corpus!
General scenario
[Venn diagram: the Retrieved set and the Relevant set overlap within the whole collection]
Precision: % of retrieved docs that are relevant [issue: how much "junk" is found]
Precision vs. Recall
[Venn diagram: the Retrieved set and the Relevant set within the collection]
Recall: % of relevant docs that are retrieved [issue: how much "info" is found]
How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
                 Relevant              Not Relevant
Retrieved        tp (true positive)    fp (false positive)
Not Retrieved    fn (false negative)   tn (true negative)
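The two formulas above can be sketched in a few lines of Python (the counts are hypothetical, just for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = tp/(tp+fp), Recall = tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts for one query:
p, r = precision_recall(tp=40, fp=10, fn=60)
print(p, r)  # 0.8 0.4
```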
Some considerations
Can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
Precision usually decreases
Precision-Recall curve
We measure Precision at various levels of Recall. Note: it is an AVERAGE over many queries.
[Plot: precision (y-axis) vs. recall (x-axis), with one measured point per recall level]
A common picture
[Plot: the typical precision-recall curve; precision decreases as recall increases]
F measure
Combined measure (weighted harmonic mean):
1/F = α (1/P) + (1 − α) (1/R)
People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), that is F1 = 2PR/(P + R)
Use this if you need to optimize a single measure that balances precision and recall.
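A minimal sketch of the weighted harmonic mean, checking that the balanced case (α = ½) coincides with 2PR/(P+R):

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean: 1/F = alpha*(1/P) + (1-alpha)*(1/R)."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

p, r = 0.8, 0.4               # hypothetical precision and recall
f1 = f_measure(p, r)          # balanced F1
print(f1, 2 * p * r / (p + r))  # the two values agree
```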
Recommendation systems
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Recommendations
We have a list of restaurants, with ratings given by some users for some of them
Which restaurant(s) should I recommend to Dave?
[Table: Yes/No ratings by Alice, Bob, Cindy, Dave, Estie, Fred over the restaurants Brahma Bull, Spaghetti House, Mango, Il Fornaio, Zao, Ming's, Ramona's, Straits, Homma's; many cells are empty]
Basic Algorithm
Recommend the most popular restaurants, say by # positive votes minus # negative votes
What if Dave does not like Spaghetti?
[Table: the same ratings encoded as +1/−1 over the restaurants above]
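A minimal sketch of this popularity count (the ratings dictionary is made up; only a few illustrative votes are shown):

```python
from collections import Counter

# Hypothetical +1/-1 votes per user, as in the slide's encoding.
ratings = {
    "Alice": {"Brahma Bull": 1, "Mango": -1},
    "Bob":   {"Brahma Bull": 1, "Zao": -1},
    "Estie": {"Straits": 1, "Mango": 1},
}

# Score each restaurant by (# positive votes) - (# negative votes).
scores = Counter()
for user_votes in ratings.values():
    for restaurant, vote in user_votes.items():
        scores[restaurant] += vote

print(scores.most_common(1))  # [('Brahma Bull', 2)]
```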
Smart Algorithm
Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes.
Perhaps recommend Straits Cafe to Dave
Do you want to rely on one person’s opinions?
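The "smart" algorithm above can be sketched as follows; the vote vectors are hypothetical, kept just large enough to make Estie come out as Dave's nearest neighbor:

```python
import math

# Sparse +1/-1 vote vectors per user (made-up data for illustration).
users = {
    "Dave":  {"Brahma Bull": -1, "Zao": 1, "Ming's": 1},
    "Estie": {"Brahma Bull": -1, "Zao": 1, "Straits": 1},
    "Fred":  {"Brahma Bull": 1, "Zao": -1},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Find the user most similar to Dave...
target = users["Dave"]
best = max((name for name in users if name != "Dave"),
           key=lambda name: cosine(target, users[name]))

# ...and recommend something that user likes and Dave has not rated.
suggestion = [r for r, v in users[best].items() if v > 0 and r not in target]
print(best, suggestion)  # Estie ['Straits']
```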
Main idea
[Figure: bipartite graph linking users U, V, W, Y to items d1-d7]
What do we suggest to U ?
A glimpse on XML retrieval (eXtensible Markup Language)
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Reading 10
XML vs HTML
HTML is a markup language for a specific purpose (display in browsers); XML is a framework for defining markup languages
HTML has fixed markup tags, XML does not
HTML can be formalized as an XML language (XHTML)
XML Example (visual)
XML Example (textual)
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the
    <tm>FileCab</tm>inet application.
  </para>
</chapter>
Basic Structure
An XML doc is an ordered, labeled tree
character data: leaf nodes contain the actual data (text strings)
element nodes: each labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes
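Parsing the FileCab example above with Python's standard library makes this tree structure visible (a sketch, nothing beyond stdlib is assumed):

```python
import xml.etree.ElementTree as ET

# The chapter example from the previous slide, as a string.
doc = """<chapter id="cmds"><chaptitle>FileCab</chaptitle>
<para>This chapter describes the commands that manage the
<tm>FileCab</tm>inet application.</para></chapter>"""

root = ET.fromstring(doc)
print(root.tag, root.attrib)   # chapter {'id': 'cmds'}

# Child element nodes, in document order; text lives in the leaves.
for child in root:
    print(" ", child.tag, (child.text or "").strip()[:20])
```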
XML: Design Goals
Separate syntax from semantics to provide a framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the standard of (semi)structured information (do some of the work now done by databases)
Why Use XML?
Represent semi-structured data
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free
Data vs. Text-centric XML
Data-centric XML: used for messaging between enterprise applications. Mainly a recasting of relational data
Text-centric XML: used for annotating content. Rich in text; demands good integration of text-retrieval functionality. E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
IR Challenges in XML
There is no document unit in XML: how do we compute tf and idf? What indexing granularity?
Need to go to the document for retrieving or displaying a fragment. E.g., give me the Abstracts of Papers on existentialism
Need to identify similar elements in different schemas. Example: employee
XQuery: SQL for XML?
Simple attribute/value: /play/title contains "hamlet"
Path queries: title contains "hamlet"; /play//title contains "hamlet"
Complex graphs: Employees with two managers
What about relevance ranking?
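The two path queries above can be sketched with the limited XPath support in Python's ElementTree (the play/title document is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Toy document: a play with a title, and an act with its own title.
doc = """<play><title>Hamlet</title>
<act><title>Act I</title></act></play>"""
play = ET.fromstring(doc)

# /play/title : title as a direct child of play
direct = [t.text for t in play.findall("title")]

# /play//title : title anywhere below play
anywhere = [t.text for t in play.findall(".//title")]

print(direct)    # ['Hamlet']
print(anywhere)  # ['Hamlet', 'Act I']
```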
Data structures for XML retrieval
Inverted index: give me all elements matching text query Q. We know how to do this: treat each element as a document
Give me all elements below any instance of the Book element (the parent/child relationship is not enough)
Positional containment
Doc 1: the Play element spans positions <27,1122> and <2033,5790>, the Verse element spans <431,867>, and the term "droppeth" occurs at position 720.
Since 27 < 431 < 720 < 867 < 1122: droppeth under Verse under Play.
Containment can be viewed as merging postings.
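A sketch of positional containment as postings merging: element postings are (start, end) spans, term postings are positions; the numbers follow the slide's example. (A real index would merge the sorted lists in one linear scan; the quadratic check below is just for clarity.)

```python
# Postings for Doc 1, as in the slide's example.
play_spans  = [(27, 1122), (2033, 5790)]
verse_spans = [(431, 867)]
droppeth    = [720]

def contained(positions, spans):
    """Keep the term positions that fall inside some element span."""
    return [p for p in positions if any(s <= p <= e for s, e in spans)]

# droppeth under Verse:
hits = contained(droppeth, verse_spans)

# ...and those Verse spans that lie inside some Play span:
verse_in_play = [(s, e) for s, e in verse_spans
                 if any(ps <= s and e <= pe for ps, pe in play_spans)]

print(hits, verse_in_play)  # [720] [(431, 867)]
```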
Summary of data structures
Path containment etc. can essentially be solved by positional inverted indexes
Retrieval consists of “merging” postings
All the compression tricks are still applicable
Complications arise from insertion/deletion of elements and of text within elements. Beyond the scope of this course
Search Engines
Advertising
Classic approach: socio-demographic, geographic, contextual targeting
Search Engines vs Advertisement
First generation -- use only on-page, web-text data: word frequency and language
Second generation -- use off-page, web-graph data: link (or connectivity) analysis, anchor text (how people refer to a page)
Third generation -- answer "the need behind the query": focus on the "user need" rather than on the query, integrate multiple data sources, click-through data
Pure search vs Paid search
Ads shown on search results, ranked by who pays more: Goto/Overture
2003: Google/Yahoo introduce a new model
All players now have: a search engine, an advertising platform, and a network
The new scenario
SEs make possible the aggregation of interests and unlimited selection (Amazon, Netflix, ...)
Incentives for specialized niche players
The biggest money is in the smallest sales!
Two new approaches
Sponsored search: Ads driven by search keywords (and the profile of the user issuing them): AdWords
Context match: Ads driven by the content of a web page (and the profile of the user reaching that page): AdSense
How does it work ?
1) Match Ads to the query or page content (IR)
2) Order the Ads
3) Pricing on a click-through (Econ)
Web usage data: visited pages, clicked banners, web searches, clicks on search results
Dictionary problem
A new game
For advertisers: what words to buy, how much to pay. SPAM is an economic activity
For search-engine owners: how to price the words, find the right Ad. Keyword suggestion, geo-coding, business control, language restriction, proper Ad display
Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, and ranking depends on clicks