Quality of a search engine
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Reading 8
![Page 2: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/2.jpg)
Is it good?
How fast does it index?
  Number of documents/hour (average document size)
How fast does it search?
  Latency as a function of index size
Expressiveness of the query language
![Page 3: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/3.jpg)
Measures for a search engine
All of the preceding criteria are measurable
The key measure is user happiness: useless answers won’t make a user happy
![Page 4: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/4.jpg)
Happiness: elusive to measure
The most common approach is to measure the relevance of search results. How do we measure it?
Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
![Page 5: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/5.jpg)
Evaluating an IR system
Standard benchmarks
  TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
  Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant
On the Web everything is more complicated, since we cannot mark the entire corpus!!
![Page 6: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/6.jpg)
General scenario
[Figure: the Relevant and Retrieved document sets within the collection]
![Page 7: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/7.jpg)
Precision vs. Recall
Precision: % of retrieved docs that are relevant [issue: “junk” found]
Recall: % of relevant docs that are retrieved [issue: “info” found]
[Figure: the Relevant and Retrieved document sets within the collection]
![Page 8: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/8.jpg)
How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                 Relevant               Not Relevant
Retrieved        tp (true positive)     fp (false positive)
Not Retrieved    fn (false negative)    tn (true negative)
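A minimal sketch of the two formulas, assuming the retrieved and relevant docs are available as sets of doc ids. The function name `precision_recall` and the doc ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (P, R) with P = tp/(tp + fp) and R = tp/(tp + fn)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant ("junk")
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 docs retrieved, 3 docs relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved, relevant)
print(f"P = {p:.2f}, R = {r:.2f}")
```

Note that tn never enters either formula: the (usually huge) set of irrelevant docs that were not retrieved does not affect P or R.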
![Page 9: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/9.jpg)
Some considerations
Can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
Precision usually decreases as more docs are retrieved
![Page 10: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/10.jpg)
Precision-Recall curve
We measure Precision at various levels of Recall
Note: it is an AVERAGE over many queries
[Plot: precision (y-axis) vs. recall (x-axis), with sample points marked x]
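A rough sketch of how the points of such a curve could be obtained for a single query, assuming the engine returns a ranked list and we know the relevant set. The ranking and doc ids are hypothetical; a real curve would average these points over many queries.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) after each of the top-k results, k = 1..n."""
    relevant = set(relevant)
    points = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


# Hypothetical ranked answer for one query and its relevant docs.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2", "d5"}
curve = precision_recall_points(ranking, relevant)
for recall, precision in curve:
    print(f"recall = {recall:.2f}  precision = {precision:.2f}")
```

Recall never drops as k grows, while precision falls every time a non-relevant doc is added.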
![Page 11: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/11.jpg)
A common picture
[Plot: precision (y-axis) vs. recall (x-axis), with sample points marked x]
![Page 12: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/12.jpg)
F measure
Combined measure (weighted harmonic mean):
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R)
Use this if you need to optimize a single measure that balances precision and recall.
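A small sketch of the weighted harmonic mean, assuming P and R have already been computed. The function name `f_measure` and the sample values are illustrative only.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha * (1/P) + (1 - alpha) * (1/R)."""
    if precision == 0 or recall == 0:
        return 0.0  # the harmonic mean collapses if either value is zero
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


# Balanced F1 (alpha = 1/2) for hypothetical values P = 0.5, R = 2/3.
f1 = f_measure(0.5, 2 / 3)
print(f"F1 = {f1:.4f}")
```

With α = ½ this is the same as 2PR/(P + R). The harmonic mean is dominated by the smaller of the two values, so retrieving everything (R = 1, P tiny) still gives a low F.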
![Page 13: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/13.jpg)
Recommendation systems
![Page 14: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/14.jpg)
Recommendations
We have a list of restaurants, with Yes/No ratings for some of them
Which restaurant(s) should I recommend to Dave?
Brahma Bull Spaghetti House Mango Il Fornaio Zao Ming's Ramona's Straits Homma's
Alice: Yes No Yes No
Bob: Yes No No
Cindy: Yes No No
Dave: No No Yes Yes Yes
Estie: No Yes Yes Yes
Fred: No No
![Page 15: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/15.jpg)
Basic Algorithm
Recommend the most popular restaurants, say by # positive votes minus # negative votes
What if Dave does not like Spaghetti?
Brahma Bull Spaghetti House Mango Il Fornaio Zao Ming's Ramona's Straits Homma's
Alice: 1 -1 1 -1
Bob: 1 -1 -1
Cindy: 1 -1 -1
Dave: -1 -1 1 1 1
Estie: -1 1 1 1
Fred: -1 -1
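The popularity score can be sketched as follows. The transcript does not preserve which restaurant each vote refers to, so the placement of the +1/-1 votes below is an assumption made up for illustration; only the number of votes per person follows the slide.

```python
from collections import defaultdict

# Hypothetical placement of the votes (+1 = Yes, -1 = No).
ratings = {
    "Alice": {"Brahma Bull": 1, "Spaghetti House": -1, "Mango": 1, "Il Fornaio": -1},
    "Bob":   {"Brahma Bull": 1, "Mango": -1, "Straits": -1},
    "Cindy": {"Spaghetti House": 1, "Il Fornaio": -1, "Ramona's": -1},
    "Dave":  {"Brahma Bull": -1, "Spaghetti House": -1, "Mango": 1,
              "Il Fornaio": 1, "Ramona's": 1},
    "Estie": {"Brahma Bull": -1, "Mango": 1, "Il Fornaio": 1, "Straits": 1},
    "Fred":  {"Ramona's": -1, "Homma's": -1},
}


def popularity(ratings):
    """Score each restaurant as (# positive votes) minus (# negative votes)."""
    scores = defaultdict(int)
    for votes in ratings.values():
        for restaurant, vote in votes.items():
            scores[restaurant] += vote
    return dict(scores)


def recommend_popular(ratings, user):
    """Restaurants the user has not rated yet, most popular first."""
    scores = popularity(ratings)
    unrated = [r for r in scores if r not in ratings[user]]
    return sorted(unrated, key=lambda r: scores[r], reverse=True)


scores = popularity(ratings)
print(scores)
print(recommend_popular(ratings, "Dave"))
```

The score ignores who is asking: every user gets the same ranking, which is exactly the weakness the Spaghetti question points at.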
![Page 16: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/16.jpg)
Smart Algorithm
Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes.
Perhaps recommend Straits Cafe to Dave
Brahma Bull Spaghetti House Mango Il Fornaio Zao Ming's Ramona's Straits Homma's
Alice: 1 -1 1 -1
Bob: 1 -1 -1
Cindy: 1 -1 -1
Dave: -1 -1 1 1 1
Estie: -1 1 1 1
Fred: -1 -1
Do you want to rely on one person’s opinions?
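A sketch of the nearest-neighbour idea using cosine similarity. As before, the placement of the votes is a hypothetical reconstruction (the transcript lost the table columns), and an unrated restaurant is treated as 0, which is an assumption of this sketch.

```python
import math

# Hypothetical placement of the votes (+1 = Yes, -1 = No); unrated = 0.
ratings = {
    "Alice": {"Brahma Bull": 1, "Spaghetti House": -1, "Mango": 1, "Il Fornaio": -1},
    "Bob":   {"Brahma Bull": 1, "Mango": -1, "Straits": -1},
    "Cindy": {"Spaghetti House": 1, "Il Fornaio": -1, "Ramona's": -1},
    "Dave":  {"Brahma Bull": -1, "Spaghetti House": -1, "Mango": 1,
              "Il Fornaio": 1, "Ramona's": 1},
    "Estie": {"Brahma Bull": -1, "Mango": 1, "Il Fornaio": 1, "Straits": 1},
    "Fred":  {"Ramona's": -1, "Homma's": -1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(u[r] * v[r] for r in u if r in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(ratings, user):
    """The other user whose rating vector has the highest cosine with user's."""
    others = (name for name in ratings if name != user)
    return max(others, key=lambda name: cosine(ratings[user], ratings[name]))


def recommend_from_neighbor(ratings, user):
    """Restaurants the most similar user likes and that user has not rated."""
    neighbor = most_similar(ratings, user)
    return [r for r, vote in ratings[neighbor].items()
            if vote > 0 and r not in ratings[user]]


print(most_similar(ratings, "Dave"))
print(recommend_from_neighbor(ratings, "Dave"))
```

Relying on the single nearest neighbour is brittle; a natural extension is to combine the votes of several similar users, weighted by their similarity.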
![Page 17: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/17.jpg)
Main idea
[Figure: nodes U, V, W, Y and d1 ... d7]
What do we suggest to U?
![Page 18: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/18.jpg)
A glimpse of XML retrieval (eXtensible Markup Language)
Reading 10
![Page 19: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/19.jpg)
XML vs HTML
HTML is a markup language for a specific purpose (display in browsers)
XML is a framework for defining markup languages
HTML has fixed markup tags, XML does not
HTML can be formalized as an XML language (XHTML)
![Page 20: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/20.jpg)
XML Example (visual)
![Page 21: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/21.jpg)
XML Example (textual)
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
![Page 22: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/22.jpg)
Basic Structure
An XML doc is an ordered, labeled tree
character data: leaf nodes contain the actual data (text strings)
element nodes: each is labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes
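To make the tree view concrete, here is a small sketch that parses the textual example from the previous slide with Python's standard `xml.etree.ElementTree` and visits the element nodes in document order. The helper name `walk` is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<chapter id="cmds">'
    '<chaptitle>FileCab</chaptitle>'
    '<para>This chapter describes the commands that manage the '
    '<tm>FileCab</tm>inet application.</para>'
    '</chapter>'
)


def walk(elem, depth=0):
    """Yield (depth, tag, attributes, leading text) in document order."""
    yield depth, elem.tag, dict(elem.attrib), (elem.text or "").strip()
    for child in elem:  # children are kept in document order
        yield from walk(child, depth + 1)


root = ET.fromstring(xml_text)
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    print("  " * depth + f"{tag} {attrs} {text!r}")
```

ElementTree does not model character data as separate leaf nodes: text before the first child is stored in `.text`, and text following a child (here "inet application.") is stored in that child's `.tail`.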
![Page 23: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/23.jpg)
XML: Design Goals
Separate syntax from semantics to provide a framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the standard of (semi)structured information (do some of the work now done by databases)
![Page 24: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/24.jpg)
Why Use XML?
Represent semi-structured data
XML is more flexible than DBs
XML is more structured than simple IR
You get a massive infrastructure for free
![Page 25: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/25.jpg)
Data vs. Text-centric XML
Data-centric XML: used for messaging between enterprise applications
  Mainly a recasting of relational data
Text-centric XML: used for annotating content
  Rich in text
  Demands good integration of text retrieval functionality
  E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price
![Page 26: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/26.jpg)
IR Challenges in XML
There is no document unit in XML
How do we compute tf and idf?
Indexing granularity
Need to go to the document for retrieving or displaying a fragment
  E.g., give me the Abstracts of Papers on existentialism
Need to identify similar elements in different schemas
  Example: employee
![Page 27: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/27.jpg)
XQuery: SQL for XML?
Simple attribute/value
  /play/title contains “hamlet”
Path queries
  title contains “hamlet”
  /play//title contains “hamlet”
Complex graphs
  Employees with two managers
What about relevance ranking?
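The difference between the two path queries can be sketched with the XPath subset supported by Python's standard `xml.etree.ElementTree` (this is not XQuery, only an approximation of the path part). The sample play document and the helper `titles_containing` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one title directly under play, one nested deeper.
play = ET.fromstring(
    "<play>"
    "<title>Hamlet</title>"
    "<act><scene><title>Hamlet meets the ghost</title></scene></act>"
    "</play>"
)


def titles_containing(root, path, word):
    """Text of the elements matched by path whose text contains word."""
    return [t.text for t in root.findall(path)
            if word.lower() in (t.text or "").lower()]


# /play/title: only title elements that are children of the root play.
direct = titles_containing(play, "title", "hamlet")
# /play//title: title elements at any depth below play.
anywhere = titles_containing(play, ".//title", "hamlet")
print(direct)
print(anywhere)
```

Both queries return an unordered match set; neither says which title is the better answer, which is why relevance ranking is a separate question.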
![Page 28: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/28.jpg)
Data structures for XML retrieval
Inverted index: give me all elements matching text query Q
  We know how to do this: treat each element as a document
Give me all elements below any instance of the Book element (Parent/child relationship is not enough)
![Page 29: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/29.jpg)
Positional containment
Doc: 1
Play: 27 1122 2033 5790
Verse: 431 867
Term: droppeth 720
droppeth under Verse under Play.
Containment can be viewed as merging postings.
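A sketch of containment as a merge of postings, under the assumption that the numbers in an element's posting list are (start, end) position pairs, so that Play spans 27..1122 and 2033..5790 and Verse spans 431..867. The function name `merge_contained` is hypothetical.

```python
def merge_contained(outer, inner):
    """Keep the inner extents that lie inside some outer extent.

    Both lists hold (start, end) pairs sorted by start, and the outer
    extents do not overlap each other, so one forward pass suffices.
    """
    result = []
    i = 0
    for start, end in inner:
        # Skip outer extents that end before this inner extent begins.
        while i < len(outer) and outer[i][1] < start:
            i += 1
        if i < len(outer) and outer[i][0] <= start and end <= outer[i][1]:
            result.append((start, end))
    return result


play = [(27, 1122), (2033, 5790)]   # extents of Play elements in doc 1
verse = [(431, 867)]                # extents of Verse elements in doc 1
droppeth = [(720, 720)]             # a term occurrence is a 1-position extent

verses_in_play = merge_contained(play, verse)
hits = merge_contained(verses_in_play, droppeth)
print(verses_in_play, hits)
```

Each step is a single forward pass over two sorted lists, the same pattern used for intersecting ordinary postings.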
![Page 30: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/30.jpg)
Summary of data structures
Path containment etc. can essentially be solved by positional inverted indexes
Retrieval consists of “merging” postings
All the compression tricks are still applicable
Complications arise from insertion/deletion of elements, text within elements
  Beyond the scope of this course
![Page 31: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/31.jpg)
Search Engines
Advertising
![Page 32: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/32.jpg)
Classic approach…
Socio-demographic, geographic, contextual
![Page 35: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/35.jpg)
Search Engines vs Advertisement
First generation -- use only on-page, web-text data
  Word frequency and language
Second generation -- use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor-text (how people refer to a page)
Third generation -- answer “the need behind the query”
  Focus on “user need”, rather than on query
  Integrate multiple data-sources
  Click-through data
Pure search vs Paid search
  Ads show on search (who pays more), Goto/Overture
  2003 Google/Yahoo
  New model
All players now have: SE, Adv platform + network
![Page 36: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/36.jpg)
The new scenario
SEs make possible
  aggregation of interests
  unlimited selection (Amazon, Netflix, ...)
Incentives for specialized niche players
The biggest money is in the smallest sales!!
![Page 37: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/37.jpg)
Two new approaches
Sponsored search (AdWords): Ads driven by search keywords (and by the profile of the user issuing them)
![Page 38: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/38.jpg)
[Figure labels: -$ and +$]
![Page 39: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/39.jpg)
Two new approaches
Sponsored search (AdWords): Ads driven by search keywords (and by the profile of the user issuing them)
Context match (AdSense): Ads driven by the content of a web page (and by the profile of the user reaching that page)
![Page 41: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/41.jpg)
How does it work?
1) Match Ads to query or page content
2) Order the Ads
3) Pricing on a click-through
IR
Econ
![Page 42: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/42.jpg)
Visited Pages
Clicked Banner
Web Searches
Clicks on Search Results
Web usage data !!!
![Page 43: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/43.jpg)
Dictionary problem
![Page 44: Quality of a search engine](https://reader036.vdocuments.us/reader036/viewer/2022062321/5681378a550346895d9f265e/html5/thumbnails/44.jpg)
A new game
For advertisers:
  What words to buy, how much to pay
  SPAM is an economic activity
For search engine owners:
  How to price the words
  Find the right Ad
  Keyword suggestion, geo-coding, business control, language restriction, proper Ad display
Similar to web searching, but: the Ad-DB is smaller, Ad-items are small pages, ranking depends on clicks