1 random sampling from a search engines index ziv bar-yossef maxim gurevich department of electrical...
TRANSCRIPT
![Page 1: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/1.jpg)
1
Random Sampling from a Search Engine‘s IndexZiv Bar-YossefMaxim Gurevich
Department of Electrical Engineering Technion
![Page 2: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/2.jpg)
2
Search Engine Samplers
IndexPublicInterface
PublicInterface
Search Engine
Sampler
Web
D
Queries
Top k results
Random document x D
Indexed Documents
![Page 3: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/3.jpg)
3
Motivation Useful tool for search engine evaluation:
Freshness Fraction of up-to-date pages in the index
Topical bias Identification of overrepresented/underrepresented topics
Spam Fraction of spam pages in the index
Security Fraction of pages in index infected by viruses/worms/trojans
Relative Size Number of documents indexed compared with other search
engines
![Page 4: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/4.jpg)
4
Size Wars
August 2005
: We index 20 billion documents.
So, who’s right?
September 2005
: We index 8 billion documents, but our index is 3 times larger than our competition’s.
![Page 5: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/5.jpg)
5
Related Work
Random Sampling from a Search Engine’s Index[BharatBroder98, CheneyPerry05, GulliSignorni05]
Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
![Page 6: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/6.jpg)
6
Our Contributions
A pool-based sampler Guaranteed to produce near-uniform samples
A random walk sampler After sufficiently many steps, guaranteed to produce
near-uniform samples Does not need an explicit lexicon/pool at all!
Focus of this talk
![Page 7: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/7.jpg)
7
Search Engines as Hypergraphs
results(q) = { documents returned on query q } queries(x) = { queries that return x as a result } P = query pool = a set of queries Query pool hypergraph:
Vertices: Indexed documents Hyperedges: { result(q) | q P }
www.cnn.com
www.foxnews.com
news.google.com
news.bbc.co.ukwww.google.com
maps.google.com
www.bbc.co.uk
www.mapquest.com
maps.yahoot.com
“news”
“bbc”
“google”
“maps”
en.wikipedia.org/wiki/BBC
![Page 8: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/8.jpg)
8
Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)| Document degree: deg(x) = |queries(x)| Examples:
card(“news”) = 4, card(“bbc”) = 3 deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
www.cnn.com
www.foxnews.com
news.google.com
news.bbc.co.ukwww.google.com
maps.google.com
www.bbc.co.uk
www.mapquest.com
maps.yahoot.com
“news”
“bbc”
“google”
“maps”
en.wikipedia.org/wiki/BBC
![Page 9: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/9.jpg)
9
The Pool-Based Sampler: Preprocessing Step
C
Large corpusPq1
……
Query Pool
Example: P = all 3-word phrases that occur in C If “to be or not to be” occurs in C, P contains:
“to be or”, “be or not”, “or not to”, “not to be”
Choose P that “covers” most documents in D
q2
![Page 10: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/10.jpg)
10
Monte Carlo Simulation
We don’t know how to generate uniform samples from D directly
How can we use biased samples to generate uniform samples?
Samples with weights that represent their bias can be used to simulate uniform samples
Monte Carlo Simulation Methods
Rejection Sampling
Rejection Sampling
Importance Sampling
Importance Sampling
Metropolis-Hastings
Metropolis-Hastings
Maximum-Degree
Maximum-Degree
![Page 11: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/11.jpg)
11
Document Degree Distribution
We are able to generate biased samples from the “document degree distribution”
Advantage: Can compute weights representing the bias of p:
![Page 12: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/12.jpg)
12
Rejection Sampling [von Neumann]
accept := falsewhile (not accept)
generate a sample x from p toss a coin whose heads probability is wp(x) if coin comes up heads,
accept := true
return x
![Page 13: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/13.jpg)
13
Pool-Based Sampler
Degree distribution sampler
Degree distribution sampler
Search EngineSearch Engine
Rejection Sampling
Rejection Sampling
q1,q2,…results(q1), results(q2),…
x
Pool-Based Sampler
(x(x11,1/deg(x,1/deg(x11)),)),
(x(x22,1/deg(x,1/deg(x22)),…)),…
Uniform sample
Documents sampled from degree distribution with corresponding weights
Degree distribution: p(x) = deg(x) / x’deg(x’)
![Page 14: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/14.jpg)
14
Sampling documents by degree
Select a random q P Select a random x results(q) Documents with high degree are more likely to be sampled If we sample q uniformly “oversample” documents that
belong to narrow queries We need to sample q proportionally to its cardinality
www.cnn.com
www.foxnews.com
news.google.com
news.bbc.co.ukwww.google.com
maps.google.com
www.bbc.co.uk
www.mapquest.com
maps.yahoot.com
“news”
“bbc”
“google”
“maps”
en.wikipedia.org/wiki/BBC
![Page 15: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/15.jpg)
15
Sampling queries by cardinality
Sampling queries from pool uniformly: Easy Sampling queries from pool by cardinality: Hard
Requires knowing cardinalities of all queries in the search engine
Use Monte Carlo methods to simulate biased sampling via uniform sampling: Sample queries uniformly from P Compute “cardinality weight” for each sample:
Obtain queries sampled by their cardinality
![Page 16: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/16.jpg)
16
Dealing with Overflowing Queries
Problem: Some queries may overflow (card(q) > k) Bias towards highly ranked documents
Solutions: Select a pool P in which overflowing queries are rare
(e.g., phrase queries) Skip overflowing queries Adapt rejection sampling to deal with approximate weights
Theorem:
Samples of PB sampler are at most -away from uniform. ( = overflow probability of P)
![Page 17: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/17.jpg)
17
Bias towards Long Documents
0%
10%
20%
30%
40%
50%
60%
1 2 3 4 5 6 7 8 9 10
Deciles of documents ordered by size
Perc
ent
of
docu
ments
fro
m s
am
ple
.
Pool Based
Random Walk
Bharat-Broder
![Page 18: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/18.jpg)
18
Relative Sizes of Google, MSN and Yahoo!
Google = 1
Yahoo! = 1.28
MSN Search = 0.73
![Page 19: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/19.jpg)
19
Conclusions
Two new search engine samplersPool-based samplerRandom walk sampler
Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
Samplers show no or little bias in experiments.
![Page 20: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/20.jpg)
20
Thank You
![Page 21: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/21.jpg)
21
Top-Level Domains in Google, MSN and Yahoo!
0%
10%
20%
30%
40%
50%
60%
Top level domain name
Pe
rce
nt
of
do
cu
me
nts
fro
m s
am
ple
.
GoogleMSNYahoo!
![Page 22: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/22.jpg)
22
Query Cardinality Distribution
Unrealistic assumptions: Can sample queries from the cardinality distribution
In practice, don’t know a priori card(q) for all q P q P, 1 ≤ card(q) ≤ k
In practice, some queries underflow (card(q) = 0) oroverflow (card(q) > k)
results(q) = { documents returned on query q } card(q) = |results(q)| Cardinality distribution:
![Page 23: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/23.jpg)
23
Degree Distribution Sampler
Search EngineSearch Engine
results(q)
xCardinality Distribution Sampler
Cardinality Distribution Sampler
Sample x uniformly from results(q)
Sample x uniformly from results(q)
q
Degree Distribution Sampler
Query sampled from cardinality
distribution
Document sampled from
degree distribution
![Page 24: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/24.jpg)
24
Cardinality Distribution Sampler
Search EngineSearch Engine
q
Cardinality Distribution Sampler
Uniform Query Sampler
Uniform Query Sampler
Rejection Sampling
Rejection Sampling
q1,q2,…card(q1), card(q2),…
(q1,card(q1)/k),(q2,card(q2)/k),…
Sample from cardinality distribution
Uniform samples from P
![Page 25: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/25.jpg)
25
Degree Distribution Sampler
Degree Distribution Sampler
Complete Pool-Based Sampler
Search EngineSearch Engine
Rejection Sampling
Rejection Sampling
x(x,1/deg(x)),…
Uniform document
sample
Documents sampled from degree distribution with corresponding weights
Uniform Query Sampler
Uniform Query Sampler
Rejection Sampling
Rejection Sampling
(q,card(q)),…
Uniform query
sampleQuery
sampled from cardinality distribution
(q,results(q)),…
![Page 26: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion](https://reader036.vdocuments.us/reader036/viewer/2022062303/5514c800550346b0338b4c21/html5/thumbnails/26.jpg)
26
A random walk sampler Define a graph G over the indexed documents
(x,y) E iff results(x) ∩ results(y) ≠
Run a random walk on G Limit distribution = degree distribution Use MCMC methods to make limit distribution uniform.
Metropolis-Hastings Maximum-Degree
Does not need a preprocessing step Less efficient than the pool-based sampler