cs 347notes101 cs 347 parallel and distributed data processing distributed information retrieval...

37
CS 347 Notes10 1 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

Upload: ira-stone

Post on 05-Jan-2016

220 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 1

CS 347 Parallel and Distributed

Data Processing

DistributedInformation Retrieval

Hector Garcia-MolinaZoltan Gyongyi

Page 2: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 2

Web Search Engine

• Crawling• Indexing• Computing ranking features• Serving queries

Page 3: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 3

Crawling• Fetch content of web pages

web

init

get next URL

get page

extract URLs

seed URLs

URLs to visit

visited URLs

web pages

Page 4: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 4

Issues

• Scope and freshness– Not enough space/time to crawl “all”

pages– Page importance, quality, and update

frequency– Site mirrors and (near) duplicate pages– Dynamic content and crawler traps

• Load at visited web sites– Rules in robots.txt– Limit number of visits per day– Limit depth of crawl

Page 5: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 5

Issues

• Load at crawler– Variance of fetch latency/bandwidth– Parallelization and scalability

Multiple agents Partitioning URL lists Communication between agents Recovering from agent failure

Page 6: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 6

Crawl Partitioning

• Requirements– Each URL assigned to a single agent– Locally computable URL-to-agent

mapping– Balanced distribution of URLs across

agents– Contravariance

Page 7: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 7

Contravariance

Agent A

url1url3url5

Agent B

url2url4url6

Agent A

url1url2

Agent B

url3

url4

Agent C

url5url6

Page 8: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 8

Contravariance

Agent A

url1url3url5

Agent B

url2url4url6

Agent A

url1url2

Agent B

url3

url4

Agent C

url5url6

Agent A

url1url3

Agent B

url2url4

Agent C

url5url6

Page 9: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 9

Assignment

• Consistent hashing– Hash function: URL agent– Each agent “replicated” k times– Each replica mapped randomly on unit

circle Mapping persistent across agent restarts

– Lookup: map URL on unit circle; find closest live replica

Page 10: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 10

Assignment

A

AB

B

url6

Page 11: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 11

Assignment

A

AB

B

url6 A

AB

B

url6

C

C

• Balancing • Contravariance

Page 12: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 12

Crawl Partitioning

• Ideas– URL normalization

E.g., relative to absolute URL

– Host-based partitioning Reduces communication between agents Small vs. large hosts

– Geographic distribution

Page 13: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 13

Fault Tolerance

• Repartitioning • Permanent failure

– Recovering list of URLs to visit Checkpoints Communication logs

• Transient failure– Avoiding re-visiting URLs

Before fetch, check with near neighbor agents

Page 14: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 14

Indexing

• Build term-document index

● ● ●

● ●

● ●

● ●

d1 d2 d3 d4 d5 d6 dn

t1

t2

t3

t4

t5

●t6

● ●tm

Posting for t1

Lexicon

Collection

Page 15: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 15

Architecture

Web pages

Distributors Indexers Query servers

Intermediate runs

Inverted index files

Reduce

Map

Page 16: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 16

Issues

• Index partitioning– Efficient query processing

Query routing Result retrieval

Page 17: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 17

Document Partitioningd1 d2 d3 d4 d5 d6 dn

t1

t2

t3

t4

t5

t6

tm

d1 d2 d3

t1

t2

t3

t4

t5

t6

tm

d4 d5 d6

t1

t2

t3

t4

t5

t6

tm

dn-2 dn-1 dn

t1

t2

t3

t4

t5

t6

tm

Page 18: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 18

Document Partitioning

• Split the collection of documents• Advantages

– Easy to add new documents– Load balanced– High processing throughput

• Disadvantages– Communication with all query servers

Page 19: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 19

Term Partitioning

d1 d2 d3 d4 d5 d6 dn

t1

t2

t3

t4

t5

t6

tm

d1 d2 d3 d4 d5 d6 dn

t1

t2

t3

d1 d2 d3 d4 d5 d6 dn

t4

t5

t6

d1 d2 d3 d4 d5 d6 dn

tm-2

tm-1

tm

Page 20: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 20

Term Partitioning

• Split the lexicon• Advantages

– Reduced communication with query servers

• Disadvantages– More processing before partitioning– Adding new documents is hard– Load balancing is hard– Processing throughput limited by query

length

Page 21: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 21

Advanced Partitioning

• Topical partitioning using clustering– Documents clustered by term-similarity– Partitions made up of one or more

clusters

• Usage-induced partitioning– Queries extracted from logs– Documents clustered by query-similarity– Partitions made up of one or more

clusters

Page 22: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 22

Ranking Feature Computation

• Parallel/distributed computation tasks– Text/language processing– Document classification/clustering– Web graph analysis

Page 23: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 23

Example: PageRank

• Link-based global (query-independent) importance metric

• Random surfer model– Start at a random page– With probability d, navigate to new page

by following a random link on current page

– With probability (1 – d), restart at a random pagePageRank score = expected

fraction of time spent at a page

Page 24: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 24

Formula

p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n yx

Page 25: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 25

Formula

p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n

PageRank of page x

yx

PageRank of y, where y links to x

Out-degree of page y

Probability of random restart at x

Page 26: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 26

Algorithm

i = 0p[i](x) = (1 – d) / nrepeat

i += 1p[i](x) = (1 – d) / nfor all yx

p[i](x) += d ∙ p[i–1](y) / out(y)until | p[i] – p[i–1] | < ε

Page 27: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 27

Implementation

• Two vectors, current and next• Initialize vectors• Iterate over all pages y, distribute

PageRank from current(y) to next(x) for all links yx

• current = next, re-initialize next• Go back to iteration over pages or

stop

Page 28: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 28

Distribution

• MapReduce for each iteration i• Map

– Take <y, (current(y), edges(y))>– For each yx in edges(y)

emit <x, current(y) / | edges(y) |>– Also emit <y, edges(y)>

• Reduce– Take <x, val> and <x, edges(x)>– Sum (d ∙ val) into next(x), add (1 – d) / n– Emit <x, (next(x), edges(x))>

Page 29: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 29

Distribution

<y, (current(y), edges(y))>

Map

<x, val>

<x, val>

Reduce

<x, (next(x), edges(x))>

Page 30: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 30

Query Processing

• Locate, retrieve, process, and serve query results

Inverted index files

Query coordinat

orQuery

servers

Cache

Query

Results

Page 31: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 31

Architecture

• Multiple sites connected by WAN– Site = coordinator + servers + cache

• Partitioning– Parallel processing– Distributed storage of data– E.g., index partitioning

• Replication– Availability– Throughput– Response time

Page 32: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 32

Issues

• Routing the query– To sites

E.g., identical sites + routing by dynamic DNS lookup

– Within sites

• Merging the results• Caching

Page 33: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 33

Issues

Routing Merging

Document partition

All servers

Results selected by servers; ranking by coordinator

Term partition

Servers containing query terms

Selection and ranking by coordinator

Page 34: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 34

Caching

• What to cache?– Query answers– Term postings

Page 35: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 35

Caching

• What to cache?– Query answers

Faster response

– Term postings More hits

Query terms repeated more frequently than whole queries

Page 36: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 36

Caching Policy

• Terms most frequent in queries high hit ratio

• Terms most frequent in documents require more cache space

(longer postings)• Use static caching based on

query/document frequency ratio

Page 37: CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347 Notes10 37

Summary

• Crawling– Partitioning: balancing and contravariance– Consistent hashing

• Indexing– Document, term, topical, and usage-induced

partitioning

• Computing ranking features– PageRank with MapReduce

• Serving queries– Routing queries, merging results, and caching

postings