
Efficient P2P Searches Using Result-Caching

From U. of Maryland. Presented by Lintao Liu

2/24/03

Motivation & Query Model

Avoid duplicating work and data movement by caching previous query results.

Observation:

ai && aj && ak = (ai && aj) && ak

So we can keep the result for (ai && aj) as a materialized view.

Query model: disjunctions of conjunctions, i.e. (ai && aj && ak) || (bi && bj && bk) || …
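A minimal sketch of this caching idea (illustrative only, not the paper's implementation; the dict-based index and cache layout are assumptions): cached results are keyed by the canonical attribute tuple, so a prior conjunctive result can be reused as a materialized view.

```python
# Sketch of result-caching for conjunctive queries. Cached results are keyed
# by the canonical (sorted) attribute tuple, so "a && b" and "b && a" share
# one view.

cache = {}  # canonical attribute tuple -> set of matching doc IDs

def evaluate(attrs, index):
    """Evaluate a conjunctive query, reusing a cached sub-view if present."""
    key = tuple(sorted(attrs))
    if key in cache:                        # exact match: reuse the view
        return cache[key]
    if len(key) > 1 and key[:-1] in cache:  # reuse (a1 && .. && ak-1) && ak
        result = cache[key[:-1]] & index.get(key[-1], set())
    else:
        result = set.intersection(*(index.get(a, set()) for a in key))
    cache[key] = result                     # materialize for future queries
    return result

index = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 4, 5}}
print(evaluate(["a", "b"], index))       # {2, 3}, cached as ("a", "b")
print(evaluate(["b", "a", "c"], index))  # {3}, reuses the ("a", "b") view
```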

View and View Tree

A view is the cached result of a previous query.

Where to store the views: use the underlying P2P system's own mechanism. For example, in Chord, the view "a && b" is stored at the successor of Hash("a&&b") (see the sketch below).

But this mechanism alone can't be used to efficiently answer queries from views.

Why? For "a1 && a2 && … && ak", there are 2^k possible views, and you don't know which ones exist.
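A toy sketch of the Chord placement rule above, with an assumed 16-bit ring for readability (real Chord uses SHA-1's 160-bit ID space):

```python
# Chord-style placement: a view is stored at the successor of its hashed name.
import hashlib

RING_BITS = 16  # assumed small ring; real Chord uses 2^160 IDs

def ring_id(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (1 << RING_BITS)

def successor(key_id, node_ids):
    """First node clockwise from key_id on the ring."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)  # wrap around the ring

nodes = [ring_id(f"node{i}") for i in range(8)]
view = "a&&b"
print("view", view, "stored at node", successor(ring_id(view), nodes))
```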

View and View Tree (Cont.)

Another possible way: a centralized, consistent list of views.
Easy to locate, but problems: frequent updates and storage requirements.

Proposed solution: the View Tree, implemented as a trie:
a tree for storing strings in which there is one node for every common prefix; the strings themselves are stored in extra leaf nodes.
Scalable, stateless.

View Tree

All nodes at level 1 are single-attribute views.

A canonical order on the attributes is defined and used to uniquely identify equivalent conjunctive queries.

“a && b” and “b && a” are both recorded as “a && b” (alphabetical order)
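A minimal, local in-memory sketch of the View Tree as a trie, with the canonical (here alphabetical) ordering applied on insert; in the actual system the structure is distributed across the P2P network:

```python
# View Tree sketch: a trie over canonically ordered attribute strings.

class TrieNode:
    def __init__(self):
        self.children = {}    # next attribute -> TrieNode
        self.has_view = False

root = TrieNode()

def canonical(attrs):
    return sorted(attrs)      # "b && a" and "a && b" -> ["a", "b"]

def insert_view(attrs):
    node = root
    for a in canonical(attrs):
        node = node.children.setdefault(a, TrieNode())
    node.has_view = True      # a cached result lives at this position

insert_view(["b", "a"])
insert_view(["a", "b", "g"])
print(root.children["a"].children["b"].has_view)  # True: "a && b" exists
```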

Answering Queries

Finding a smallest set of views to evaluate a query is NP-hard. Instead, the following method is used:
Exact match, if such a match exists.
Forward progress: for each node accessed, at least one attribute must be located which does not occur in the views located so far.

Example of answering queries

Query: "cbagekhilo"
Step 1: match the prefix "cbag".
Step 2: "cbage" is not found, but "cbagh" exists and is useful for the query (forward progress), and so on.

Algorithm: Search(n, q)
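The actual listing is on the slide; the following is only a hedged reconstruction of the greedy search, combining descent along query attributes with the forward-progress rule (the trie is modeled as nested dicts with a "view" marker, an assumption of this sketch):

```python
# Reconstruction (not the paper's exact code) of Search(n, q): descend only
# along attributes that occur in the query, and take a view only when it
# makes forward progress, i.e. covers a not-yet-covered attribute.

def insert(trie, attrs):
    node = trie
    for a in sorted(attrs):            # canonical (alphabetical) order
        node = node.setdefault(a, {})
    node["view"] = True

def search(trie, query, covered=None, path=()):
    covered = set() if covered is None else covered
    found = []
    for attr, child in trie.items():
        if attr == "view" or attr not in query:
            continue
        new_path = path + (attr,)
        if child.get("view") and not set(new_path) <= covered:
            found.append("&&".join(new_path))  # usable view: forward progress
            covered |= set(new_path)
        found += search(child, query, covered, new_path)
    return found

root = {}
insert(root, "ab"); insert(root, "abd"); insert(root, "ce")
print(search(root, set("abcde")))  # ['a&&b', 'a&&b&&d', 'c&&e']
```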

Creating a balanced View Tree

For a query "a && b && c && … && x", there exist many equivalent views, each corresponding to a position in the View Tree, and any of them can be used to represent the result of the query.

Which one to choose, and how to keep the tree balanced? Deterministically pick a permutation P, uniformly at random among all possible permutations (see the sketch below).
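One plausible realization of this (an assumption of this sketch, not taken from the paper): seed a PRNG with a hash of the sorted attribute set, so every node derives the same "random" permutation for a given query.

```python
# Deterministic pseudo-random permutation of a query's attributes: the same
# attribute set always maps to the same View Tree position on every node.
import hashlib, random

def view_position(attrs):
    key = ",".join(sorted(attrs))
    seed = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    perm = sorted(attrs)
    random.Random(seed).shuffle(perm)  # same seed -> same permutation everywhere
    return perm

print(view_position(["c", "a", "b"]))  # identical output on every node
```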

Maintaining the View Tree

The owner of a new node needs to update all attribute indexes corresponding to the attributes of the new node. (Q: Does this cross the whole network? Do all related views need to be updated? Isn't that too expensive?)

A heartbeat is used to check the presence of child nodes and the parent node in the view tree.

Insertion of a new view is less expensive: only some child pointers need to be reassigned.

Example of a new view join

Preliminary Results

Data source and methodology:
Documents: TREC-Web data set, HTML pages with keyword meta-tags, 64K different pages for each experiment.
Queries: generated using statistical characteristics from search.com.

Preliminary results (charts): caching benefit; query-locality benefit.

On the Feasibility of P2P Web Indexing and Search

From UC Berkeley & MIT

Motivation:

Is P2P web search likely to work? Two keyword-search techniques:
Flooding (Gnutella): not discussed in this paper.
Intersection of index lists.

This paper presents a feasibility analysis based on the resource constraints and workload.

Introduction

Why the interest in P2P searching:
A good stress test for P2P architectures.
More resistant to censoring and manipulated ranking.
More robust against single-node failures.

~550 billion documents on the web; Google indexes more than 2 billion of them.

Gnutella and KaZaA: flooding search over ~500 million files, typically music files; search is on file metadata such as titles and artists.

DHT-based keyword searching: good full-text search performance with about 100,000 files (Duke Univ.).

Fundamental Constraints of the real world

We assume the following parameters:
Web documents: 3 billion
Words per file: 1,000
An inverted index would therefore have 3×10^9 × 1000 = 3×10^12 docID entries.
docID: 20 bytes (a hash of the file content)
Inverted index size: 6×10^13 bytes
Queries per second: 1,000 (Google's load)
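As a quick check, the arithmetic behind these figures:

```python
# Index-size arithmetic from the assumed parameters.
docs = 3e9             # web documents
words_per_doc = 1000
docid_bytes = 20       # hash of the file content
entries = docs * words_per_doc          # 3e12 docID entries
print(entries, entries * docid_bytes)   # 3e12 entries, 6e13 bytes (~60 TB)
```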

Fundamental constraints (cont.)

Storage constraints: at 1 GB per PC, at least 60,000 PCs are needed.

Communication constraints: assume web search may consume 10% of backbone capacity (by comparison with the traffic for DNS). In 1999, the US Internet backbone carried about 100 Gbit/s, so at 1,000 queries/sec, 10 Mbit can be used for each query.
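The same budgets, worked out:

```python
# Storage and communication budgets implied by the constraints above.
index_bytes = 6e13
pcs = index_bytes / 1e9                       # 1 GB per PC -> 60,000 PCs
backbone_bps = 100e9                          # 1999 US backbone, ~100 Gbit/s
per_query_bits = backbone_bps * 0.10 / 1000   # 10% share, 1000 queries/sec
print(pcs, per_query_bits)                    # 60000 PCs, 1e7 bits = 10 Mbit/query
```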

Basic Cost Analysis

Assumptions: a DHT-based P2P system and two-term queries (each search has two keywords).

In MIT's trace (1.7 million web pages, 81,000 queries), about 300,000 bytes are moved for each query. Scaled to the Internet (3 billion pages), this might require 530 MB for each query. (Q: Is that true?)
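The 530 MB figure follows from linear scaling of the measured traffic:

```python
# Linear scaling of the measured MIT trace to web size.
bytes_per_query = 3e5            # 300,000 bytes on a 1.7M-page corpus
scale = 3e9 / 1.7e6              # 3 billion pages vs. 1.7 million
print(bytes_per_query * scale)   # ~5.3e8 bytes, i.e. about 530 MB per query
```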

Optimizations

All the optimizations were evaluated on the 81,000 queries from mit.edu.

Caching and precomputation:
Caching received posting lists reduces communication cost by 38%.
Computing and storing the intersections of different posting lists in advance: with 3% of all possible term pairs precomputed (the most popular words, following the Zipf distribution), the communication cost is reduced by 50%.

Compression

Bloom filters:
Two-round Bloom intersection (one node sends the Bloom filter of its posting list to another node, which returns the result): compression ratio 13.
Four-round Bloom intersection: compression ratio 40.
Compressed Bloom filters: a further 30% improvement.
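A minimal sketch of the two-round Bloom intersection (filter size and hash count here are illustrative assumptions; the paper tunes them to reach its quoted ratios):

```python
# Two-round Bloom intersection: node A sends a filter of its posting list
# (M bits, far smaller than the list); node B returns only candidates that
# match, a superset of the true intersection due to false positives.
import hashlib

M, K = 1 << 14, 4  # filter bits, hash functions (assumed parameters)

def positions(item):
    h = hashlib.sha1(str(item).encode()).digest()
    return [int.from_bytes(h[4*i:4*i+4], "big") % M for i in range(K)]

def make_filter(items):
    bits = bytearray(M // 8)
    for item in items:
        for p in positions(item):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def maybe_contains(bits, item):
    return all(bits[p // 8] & (1 << (p % 8)) for p in positions(item))

a_list = set(range(0, 10000, 7))
f = make_filter(a_list)                 # round 1: A -> B, the filter only
b_list = set(range(0, 10000, 5))
candidates = {d for d in b_list if maybe_contains(f, d)}  # round 2: B -> A
print(len(candidates), len(a_list & b_list))  # candidates >= true intersection
```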

Gap compression: effective when the gaps between sorted docIDs are small, since fewer bits are then required per docID.
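A sketch of gap (delta) coding with a byte-oriented varint, assuming small integer docIDs for illustration (the paper's docIDs are 20-byte hashes, so gap coding pays off only once IDs are assigned densely, e.g. by the clustering described below):

```python
# Gap (delta) coding with a simple varint: a small gap between consecutive
# sorted docIDs takes one byte instead of a full-width ID.
def encode_gaps(sorted_ids):
    out, prev = bytearray(), 0
    for d in sorted_ids:
        gap = d - prev
        prev = d
        while gap >= 0x80:               # 7 bits per byte, high bit = "more"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

ids = [100, 103, 104, 110, 250]
print(len(encode_gaps(ids)), "bytes for", len(ids), "docIDs")  # 6 bytes
```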

Compression (cont.)

Adaptive set intersection: exploit the structure in the posting lists to avoid transferring entire lists.
Example: intersecting {1, 3, 4, 7} && {8, 10, 20, 30} requires only one element exchange, since 7 < 8 shows the lists do not overlap.
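A sketch of one common adaptive strategy, skipping ahead with binary search; this is a local stand-in for the distributed version, where the saving shows up as fewer elements exchanged between nodes:

```python
# Adaptive set intersection: cost tracks how interleaved the lists are,
# not their total length.
from bisect import bisect_left

def adaptive_intersect(a, b):
    """Intersect two sorted lists, skipping gaps with binary search."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)  # skip past the gap in one step
        else:
            j = bisect_left(b, a[i], j + 1)
    return out

# The slide's example: one probe shows 7 < 8, so the intersection is empty.
print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))  # []
```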

Clustering: similar documents are grouped together based on their term occurrences and assigned adjacent docIDs, which improves the compression ratio of adaptive set intersection with gap compression to 75.

Optimization Techniques and Improvements

Even with these optimizations, the cost is still one order of magnitude higher than the budget.

Two Compromises and Conclusion

Compromising result quality.
Compromising P2P structure.

Conclusion: naïve implementations of P2P web search are not feasible. The most effective optimizations bring the problem to within an order of magnitude of feasibility. Two possible compromises are proposed; all of them combined would bring us within the feasibility range for P2P web search. (Q: Sure?)