Efficient P2P Searches Using Result-Caching, from U. of Maryland. Presented by Lintao Liu, 2/24/03
Motivation & Query Model
Avoid duplicating work and data movement by caching previous query results.
Observation: a_i && a_j && a_k = (a_i && a_j) && a_k
So we can keep the result for (a_i && a_j) as a materialized view.
Query model: (a_i && a_j && a_k) || (b_i && b_j && b_k) || …
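The observation above can be sketched in code. This is a minimal illustration, not the paper's implementation; the posting lists and the `view` helper are hypothetical:

```python
postings = {
    "a": {1, 2, 3, 5},
    "b": {2, 3, 5, 8},
    "c": {3, 5, 9},
}
cache = {}  # canonical view name -> materialized result set

def view(attrs):
    """Result set for attrs[0] && attrs[1] && ..., reusing cached prefix views."""
    attrs = sorted(attrs)                 # canonical attribute order
    key = " && ".join(attrs)
    if key not in cache:
        if len(attrs) == 1:
            cache[key] = set(postings[attrs[0]])
        else:
            # (a && b && c) = (a && b) && c: build on the cached prefix view
            cache[key] = view(attrs[:-1]) & postings[attrs[-1]]
    return cache[key]

print(view(["a", "b", "c"]))  # {3, 5}; "a && b" is materialized along the way
```

A later query for "a && b" is then answered directly from the cache.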
View and View Tree
A view is the cached result for a previous query.
Where to store the views: use the underlying P2P system's lookup mechanism.
For example, in Chord, the view "a && b" is stored at the successor of Hash("a && b").
But this mechanism alone cannot efficiently answer view queries.
Why? For "a1 && a2 && … && ak", there are 2^k possible views, and you don't know which ones exist.
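The placement rule can be sketched as follows. The ring size, node IDs, and `successor` helper are simplified assumptions for illustration, not Chord's actual routing:

```python
import hashlib

def chord_id(key, bits=16):
    """Hash a string key onto a small identifier ring (ring size is illustrative)."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

def successor(key_id, node_ids):
    """First node clockwise from key_id on the ring (simplified lookup, no routing)."""
    later = [n for n in sorted(node_ids) if n >= key_id]
    return later[0] if later else min(node_ids)

nodes = [1000, 17000, 33000, 52000]           # hypothetical node identifiers
owner = successor(chord_id("a&&b"), nodes)    # node that stores the view "a && b"
```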
View and View Tree (Cont.)
Another possible approach: a centralized, consistent list of views.
Easy to locate, but it has problems: frequent updates and heavy storage requirements.
Proposed solution: the View Tree, implemented as a trie:
a tree for storing strings in which there is one node for every common prefix; the strings themselves are stored in extra leaf nodes.
Scalable and stateless.
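A trie of this kind can be sketched as below (single-character attributes for brevity; the `TrieNode` structure is illustrative, not the paper's data layout):

```python
# Minimal trie over attribute strings: one node per common prefix,
# full view names stored at terminal nodes. Illustrative only.
class TrieNode:
    def __init__(self):
        self.children = {}      # attribute character -> TrieNode
        self.view = None        # view name stored at this node, if any

def insert(root, view_name):
    node = root
    for ch in view_name:
        node = node.children.setdefault(ch, TrieNode())
    node.view = view_name

def longest_prefix(root, query):
    """Length of the longest path in the trie matching a prefix of the query."""
    node, depth = root, 0
    for ch in query:
        if ch not in node.children:
            break
        node = node.children[ch]
        depth += 1
    return depth

root = TrieNode()
for v in ["abc", "abd", "ae"]:
    insert(root, v)
# "abc" and "abd" share the single common-prefix path "ab"
```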
View Tree
All single-attribute views sit at level 1.
A canonical order on the attributes is defined and used to uniquely identify equivalent conjunctive queries:
"a && b" and "b && a" are both recorded as "a && b" (alphabetical order).
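The canonical naming rule is just a sort; a one-line sketch:

```python
def canonical(attrs):
    """Canonical (alphabetical) name shared by all equivalent conjunctive queries."""
    return " && ".join(sorted(attrs))

print(canonical(["b", "a"]))  # a && b, the same name as canonical(["a", "b"])
```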
Answering Queries
Finding a smallest set of views to evaluate a query is NP-hard. Instead, the following method is used:
Exact match, if such a match exists.
Forward progress: for each node accessed, at least one attribute must be located which does not occur in the views found so far.
Example of answering queries
Query: "cbagekhilo"
Step 1: match the prefix "cbag".
Step 2: "cbage" is not found, but "cbagh" exists and is useful for the query (forward progress), and so on.
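The forward-progress rule can be sketched as a greedy selection. The view store and the largest-first tie-breaking here are illustrative assumptions, not the paper's actual tree walk:

```python
def select_views(query_attrs, available_views):
    """Greedy selection: each chosen view must cover at least one attribute
    not covered by the views selected so far (forward progress)."""
    remaining = set(query_attrs)
    chosen = []
    for view in sorted(available_views, key=len, reverse=True):  # prefer larger views
        new = remaining & set(view)
        if new:                       # forward progress: this view adds something new
            chosen.append(view)
            remaining -= new
        if not remaining:
            break
    return chosen, remaining          # leftover attributes must be fetched directly

# Views are written as strings of single-character attributes:
chosen, left = select_views("cbagekhilo", ["cbag", "cbagh", "ek"])
print(chosen, left)   # "cbag" is skipped: it adds nothing beyond "cbagh"
```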
Creating a balanced View Tree
For a query "a && b && c && … && x", there exist many equivalent views, each corresponding to a position in the View Tree, and any of them can be used to represent the result of the search query.
Which one should be chosen, and how can the tree be kept balanced? Deterministically pick a permutation P of the attributes, distributed uniformly at random over all possible permutations.
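One way to realize "deterministic yet uniformly random" is to seed a PRNG with a hash of the attribute set, so every node derives the same permutation; this construction is an illustrative assumption, not necessarily the paper's exact scheme:

```python
import hashlib
import random

def deterministic_permutation(attrs):
    """Same pseudorandom permutation on every node: seed a PRNG with a hash
    of the sorted attribute set, then shuffle. Illustrative sketch."""
    key = "&&".join(sorted(attrs))
    seed = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    perm = sorted(attrs)
    rng.shuffle(perm)
    return perm

# Any node computing the permutation for the same attribute set gets the same order:
assert deterministic_permutation(["c", "a", "b"]) == deterministic_permutation(["b", "c", "a"])
```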
Maintaining the View Tree
The owner of a new node needs to update all attribute indexes corresponding to the attributes of the new node. (Q: Does this cross the whole network? Do all related views need to be updated? Isn't that too expensive?)
A heartbeat is used to check the presence of child nodes and the parent node in the view tree.
Insertion of a new view is less expensive: only some child pointers need to be reassigned.
Preliminary Results
Data source and methodology:
Documents: TREC-Web data set (HTML pages with keyword meta-tags), 64K different pages for each experiment.
Queries: generated using statistical characteristics from search.com.
Motivation:
Is P2P web search likely to work?
Two keyword-search techniques:
Flooding (Gnutella): not discussed in this paper.
Intersection of index lists.
This paper presents a feasibility analysis based on resource constraints and workload.
Introduction
Why P2P searching is interesting:
A good stress test for P2P architectures.
More resistant to censoring and manipulated ranking.
More robust against single-node failure.
550 billion documents on the web; Google indexes more than 2 billion of them.
Gnutella and KaZaA: flooding search over about 500 million files, typically music files; search covers file meta-data such as title and artist.
DHT-based keyword searching: good full-text search performance with about 100,000 files (Duke Univ.).
Fundamental Constraints of the Real World
We assume the following parameters:
Web documents: 3 billion
Words per file: 1000
The inverted index would then contain 3×10^9 × 1000 = 3×10^12 docID entries.
docID: 20 bytes (a hash of the file content)
Inverted index size: 6×10^13 bytes
Queries per second: 1000 (Google)
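These numbers multiply out as follows, a quick sanity check of the slide's arithmetic:

```python
docs = 3 * 10**9        # web documents
words_per_doc = 1000
docid_bytes = 20        # hash of the file content

entries = docs * words_per_doc        # 3e12 docID entries in the inverted index
index_bytes = entries * docid_bytes   # 6e13 bytes, i.e. about 60 TB
print(index_bytes)                    # 60000000000000
```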
Fundamental Constraints (Cont.)
Storage constraints: at 1 GB per PC, at least 60,000 PCs are needed.
Communication constraints: assume web search may consume 10% of backbone capacity (by comparison with the traffic for DNS).
In 1999, the US Internet backbone carried about 100 Gbit/s; at 1000 queries/sec, about 10 Mbit can be used per query.
Basic Cost Analysis
Assumptions: a DHT-based P2P system; two-term queries (each search has 2 keywords).
At MIT (1.7 million web pages, 81,000 queries), about 300,000 bytes are moved per query. Scaled to the whole Internet (3 billion pages), this would require about 530 MB per query. (Q: Is that true?)
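The scaling step is linear in the number of pages; checking the figure:

```python
mit_pages = 1.7 * 10**6            # pages in the MIT trace
web_pages = 3 * 10**9              # estimated web size
bytes_per_query_mit = 300_000      # measured per-query traffic

scale = web_pages / mit_pages                       # about 1765x more pages
bytes_per_query_web = bytes_per_query_mit * scale   # about 5.3e8 bytes
print(round(bytes_per_query_web / 10**6), "MB")     # roughly 530 MB per query
```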
Optimizations
All optimizations were evaluated on the 81,000 queries from mit.edu.
Caching and precomputation:
Caching received posting lists reduces communication cost by 38%.
Computing and storing intersections of posting lists in advance: if 3% of all possible term pairs are precomputed (the most popular words, following the Zipf distribution), communication cost is reduced by 50%.
Compression
Bloom filters:
Two-round Bloom intersection (one node sends the Bloom filter of its posting list to another node, which returns the result): compression ratio 13.
Four-round Bloom intersection: compression ratio 40.
Compressed Bloom filters: a further 30% improvement.
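A two-round Bloom intersection can be sketched as below. The filter size, hash count, and posting lists are illustrative assumptions; real parameters would be tuned to the list sizes:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: m bits, k hash functions (parameters illustrative)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

# Round 1: node A sends Bloom(L_A) instead of the full list.
list_a = {1, 3, 4, 7, 100}
list_b = {3, 7, 8, 200}
f = Bloom()
for d in list_a:
    f.add(d)

# Round 2: node B returns the candidates (a superset of the true intersection).
candidates = {d for d in list_b if f.might_contain(d)}
# A removes any false positives locally:
result = candidates & list_a
```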
Gap compression: effective when the gaps between sorted docIDs are small, since small gaps need fewer bits to encode than full docIDs.
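Gap compression can be sketched with a varint encoder; the posting list and encoding are illustrative:

```python
def gap_encode(sorted_docids):
    """Store differences between consecutive docIDs instead of raw values;
    small gaps fit in few bytes (encoded here as varints, illustrative)."""
    out = bytearray()
    prev = 0
    for d in sorted_docids:
        gap = d - prev
        prev = d
        while gap >= 0x80:                  # 7 payload bits per byte, high bit = "more"
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

# A dense posting list has tiny gaps, so most docIDs cost 1 byte instead of 20:
encoded = gap_encode([1000, 1001, 1003, 1007, 1010])
print(len(encoded))   # 6 bytes for five docIDs
```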
Compression (Cont.)
Adaptive set intersection: exploit the structure of the posting lists to avoid transferring entire lists.
Example: intersecting {1, 3, 4, 7} and {8, 10, 20, 30} requires exchanging only one element, since 7 < 8 already proves the intersection is empty.
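Adaptive intersection can be sketched with binary-search skipping; this is a generic single-machine version of the idea, not the paper's exact network protocol:

```python
from bisect import bisect_left

def adaptive_intersect(a, b):
    """Intersect two sorted lists by binary-searching ahead instead of scanning;
    when the lists barely interleave, few comparisons are needed."""
    if len(a) > len(b):
        a, b = b, a                 # probe with the shorter list
    result, lo = [], 0
    for x in a:
        lo = bisect_left(b, x, lo)  # skip everything in b below x
        if lo == len(b):
            break                   # b is exhausted: nothing left to match
        if b[lo] == x:
            result.append(x)
            lo += 1
    return result

print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))  # [] (7 < 8, disjoint)
```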
Clustering: similar documents are grouped together based on their term occurrences and assigned adjacent docIDs, which improves the compression ratio of adaptive set intersection with gap compression to 75.
Two Compromises and Conclusion
Compromising result quality.
Compromising P2P structure.
Conclusion:
Naïve implementations of P2P Web search are not feasible.
The most effective optimizations bring the problem to within an order of magnitude of feasibility.
Two possible compromises are proposed; combined, they would bring P2P Web search within the feasibility range. (Q: Sure?)