Post on 18-Dec-2015
DELIS Highlights:
Efficient and Intelligent Top-k Search
in Peer-to-Peer Systems
presented by Gerhard Weikum (Max-Planck Institute of Computer Science)
Gerhard Weikum (MPII)
Data Management on Dynamic P2P, Subproject 6
Introduction
Why Peer-to-Peer Web Search?
Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality
• Proof of Concept for Scalable & Self-Organizing Data Structures and Algorithms (e.g., DHTs, Randomized Overlay Networks, Epidemic Spreading)
• Powerful Search Methods for Each Peer (Concept-based Search, Query Expansion, Personalization, etc.)
• Leverage Intellectual Input at Each Peer (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
• Collaboration among Peers (Query Routing, Incentives, Fairness, Anonymity, etc.)
• Better Search Result Quality (Precision, Recall, etc.)
• Breaking Information Monopolies
• Testbed for CS Models, Algorithms, Technologies and Experimental Platform
What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia):
drama with three women making a prophecy to a British nobleman that he will become king
Outline
• Vision
• Demo
• Efficient Top-k Search
• Ontology-based Query Expansion
• Exploiting User Behavior
• Isolating Selfish Peers
Efficient Top-k Search
TA: efficient & principled top-k query processing with monotonic score aggr.

Data items: d1, …, dn (e.g., d1 with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2)
Query: q = (t1, t2, t3), k = 1

Index lists:
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, …, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
STOP!

TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
scan index lists; consider d at pos_i in L_i;
  E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
  worstscore(d) := aggr{s(t_ν, d) | ν ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;

Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index
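The scan in the example can be replayed with a small Python sketch. This is only an illustration under assumptions: aggregation is summation, each list holds just the first three entries that the score tables of the example confirm, and the names `nra_top_k`, `T1`, `T2`, `T3` are hypothetical. The threshold also covers documents not seen in any list so far (via the sum of the `high` values), which the slide's pseudocode leaves implicit.

```python
from typing import Dict, List, Tuple

IndexList = List[Tuple[str, float]]  # (doc id, score), sorted by descending score

# First three entries of each index list in the slide's example.
T1: IndexList = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)]
T2: IndexList = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)]
T3: IndexList = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)]


def nra_top_k(lists: List[IndexList], k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k docs with their worstscores, scan depth reached)."""
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {list index: score}
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    worst: Dict[str, float] = {}
    depth = 0
    for depth in range(1, max((len(lst) for lst in lists), default=0) + 1):
        # One round of sorted accesses: read position `depth` in every list.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                doc, score = lst[depth - 1]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: no unseen entries remain
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: worst[d] + sum(h for i, h in enumerate(high) if i not in s)
            for d, s in seen.items()
        }
        ranked = sorted(seen, key=lambda d: worst[d], reverse=True)
        top, cand = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Best score any non-top-k document could still reach,
            # including documents that have not been seen at all.
            threshold = max([best[d] for d in cand] + [sum(high)])
            if threshold <= min_k:
                return [(d, worst[d]) for d in top], depth
    ranked = sorted(seen, key=lambda d: worst[d], reverse=True)
    return [(d, worst[d]) for d in ranked[:k]], depth
```

Calling `nra_top_k([T1, T2, T3], k=1)` stops at scan depth 3 with d10 as the single result, matching the tables above.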
Probabilistic Pruning
Probabilistic Pruning of Top-k Candidates
TA family of algorithms based on invariant (with sum as aggr):
$$\underbrace{\sum_{i \in E(d)} s_i(d)}_{worstscore(d)} \;\le\; s(d) \;\le\; \underbrace{\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i}_{bestscore(d)}$$
• Add d to top-k result if worstscore(d) > min-k
• Drop d only if bestscore(d) < min-k, otherwise keep in PQ
Often overly conservative (deep scans, high memory for PQ)
• Approximate top-k with probabilistic guarantees:
$$p(d) := P\Big[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \text{min-}k\Big]$$
discard candidates d from queue if p(d) ≤ ε
E[rel. precision@k] = 1 − ε
score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
[Figure: score over scan depth, showing bestscore(d), worstscore(d) and min-k; d is dropped from the priority queue]
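One of the listed score predictors, histogram convolution, can be sketched as follows. This is a minimal illustration and not the method as implemented in the talk's system: it assumes independent lists, discretised per-list score histograms (the `HIST` values are made up), and conditions each histogram on the current `high` value of its list. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Dist = Dict[float, float]  # score value -> probability

# Hypothetical score histogram of one index list; score 0.0 stands for
# "the document does not occur in this list".
HIST: Dist = {0.0: 0.5, 0.25: 0.25, 0.5: 0.125, 0.75: 0.125}


def convolve(a: Dist, b: Dist) -> Dist:
    """Distribution of X + Y for independent X ~ a and Y ~ b."""
    out: Dict[float, float] = defaultdict(float)
    for x, px in a.items():
        for y, py in b.items():
            out[round(x + y, 6)] += px * py
    return dict(out)


def truncate(hist: Dist, high: float) -> Dist:
    """Condition a histogram on score <= high (the current scan position)."""
    kept = {s: p for s, p in hist.items() if s <= high}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()} if total > 0 else {0.0: 1.0}


def qualify_probability(worstscore: float, hists: List[Dist],
                        highs: List[float], min_k: float) -> float:
    """Estimate p(d): chance that the unseen scores lift d above min-k."""
    dist: Dist = {0.0: 1.0}
    for hist, high in zip(hists, highs):  # one entry per list not in E(d)
        dist = convolve(dist, truncate(hist, high))
    return sum(p for s, p in dist.items() if worstscore + s > min_k)


def should_drop(worstscore: float, hists: List[Dist], highs: List[float],
                min_k: float, epsilon: float) -> bool:
    """Discard candidate d from the queue if p(d) <= epsilon."""
    return qualify_probability(worstscore, hists, highs, min_k) <= epsilon
```

For a candidate with worstscore 1.5, one unseen list with `high` = 0.5 and min-k = 1.75, the estimate is 1/7, so the candidate is dropped for ε = 0.2 but kept for ε = 0.1.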
Experiments with TREC-12 Web-Track Benchmark
on .GOV corpus from TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.:
• "Lewis Clark expedition"
• "juvenile delinquency"
• "legalization Marihuana"
• "air bag safety reducing injuries death facts"

                     TA-sorted    Prob-sorted (smart)
#sorted accesses     2,263,652    527,980
elapsed time [s]     148.7        15.9
max queue size       10849        400
relative precision   1            0.87
rank distance        0            39.5
score error          0            0.031

speedup by factor 10 at high precision/recall (relative to TA-sorted);
aggressive queue mgt. even yields factor 100 at 30-50 % prec./recall
Query Expansion
Threshold-based query expansion:
substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) ≥ threshold
Problem: choice of threshold
"Old hat" in IR; highly disputed for danger of topic dilution
Approach to careful expansion:
• determine phrases from query or best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
• if uniquely mapped to one concept then expand with synonyms and weighted hyponyms
• alternatively use statistical learning methods for word sense disambiguation
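Threshold-based expansion and its sensitivity to the threshold can be illustrated with a few lines of Python. The function name `expand` and the parameter name `theta` are hypothetical; the similarity values are the thesaurus entry for "performance" shown on a later slide of this talk.

```python
from typing import Dict, List

# Thesaurus entry for "performance" as given later in the talk.
SIM: Dict[str, Dict[str, float]] = {
    "performance": {
        "response time": 0.7,
        "throughput": 0.6,
        "queueing": 0.3,
        "delay": 0.25,
    },
}


def expand(term: str, sim: Dict[str, Dict[str, float]], theta: float) -> List[str]:
    """Return all concepts c with sim(term, c) >= theta, most similar first."""
    related = sim.get(term, {})
    chosen = [c for c, s in related.items() if s >= theta]
    return sorted(chosen, key=lambda c: related[c], reverse=True)
```

With `theta = 0.5` the term expands to "response time" and "throughput" only; lowering the threshold to 0.25 pulls in "queueing" and "delay" as well, which is where topic dilution can start.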
Query Expansion Example
From TREC 2004 Robust Track:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
135530 sorted accesses in 11.073 s.
Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...
A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ...
Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}}
Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ...
... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ...
Top-k with Query Expansion
Top-k Query Processing with Query Expansion
consider expandable query "algorithm and ~performance" with score
$$\sum_{i \in q} \max_{j \in onto(i)} \{\, sim(i,j) \cdot s_j(d) \,\}$$
dynamic query expansion with incremental on-demand merging of additional index lists
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift

B+ tree index on terms (doc: score):
algorithm: 57: 0.6, 44: 0.4, ..., 52: 0.4, 33: 0.3, 75: 0.3
performance: 12: 0.9, 14: 0.8, ..., 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5
response time: 37: 0.9, 44: 0.8, ..., 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6
throughput: 92: 0.9, 67: 0.9, ..., 52: 0.9, 44: 0.8, 55: 0.8

thesaurus / meta-index:
performance: response time: 0.7, throughput: 0.6, queueing: 0.3, delay: 0.25, ...
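The idea of incremental on-demand merging can be sketched in Python as a lazy merge over the index lists of the expansion terms. This is an assumption-laden illustration, not the system's actual implementation: scores are assumed to lie in [0, 1], so a list with similarity weight sim cannot contribute anything above sim and is opened only when that bound reaches the current head of the merged stream. The toy lists use only the leading entries from the example, and `incremental_merge` and the `opened` bookkeeping list are hypothetical names.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

IndexList = List[Tuple[int, float]]  # (doc id, score), descending by score

# Leading entries of the example lists; "performance" itself has sim 1.0.
INDEX: Dict[str, IndexList] = {
    "performance": [(12, 0.9), (14, 0.8), (28, 0.6)],
    "response time": [(37, 0.9), (44, 0.8), (22, 0.7)],
    "throughput": [(92, 0.9), (67, 0.9), (52, 0.9)],
}
EXPANSIONS = [("performance", 1.0), ("response time", 0.7), ("throughput", 0.6)]


def incremental_merge(expansions: List[Tuple[str, float]],
                      index: Dict[str, IndexList],
                      opened: List[str]) -> Iterator[Tuple[int, str, float]]:
    """Yield (doc, term, sim * score) in descending order of sim * score."""
    sims = dict(expansions)
    pending = sorted(expansions, key=lambda e: e[1], reverse=True)
    heap: List[Tuple[float, str, int]] = []  # (-weighted score, term, position)
    seen = set()

    def push(term: str, pos: int) -> None:
        if pos < len(index.get(term, [])):
            heapq.heappush(heap, (-sims[term] * index[term][pos][1], term, pos))

    while heap or pending:
        # Open the next list only if its upper bound (its sim weight)
        # could beat or tie the best entry of the lists opened so far.
        while pending and (not heap or pending[0][1] >= -heap[0][0]):
            term, _ = pending.pop(0)
            opened.append(term)
            push(term, 0)
        if not heap:
            break
        neg, term, pos = heapq.heappop(heap)
        push(term, pos + 1)
        doc = index[term][pos][0]
        if doc not in seen:  # first occurrence carries the max over expansions
            seen.add(doc)
            yield doc, term, -neg
```

Reading the first two entries of the merged stream touches only the "performance" list; the lists for "response time" and "throughput" are opened later, once their weighted bounds become competitive.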
Experiments with TREC-13 Robust-Track Benchmark
on Aquaint corpus (news articles): 528 000 docs, 2 GB raw data, 8 GB for all indexes
with Okapi BM25 probabilistic scoring model
50 most difficult queries, e.g.:
"transportation tunnel disasters"
"Hubble telescope achievements"
potentially expanded into:
"earthquake, flood, wind, seismology, accident, car, auto, train, ..."
"astronomical, electromagnetic radiation, cosmic source, nebulae, ..."

                   no exp.      static exp.        static exp.        incr. merge
                   (ε=0.1)      (θ=0.3, ε=0.0)     (θ=0.3, ε=0.1)     (ε=0.1)
#sorted acc.       1,333,756    10,586,175         3,622,686          5,671,493
#random acc.       0            555,176            49,783             34,895
elapsed time [s]   9.3          156.6              79.6               43.8
max #terms         4            59                 59                 59
relative prec.     0.934        1.0                0.541              0.786
precision@10       0.248        0.286              0.238              0.298
MAP                0.091        0.111              0.086              0.110

speedup by factor 4 at high precision/recall;
no topic drift, no need for threshold tuning;
also handles TREC-13 Terabyte benchmark
Exploiting User Behavior
Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
$$PR(q) = \epsilon \cdot j(q) + (1-\epsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$$
Authority (page q) = stationary prob. of visiting q
from PageRank: uniformly random choice of links + random jumps
to QRank:
+ query-doc transitions
+ query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
$$QR(q) = \epsilon \cdot j(q) + (1-\epsilon) \cdot \Big[\, \beta \sum_{p \in explicitIN(q)} PR(p) \cdot t(p,q) + (1-\beta) \sum_{p \in implicitIN(q)} PR(p) \cdot sim(p,q) \Big]$$
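Both rank definitions can be approximated by power iteration. The sketch below is only an illustration under several assumptions that the slides do not state: the jump vector j is uniform, explicit transitions t(p,q) are uniform over out-links, similarity weights are positive and normalised per source node, and a single rank vector is iterated. The function name `rank` and the parameters `eps` and `beta` are hypothetical. With `beta = 1.0` and no implicit links it reduces to plain PageRank.

```python
from typing import Dict, List, Optional


def rank(explicit: Dict[str, List[str]],
         implicit: Optional[Dict[str, Dict[str, float]]] = None,
         eps: float = 0.15, beta: float = 1.0,
         iters: int = 100) -> Dict[str, float]:
    """Power iteration over explicit links plus weighted implicit links."""
    nodes = sorted(explicit)  # every link target must also be a key
    n = len(nodes)
    implicit = implicit or {}
    pr = {q: 1.0 / n for q in nodes}
    for _ in range(iters):
        nxt = {q: eps / n for q in nodes}  # eps * j(q) with uniform j
        for p in nodes:
            out = explicit[p]
            sims = implicit.get(p, {})
            trans: Dict[str, float] = {}
            if not out and not sims:
                trans = {q: 1.0 / n for q in nodes}  # dangling node
            else:
                b = beta if sims else 1.0  # no implicit links: explicit only
                b = b if out else 0.0      # no explicit links: implicit only
                total = sum(sims.values())
                for q in out:
                    trans[q] = trans.get(q, 0.0) + b / len(out)
                for q, s in sims.items():
                    trans[q] = trans.get(q, 0.0) + (1.0 - b) * s / total
            for q, w in trans.items():
                nxt[q] += (1.0 - eps) * pr[p] * w
        pr = nxt
    return pr
```

For example, on the toy graph `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}` an added implicit link from b to a with `beta = 0.5` shifts authority from c towards a.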
Preliminary Experiments
Setup:
70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
ca. 15 000 implicit links based on doc-doc similarity

Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%

Untrained example query "philosophy":
    PageRank                    QRank
1.  Philosophy                  Philosophy
2.  GNU free doc. license       GNU free doc. license
3.  Free software foundation    Early modern philosophy
4.  Richard Stallman            Mysticism
5.  Debian                      Aristotle
Self-Organization for Isolating Selfish Peers
Collaborative P2P Search
query peer P0
local index X0
bookmarks B0
peer lists (directory):
term a: 17, 11, 92, ...
term c: 13, 92, 45, ...
term f: 43, 65, 92, ...
term g: 13, 11, 45, ...
url x: 37, 44, 12, ...
url y: 75, 43, 12, ...
url z: 54, 128, 7, ...
Susceptible to misbehavior!
How do we identify and penalize or isolate selfish/malicious peers?
Self-Organization for Isolating Selfish Peers
Rationale:
• mimic evolution in biological / social networks
• tag selfish vs. altruistic peers and bias interactions towards similar peers
Algorithm:
periodically do
  each peer compares its "utility" with a random peer
  if the other peer has higher utility then
    copy that peer's strategy and links (reproduction)
    mutate with small probability: change behavior, change links
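One cycle of this compare, copy and mutate scheme might look as follows in Python. This is a rough sketch under assumptions: the `Peer` class, the `evolve` function and the `mutation_rate` parameter are hypothetical names, mutation is applied only right after a copy, a mutated peer gets a single random link, and how utilities are measured between cycles is left to the caller.

```python
import random
from typing import List


class Peer:
    def __init__(self, selfishness: float, links: List[int]) -> None:
        self.selfishness = selfishness  # P in [0, 1]: 1.0 selfish, 0.0 altruistic
        self.links = list(links)        # indices of neighbouring peers
        self.utility = 0.0              # e.g. number of hits in the last period


def evolve(peers: List[Peer], rng: random.Random,
           mutation_rate: float = 0.01) -> None:
    """Run one cycle: each peer compares itself with a random peer."""
    n = len(peers)
    for me in peers:
        other = peers[rng.randrange(n)]
        if other.utility > me.utility:
            # Reproduction: copy the more successful peer's strategy and links.
            me.selfishness = other.selfishness
            me.links = list(other.links)
            # Mutation with small probability: change behavior, change links.
            if rng.random() < mutation_rate:
                me.selfishness = rng.random()
            if rng.random() < mutation_rate:
                me.links = [rng.randrange(n)]
```

In a full simulation each cycle would be followed by a period in which peers issue and answer queries according to their current `selfishness`, and `utility` would be recomputed from the resulting hits.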
Simulation Results for P2P File Sharing
• peers generate queries and answer queries based on P ∈ [0,1] with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
• peer utility = # hits (queries answered)
• mutation: change P randomly
[Figure: typical run for 10^4 peers; average per node of queries generated and hits (y-axis, 0 to 60) over cycles (x-axis, 0 to 100)]
Selfishness reduces
Average performance increases
The End
Thank you!