
DELIS Highlights:

Efficient and Intelligent Top-k Search

in Peer-to-Peer Systems

presented by Gerhard Weikum (Max-Planck Institute of Computer Science, MPII)

Data Management on Dynamic P2P, Subproject 6

Introduction

Why Peer-to-Peer Web Search?

Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality

• Proof of Concept for Scalable & Self-Organizing Data Structures and Algorithms (e.g., DHTs, Randomized Overlay Networks, Epidemic Spreading)

• Powerful Search Methods for Each Peer (Concept-based Search, Query Expansion, Personalization, etc.)

• Leverage Intellectual Input at Each Peer (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)

• Collaboration among Peers (Query Routing, Incentives, Fairness, Anonymity, etc.)

• Better Search Result Quality (Precision, Recall, etc.)

• Breaking Information Monopolies

• Testbed for CS Models, Algorithms, Technologies and Experimental Platform



What Google Can't Do

Killer queries (disregarding NLP QA, multilingual, multimedia):

drama with three women making a prophecy to a British nobleman that he will become king



Outline

• Introduction: Vision, Demo

• Efficient Top-k Search

• Ontology-based Query Expansion

• Exploiting User Behavior

• Isolating Selfish Peers







Efficient Top-k Search

TA: efficient & principled top-k query processing with monotonic score aggregation

Data items: d1, …, dn, each with per-term scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2

Query: q = (t1, t2, t3), k = 1

Index lists (one per term, sorted by descending score):

t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Scan depth 1:

Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:

Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:

Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0

STOP! At scan depth 3 the worst-score of d10 (2.1) is at least the best-score of every other candidate (at most 2.0), so d10 is the top-1 result.

TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):

scan index lists; consider d at pos_i in L_i;
E(d) := E(d) ∪ {i}; high_i := s(t_i,d);
worstscore(d) := aggr{s(t_ν,d) | ν ∈ E(d)};
bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
else if bestscore(d) > min-k then
    cand := cand ∪ {d};
threshold := max{bestscore(d') | d' ∈ cand};
if threshold ≤ min-k then exit;

Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index
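The NRA loop above can be tried out on the three-term example from this slide. The following is only a minimal sketch, assuming sum as the aggregation function and round-robin sorted access; the function name `nra_top_k` and the data layout are illustrative and not taken from the talk. For brevity it recomputes all bounds at every scan depth, where a real implementation would maintain a priority queue, and it uses only the first three entries of each index list.

```python
from itertools import zip_longest


def nra_top_k(index_lists, k):
    """NRA sketch with sum aggregation.

    index_lists: one list per query term of (doc, score) pairs,
    sorted by descending score. Returns (top_k, scan_depth).
    """
    m = len(index_lists)
    seen = {}  # doc -> {list index i: s(t_i, d)}, i.e. E(d) with scores
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    worst, top, depth = {}, [], 0
    for row in zip_longest(*index_lists):  # one sorted access per list
        depth += 1
        for i, entry in enumerate(row):
            if entry is None:  # list i is exhausted
                high[i] = 0.0
                continue
            doc, score = entry
            seen.setdefault(doc, {})[i] = score
            high[i] = score
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        top, cand = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        # sum(high) bounds the score of any document not seen so far.
        threshold = max([best[d] for d in cand] + [sum(high)])
        if threshold <= min_k:
            break
    return [(d, worst[d]) for d in top], depth


# First three positions of the index lists shown on the slide.
lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],  # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],  # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],  # t3
]
result, depth = nra_top_k(lists, k=1)
print(result, depth)
```

On this input the scan stops at depth 3 with d10 as the top-1 result, matching the slide.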



Probabilistic Pruning of Top-k Candidates

TA family of algorithms based on invariant (with sum as aggr):

$$\underbrace{\sum_{i \in E(d)} s_i(d)}_{worstscore(d)} \;\le\; s(d) \;\le\; \underbrace{\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i}_{bestscore(d)}$$

• Add d to top-k result if worstscore(d) > min-k

• Drop d only if bestscore(d) < min-k, otherwise keep in PQ

Often overly conservative (deep scans, high memory for PQ)

[Figure: bestscore(d), worstscore(d), and min-k plotted as score over scan depth; d is dropped from the priority queue]

• Approximate top-k with probabilistic guarantees:

$$p(d) := P\Big[\sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \text{min-}k\Big]$$

discard candidates d from queue if $p(d) \le \varepsilon$

$$E[\text{rel. precision@}k] = 1 - \varepsilon$$

score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution
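Of the score predictors listed, histogram convolution is the easiest to illustrate. The sketch below is an assumption-laden toy, not the predictor from the talk: it assumes that the unseen scores of a candidate are independent, that each list's remaining scores are summarized by a histogram with equal-width buckets (bucket b stands for the score b * width), and that candidates are dropped when the estimated probability of still exceeding min-k is at most a pruning threshold `epsilon`. All function names are hypothetical.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    p[a] is the probability of bucket a, q[b] of bucket b.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out


def exceed_probability(worstscore, unseen_hists, min_k, width):
    """Estimate p(d) = P[worstscore + sum of unseen scores > min_k]."""
    dist = [1.0]  # point mass at 0: nothing added yet
    for hist in unseen_hists:
        dist = convolve(dist, hist)
    return sum(p for b, p in enumerate(dist)
               if worstscore + b * width > min_k)


def should_drop(worstscore, unseen_hists, min_k, epsilon, width):
    """Drop the candidate if it is unlikely to reach the top-k."""
    return exceed_probability(worstscore, unseen_hists, min_k, width) <= epsilon


# Two unseen lists; in each, the missing score is 0.0 or 0.5 with equal odds.
hists = [[0.5, 0.5], [0.5, 0.5]]
p = exceed_probability(worstscore=1.0, unseen_hists=hists, min_k=1.6, width=0.5)
print(p)
```

In this toy setting the candidate can only exceed min-k = 1.6 if both unseen scores are 0.5, so p(d) is 0.25; it would be dropped for epsilon = 0.3 but kept for epsilon = 0.1.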



Experiments with TREC-12 Web-Track Benchmark

on .GOV corpus from TREC-12 Web track: 1.25 million docs (html, pdf, etc.)

50 keyword queries, e.g.:
• "Lewis Clark expedition"
• "juvenile delinquency"
• "legalization Marihuana"
• "air bag safety reducing injuries death facts"

                     TA-sorted    Prob-sorted (smart)
#sorted accesses     2,263,652    527,980
elapsed time [s]     148.7        15.9
max queue size       10849        400
relative precision   1            0.87
rank distance        0            39.5
score error          0            0.031

speedup by factor 10 at high precision/recall (relative to TA-sorted);

aggressive queue management even yields factor 100 at 30-50 % precision/recall







Query Expansion

Threshold-based query expansion: substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) ≥ θ, where θ is a similarity threshold

"Old hat" in IR; highly disputed for danger of topic dilution

Approach to careful expansion:
• determine phrases from query or best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
• if uniquely mapped to one concept then expand with synonyms and weighted hyponyms
• alternatively use statistical learning methods for word sense disambiguation

Problem: choice of threshold
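The threshold problem is easy to see in a few lines. The following sketch assumes a thesaurus stored as a plain dictionary of similarity values (the entries for "performance" reuse the similarities shown later in the talk); the function name `expand` and the data layout are hypothetical.

```python
def expand(term, thesaurus, theta):
    """Return the disjunction (c1 | ... | ck) substituted for ~term.

    thesaurus: term -> {concept: sim(term, concept)}.
    Keeps every concept whose similarity is at least theta.
    """
    related = thesaurus.get(term, {})
    concepts = [(c, s) for c, s in related.items() if s >= theta]
    concepts.sort(key=lambda cs: cs[1], reverse=True)
    return [term] + [c for c, _ in concepts if c != term]


thesaurus = {
    "performance": {
        "response time": 0.7,
        "throughput": 0.6,
        "queueing": 0.3,
        "delay": 0.25,
    }
}
loose = expand("performance", thesaurus, theta=0.3)
strict = expand("performance", thesaurus, theta=0.5)
print(loose)
print(strict)
```

A small change of the threshold changes the number of expansion terms, and with it both the cost of the query and the risk of topic dilution.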



Query Expansion Example

From TREC 2004 Robust Track:

Title: International Organized Crime

Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.

Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}

135530 sorted accesses in 11.073s.

Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...

Excerpts from result documents:

A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ...

Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ...

... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ...



Top-k Query Processing with Query Expansion

consider expandable query "algorithm and ~performance" with score

$$score(d) = \sum_{i \in q} \max_{j \in onto(i)} \{\, sim(i,j) \cdot s_j(d) \,\}$$

dynamic query expansion with incremental on-demand merging of additional index lists

[Figure: a B+ tree index on terms leads to the index lists (doc: score) of "algorithm" and "performance"; a thesaurus / meta-index maps "performance" to expansion terms with similarities response time: 0.7, throughput: 0.6, queueing: 0.3, delay: 0.25, ...; the index lists of the expansion terms are merged on demand]

+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
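One way to picture the incremental merge for a single expandable term is a lazy k-way merge that opens an expansion term's index list only when that list could still contribute the next-best entry. The sketch below is a simplified illustration under stated assumptions, not the algorithm from the talk: scores are assumed to lie in [0, 1], so sim(i,j) bounds every entry of list j; the function name, the `opened` bookkeeping list, and the index-list contents are all made up for the example.

```python
import heapq
from itertools import islice


def incremental_merge(expansions, index_lists, opened=None):
    """Yield (doc, sim * score) in descending order for one expandable term.

    expansions: (term, sim) pairs sorted by descending sim, sim in [0, 1].
    index_lists: term -> list of (doc, score), descending, scores in [0, 1].
    opened: optional list that records which index lists were opened.
    The first time a doc is yielded, its value is max_j sim(i,j) * s_j(d).
    """
    opened = [] if opened is None else opened
    heap, nxt, emitted = [], 0, set()
    while heap or nxt < len(expansions):
        # Open another list only if its upper bound (its sim) beats the head.
        while nxt < len(expansions) and (
                not heap or expansions[nxt][1] > -heap[0][0]):
            term, sim = expansions[nxt]
            nxt += 1
            opened.append(term)
            lst = index_lists.get(term, [])
            if lst:
                heapq.heappush(heap, (-sim * lst[0][1], term, 0, sim))
        if not heap:
            break
        neg, term, pos, sim = heapq.heappop(heap)
        lst = index_lists[term]
        if pos + 1 < len(lst):
            heapq.heappush(heap, (-sim * lst[pos + 1][1], term, pos + 1, sim))
        doc = lst[pos][0]
        if doc not in emitted:  # later occurrences have lower scores
            emitted.add(doc)
            yield doc, -neg


# Illustrative data only; doc ids and scores are made up for this example.
expansions = [("performance", 1.0), ("response time", 0.7),
              ("throughput", 0.6), ("queueing", 0.3)]
index_lists = {
    "performance": [(52, 0.4), (33, 0.3), (75, 0.3)],
    "response time": [(37, 0.9), (44, 0.8)],
    "throughput": [(92, 0.9), (67, 0.9)],
    "queueing": [(5, 1.0)],
}
opened = []
first3 = list(islice(incremental_merge(expansions, index_lists, opened), 3))
print(first3, opened)
```

In this run the three best entries are produced without ever opening the "queueing" list, which is the effect the slide attributes to on-demand merging: low-similarity expansions cost nothing unless the scan gets deep enough to need them.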



Experiments with TREC-13 Robust-Track Benchmark

on Aquaint corpus (news articles): 528 000 docs, 2 GB raw data, 8 GB for all indexes

with Okapi BM25 probabilistic scoring model

50 most difficult queries, e.g.:
• "transportation tunnel disasters", potentially expanded into: "earthquake, flood, wind, seismology, accident, car, auto, train, ..."
• "Hubble telescope achievements", potentially expanded into: "astronomical, electromagnetic radiation, cosmic source, nebulae, ..."

(θ: expansion threshold, ε: pruning threshold)

                  no exp.     static exp.      static exp.      incr. merge
                  (ε=0.1)     (θ=0.3, ε=0.0)   (θ=0.3, ε=0.1)   (ε=0.1)
#sorted acc.      1,333,756   10,586,175       3,622,686        5,671,493
#random acc.      0           555,176          49,783           34,895
elapsed time [s]  9.3         156.6            79.6             43.8
max #terms        4           59               59               59
relative prec.    0.934       1.0              0.541            0.786
precision@10      0.248       0.286            0.238            0.298
MAP               0.091       0.111            0.086            0.110

speedup by factor 4 at high precision/recall;
no topic drift, no need for threshold tuning;
also handles TREC-13 Terabyte benchmark







Exploiting User Behavior

Exploiting Query Logs and Click Streams

from PageRank: uniformly random choice of links + random jumps

$$PR(q) = \varepsilon \cdot j(q) + (1-\varepsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$$

Authority (page q) = stationary prob. of visiting q

to QRank:
+ query-doc transitions
+ query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics

$$QR(q) = \varepsilon \cdot j(q) + (1-\varepsilon) \cdot \Big[ \beta \cdot \sum_{p \in explicitIN(q)} PR(p) \cdot t(p,q) + (1-\beta) \cdot \sum_{p \in implicitIN(q)} PR(p) \cdot sim(p,q) \Big]$$

Here ε is the random-jump probability and β weights explicit links against implicit links.
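Both authority measures are stationary distributions of a random walk, so a few rounds of power iteration are enough to experiment with them. The sketch below makes several assumptions that are not in the talk: a uniform random-jump target j(q) = 1/n, a mixing weight `beta` between explicit and implicit transitions, and re-normalization of each mixed row so that it remains a probability distribution. The function names and the three-node graph are purely illustrative.

```python
def mix_transitions(explicit, implicit, beta):
    """Combine explicit link transitions and implicit (similarity) links.

    explicit, implicit: node -> {successor: weight}.
    Each mixed row is re-normalized to sum to 1 (an assumption of this sketch).
    """
    mixed = {}
    for p in set(explicit) | set(implicit):
        row = {}
        for q, t in explicit.get(p, {}).items():
            row[q] = row.get(q, 0.0) + beta * t
        for q, s in implicit.get(p, {}).items():
            row[q] = row.get(q, 0.0) + (1.0 - beta) * s
        total = sum(row.values())
        if total > 0:
            mixed[p] = {q: w / total for q, w in row.items()}
    return mixed


def stationary_rank(nodes, trans, eps=0.15, iters=100):
    """Power iteration for r(q) = eps * j(q) + (1 - eps) * sum_p r(p) * t(p, q)."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: eps / n for v in nodes}  # uniform random jump
        for p in nodes:
            out = trans.get(p, {})
            if not out:  # dangling node: spread its mass uniformly
                for v in nodes:
                    new[v] += (1 - eps) * rank[p] / n
            else:
                for q, t in out.items():
                    new[q] += (1 - eps) * rank[p] * t
        rank = new
    return rank


nodes = ["a", "b", "c"]
explicit = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
implicit = {"a": {"c": 1.0}}  # e.g. a doc-doc similarity link
pr = stationary_rank(nodes, mix_transitions(explicit, {}, beta=1.0))
qr = stationary_rank(nodes, mix_transitions(explicit, implicit, beta=0.5))
print(pr, qr)
```

Node c has no explicit in-links, so under the PageRank-style walk it only receives random-jump mass; the implicit link from a raises its rank in the mixed walk.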




Preliminary Experiments

Setup:
• 70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
• ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
• ca. 15 000 implicit links based on doc-doc similarity

Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%

Untrained example query "philosophy":

Rank  PageRank                   QRank
1.    Philosophy                 Philosophy
2.    GNU free doc. license      GNU free doc. license
3.    Free software foundation   Early modern philosophy
4.    Richard Stallman           Mysticism
5.    Debian                     Aristotle






Self-Organization for Isolating Selfish Peers

Collaborative P2P Search

• query peer P0 with local index X0 and bookmarks B0

• peer lists (directory):
  term a: 17, 11, 92, ...
  term c: 13, 92, 45, ...
  term f: 43, 65, 92, ...
  term g: 13, 11, 45, ...
  url x: 37, 44, 12, ...
  url y: 75, 43, 12, ...
  url z: 54, 128, 7, ...

Susceptible to misbehavior! How do we identify and penalize or isolate selfish/malicious peers?





Rationale:
• mimic evolution in biological / social networks
• tag selfish vs. altruistic peers and bias interactions towards similar peers

Algorithm:
periodically do
    each peer compares its "utility" with a random peer
    if the other peer has higher utility then
        copy that peer's strategy and links (reproduction)
        mutate with small probability: change behavior, change links
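One cycle of the algorithm above can be sketched as follows. This is a toy rendering under assumptions that the slide leaves open: a peer's strategy is a single selfishness value `p` in [0, 1], mutation is applied only right after reproduction, a copying peer also links to the peer it copied from, and "change links" means rewiring to one random peer. The function name and data layout are hypothetical.

```python
import random


def evolve_cycle(peers, utility, rng, mutation_rate=0.01):
    """One reproduction/mutation cycle over all peers.

    peers: id -> {"p": selfishness in [0, 1], "links": set of peer ids}
    utility: id -> utility measured over the last period (e.g. number of hits)
    """
    ids = list(peers)
    if len(ids) < 2:
        return peers
    for me in ids:
        other = rng.choice([x for x in ids if x != me])
        if utility[other] > utility[me]:
            # Reproduction: copy strategy and links of the better peer.
            # Linking to the copied peer itself is an assumption of this sketch.
            peers[me]["p"] = peers[other]["p"]
            peers[me]["links"] = (set(peers[other]["links"]) | {other}) - {me}
            # Mutation with small probability: change behavior, change links.
            if rng.random() < mutation_rate:
                peers[me]["p"] = rng.random()
            if rng.random() < mutation_rate:
                peers[me]["links"] = {rng.choice([x for x in ids if x != me])}
    return peers


peers = {
    "A": {"p": 1.0, "links": set()},   # selfish peer
    "B": {"p": 0.0, "links": {"A"}},   # altruistic peer
}
utility = {"A": 1, "B": 5}
evolve_cycle(peers, utility, random.Random(0), mutation_rate=0.0)
print(peers)
```

With mutation switched off, the selfish peer A adopts the strategy of the more successful peer B and attaches itself to B's neighborhood, which is the mechanism that biases interactions towards similar peers.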




Simulation Results for P2P File Sharing

typical run for 10^4 peers

• peers generate queries and answer queries based on P ∈ [0,1] with extreme behaviors: selfish P = 1.0 and altruistic P = 0.0
• peer utility = # hits (queries answered)
• mutation: change P randomly

[Figure: average per node of queries generated and hits (y-axis, 0 to 60) over cycles (x-axis, 0 to 100)]

Selfishness reduces

Average performance increases



The End

Thank you!
