Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)

Ryan Huebsch
Joe Hellerstein, Nick Lanham, Boon Thau Loo, Timothy Roscoe, Scott Shenker, Ion Stoica

[email protected] Berkeley, CS Division

Intel Berkeley Research 4/14/03

2

Outline
- Motivation
- General Architecture
- Brief look at the Algorithms
- Potential Applications
- Current Status
- Future Research
- Conclusion

3

Information in the Wide Area
- The Internet enables massive information collection and dissemination
- Lots of data is naturally distributed
  - Files – MP3s (Napster, Kazaa, AudioGalaxy)
  - Logs – software, network, virus, usage
  - Instant Messaging/Blogs – (AIM, LiveJournal)
  - Sensor data – (TinyDB, IrisNet)
- It is hard to find exactly what you want
- We need Internet-scale query processors to find the data the user wants

4

What is an Internet Scale Query Processor?
- Environment/Requirements
  - Thousands to millions of nodes
  - Set of nodes constantly changing
  - Streaming data, streaming queries
  - In situ data processing
  - Easy to use; the set of queries is not static
  - No central authority or administrator
- So what about Oracle, DB2, SQL Server?
- What about Distributed Hash Tables (DHTs)?
- Can these solve the problem?

5

Databases are Cool, but…
- Relational databases bring a declarative interface to the user/application.
  - Ask for what you want, not how to get it
- Database community is not new to parallel and distributed systems
  - Parallel: Centralized, one administrator, one point of failure
  - Distributed: Did not catch on, complicated, never really scaled above hundreds of machines
- “VLDB” currently means hundreds of machines
- Internet needs many more

6

P2P DHTs are Cool, but…
- Lots of effort is put into making DHTs
  - Decentralized Control
  - Scalable (thousands to millions of nodes)
  - Reliable (every imaginable failure)
  - Security (anonymity, encryption, etc.)
  - Efficient (fast access with minimal state)
  - Load balancing, and others
- Still only a hash table interface, put and get
- Hard (but not impossible) to build real applications using only the basic primitives

7

Databases + P2P DHTs: Marriage Made in Heaven?
- Well, databases carry a lot of other (expensive) baggage
  - ACID transactions
  - Consistency above all else
- So we just want to unite the query processor with DHTs
  - DHTs + Relational Query Processing = PIER
- Bring complex queries to DHTs -> foundation for real applications

8


9

Architecture
- DHT is divided into 3 modules
  - We’ve chosen one way to do this, but this may change with time and experience
  - Goal is to make each simple and replaceable
- PIER has one primary module
  - Add-ons can make it look more database like.

[Figure: layered architecture diagram.
  Apps: Various User Applications
  PIER: Core Relational Execution Engine, Catalog Manager, Query Optimizer
  DHT: Provider, Storage Manager, Overlay Routing]

10

Architecture: DHT: Routing
- Very simple interface
- Plug in any routing algorithm here: CAN, Chord, Pastry, Tapestry, Kademlia, SkipNet, Viceroy, etc.
Overlay Routing interface:
  lookup(key) -> ipaddr
  join(landmarkNode)
  leave()
  CALLBACK: locationMapChange()

11

Architecture: DHT: Routing
- Very simple interface
- Plug in any routing algorithm here: CAN, Chord, Pastry, Tapestry, Kademlia, SkipNet, Viceroy, etc.
Overlay Routing interface, Version 2 of API (in progress):
  lookup(key) -> ipaddr
  route(key, msg)
  join(landmarkNode)
  leave()
  CALLBACK: locationMapChange()
  CALLBACK: inRoute(key, msg)
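To make the routing interface concrete, here is a minimal sketch of a module that could sit behind it, assuming a Chord-like identifier ring where each key is owned by its successor node. The class name `RingRouting`, the SHA-1 helper `_h`, the 32-bit ring size, and the global view of all nodes (no finger tables, no landmark node) are illustrative assumptions, not PIER's implementation. For simplicity `join` and `leave` take the address of the node that joins or leaves.

```python
import hashlib
from bisect import bisect_left


def _h(value: str) -> int:
    """Hash a string onto a 32-bit identifier ring (assumed ring size)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** 32)


class RingRouting:
    """Toy overlay routing: every key is owned by its successor on the ring."""

    def __init__(self, on_location_map_change=None):
        self._ids = []      # sorted node identifiers
        self._addr = {}     # node identifier -> ip address
        self._callback = on_location_map_change

    def join(self, ipaddr: str) -> None:
        node_id = _h(ipaddr)
        if node_id not in self._addr:
            self._ids.insert(bisect_left(self._ids, node_id), node_id)
            self._addr[node_id] = ipaddr
            self._notify()

    def leave(self, ipaddr: str) -> None:
        node_id = _h(ipaddr)
        if node_id in self._addr:
            self._ids.remove(node_id)
            del self._addr[node_id]
            self._notify()

    def lookup(self, key: str) -> str:
        """Return the address of the node responsible for key."""
        if not self._ids:
            raise LookupError("no nodes have joined")
        # First node id at or after the key's hash, wrapping around the ring.
        i = bisect_left(self._ids, _h(key)) % len(self._ids)
        return self._addr[self._ids[i]]

    def _notify(self) -> None:
        if self._callback:
            self._callback()    # stands in for CALLBACK: locationMapChange()
```

Any other routing algorithm could replace the body of `lookup` without changing its callers, which is the point of keeping the interface this small.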

12

Architecture: DHT: Storage
- Currently we use a simple in-memory storage system; there is no reason a more complex one couldn’t be used
Storage Manager interface:
  store(key, item)
  retrieve(key) -> item
  remove(key)
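A simple in-memory storage manager with this interface could be sketched as follows. The class name `InMemoryStorage` and the choice to keep a list of items per key (so that `retrieve` returns a list) are assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryStorage:
    """Toy storage manager: a multimap from key to the items stored under it."""

    def __init__(self):
        self._items = defaultdict(list)

    def store(self, key, item) -> None:
        self._items[key].append(item)

    def retrieve(self, key) -> list:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._items.get(key, []))

    def remove(self, key) -> None:
        self._items.pop(key, None)
```

A disk-backed store could be swapped in behind the same three calls.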

13

Architecture: DHT: Provider
- Connects the pieces, and provides the ‘DHT’ interface
Provider interface:
  get(ns, rid) -> item
  put(ns, rid, iid, item, lifetime)
  renew(ns, rid, iid, lifetime) -> success?
  multicast(ns, item)
  unicast(ns, rid, item)
  lscan(ns) -> items
  CALLBACK: newData(ns, item)

14

Architecture: DHT: Provider
- Connects the pieces, and provides the ‘DHT’ interface
Provider interface, Version 2 of API (in progress):
  get(ns, rid) -> item
  put(ns, rid, iid, item, lifetime)
  renew(ns, rid, iid, lifetime) -> success?
  multicast(ns, item)
  unicast(ns, rid, item, route?)
  lscan(ns) -> items
  CALLBACK: newData(ns, item)
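The provider calls can be illustrated with a single-process stand-in. This sketch assumes that `ns`, `rid`, and `iid` stand for namespace, resource ID, and instance ID, and that `lifetime` makes stored items soft state that expires unless renewed; the slides do not spell this out. `ToyProvider`, its `now` clock, and `on_new_data` are hypothetical names, and `multicast`/`unicast` are left out because there is no network here.

```python
class ToyProvider:
    """Single-process stand-in for the provider; no network, one 'site'."""

    def __init__(self):
        self._store = {}        # (ns, rid, iid) -> (item, expiry time)
        self._listeners = {}    # ns -> list of newData callbacks
        self.now = 0.0          # simulated clock, advanced by the caller

    def put(self, ns, rid, iid, item, lifetime):
        self._store[(ns, rid, iid)] = (item, self.now + lifetime)
        for callback in self._listeners.get(ns, []):
            callback(ns, item)              # CALLBACK: newData(ns, item)

    def renew(self, ns, rid, iid, lifetime) -> bool:
        entry = self._live((ns, rid, iid))
        if entry is None:
            return False                    # expired or never stored
        self._store[(ns, rid, iid)] = (entry[0], self.now + lifetime)
        return True

    def get(self, ns, rid) -> list:
        return [self._store[k][0] for k in list(self._store)
                if k[:2] == (ns, rid) and self._live(k)]

    def lscan(self, ns) -> list:
        return [self._store[k][0] for k in list(self._store)
                if k[0] == ns and self._live(k)]

    def on_new_data(self, ns, callback):
        self._listeners.setdefault(ns, []).append(callback)

    def _live(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] <= self.now:
            del self._store[key]            # soft state: drop expired items
            entry = None
        return entry
```

In a real deployment `put` and `get` would first use the routing layer to find the responsible node, and `lscan` would only see the items stored at the local site.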

15

Architecture: PIER
- Currently, consists only of the relational execution engine
  - Executes a pre-optimized query plan
  - Query plan is a box-and-arrow description of how to connect basic operators together
    - selection, projection, join, group-by/aggregation, and some DHT specific operators such as rehash
- Traditional DBs use an optimizer + catalog to take SQL and generate the query plan
  - Those are “just” add-ons to PIER -> Research!
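As a rough illustration of a box-and-arrow plan, the sketch below models each operator as a Python generator (a box) and each nested call as an arrow. The operator functions and the sample `packets` data are hypothetical; PIER's actual operators are Java classes wired together by a query plan, which this does not reproduce.

```python
def scan(rows):
    """Source box: emit each stored tuple (here, a dict)."""
    yield from rows


def select(predicate, source):
    """Selection box: pass through only tuples that satisfy predicate."""
    return (row for row in source if predicate(row))


def project(columns, source):
    """Projection box: keep only the named columns."""
    return ({c: row[c] for c in columns} for row in source)


# Hypothetical data; the arrows of the plan are the nested calls.
packets = [
    {"src": "10.0.0.1", "port": 22, "bytes": 512},
    {"src": "10.0.0.2", "port": 80, "bytes": 2048},
    {"src": "10.0.0.3", "port": 22, "bytes": 128},
]
plan = project(["src"], select(lambda r: r["port"] == 22, scan(packets)))
result = list(plan)
```

Because the plan is handed to the engine already built, nothing in this style requires an optimizer or a catalog.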

16


17

Joins: The Core of Query Processing
- A relational join can be used to calculate:
  - The intersection of two sets
  - Correlate information
  - Find matching data
- Goal:
  - Get tuples that have the same value for a particular attribute(s) (the join attribute(s)) to the same site, then append tuples together.
- Algorithms come from existing database literature, minor adaptations to use DHT.

18

Joins: Symmetric Hash Join (SHJ)
- Algorithm for each site
  - (Scan) Use two lscan calls to retrieve all data stored at that site from the source tables
  - (Rehash) put a copy of each eligible tuple with the hash key based on the value of the join attribute
  - (Listen) use newData to see the rehashed tuples
  - (Compute) Run standard one-site join algorithm on the tuples as they arrive
- Scan/Rehash steps must be run on all sites that store source data
- Listen/Compute steps can be run on fewer nodes by choosing the hash key differently
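The four SHJ steps can be simulated in one process. This is a sketch under stated assumptions: sites are list entries, "rehash" is a CRC32 of the join value modulo the number of sites, and the `new_data` function plays the role of the newData callback. None of these names come from PIER.

```python
from collections import defaultdict
from zlib import crc32


def symmetric_hash_join(sites, attr):
    """sites: list of (r_tuples, s_tuples) per site; tuples are dicts."""
    n = len(sites)
    # One pair of hash tables per site that listens for rehashed tuples.
    tables = [{"R": defaultdict(list), "S": defaultdict(list)}
              for _ in range(n)]
    results = []

    def new_data(dest, table, row):
        """Listen + Compute: insert into own table, probe the other one."""
        other = "S" if table == "R" else "R"
        key = row[attr]
        tables[dest][table][key].append(row)
        for match in tables[dest][other][key]:
            r, s = (row, match) if table == "R" else (match, row)
            results.append({**r, **s})

    # Scan + Rehash: every site sends each local tuple to the site
    # chosen by hashing the value of the join attribute.
    for r_tuples, s_tuples in sites:
        for table, rows in (("R", r_tuples), ("S", s_tuples)):
            for row in rows:
                dest = crc32(str(row[attr]).encode()) % n
                new_data(dest, table, row)
    return results
```

Each matching pair is emitted exactly once, when the later of the two tuples arrives, so results can stream out before either input is complete.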

19

Joins: Fetch Matches (FM)
- Algorithm for each site
  - (Scan) Use lscan to retrieve all data from ONE table
  - (Get) Based on the value for the join attribute, issue a get for the possible matching tuples from the other table
- Note, one table (the one we issue the gets for) must already be hashed/stored on the join attribute
- Big picture:
  - SHJ is put based
  - FM is get based
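Fetch Matches can be sketched in a few lines. Here `dht_get` is a hypothetical stand-in for `get(ns, rid)` on the table that is already stored by the join attribute, and the sample tables are made up for illustration.

```python
def fetch_matches(local_r, dht_get, attr):
    """Join local R tuples against S, which is already stored by attr.

    dht_get(value) is a stand-in for get(ns, rid): it returns the S tuples
    whose join attribute equals value.
    """
    results = []
    for r in local_r:                      # (Scan) tuples from ONE table
        for s in dht_get(r[attr]):         # (Get) possible matches
            results.append({**r, **s})
    return results


# Hypothetical S table, pre-hashed on the join attribute "host".
s_by_host = {
    "h1": [{"host": "h1", "owner": "alice"}],
    "h2": [{"host": "h2", "owner": "bob"}],
}
alerts = [{"host": "h2", "alert": "scan"}, {"host": "h9", "alert": "ddos"}]
joined = fetch_matches(alerts, lambda v: s_by_host.get(v, []), "host")
```

Only the scanned table generates traffic, one get per tuple, which is why FM suits cases where one table is already stored on the join attribute.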

20

Joins: Additional Strategies
- Bloom Filters
  - Bloom filters can be used to reduce the amount of data rehashed in the SHJ
- Symmetric Semi-Join
  - Run a SHJ on the source data projected to only have the hash key and join attributes.
  - Use the results of this mini-join as source for two FM joins to retrieve the other attributes for tuples that are likely to be in the answer set
- Big Picture:
  - Trade off bandwidth (extra rehashing) for latency (time to exchange filters)
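The Bloom filter idea can be sketched as follows: each table summarizes its join values in a filter, and a site rehashes only the tuples whose join value might appear in the other table's filter. The filter size, the number of hash functions, and the function names are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # Python int used as a bit array

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def tuples_worth_rehashing(local_tuples, attr, other_table_filter):
    """Keep only tuples whose join value might appear in the other table."""
    return [t for t in local_tuples if t[attr] in other_table_filter]
```

A false positive only costs one unnecessary rehash; a tuple that does have a match is never dropped.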

21

Naïve Group-By/Aggregation
- A group-by/aggregation can be used to calculate:
  - Split data into groups based on value
  - Max, Min, Sum, Count, etc.
- Goal:
  - Get tuples that have the same value for a particular attribute(s) (group-by attribute(s)) to the same site, then summarize data (aggregation).

22

Naïve Group-By/Aggregation
- At each site
  - (Scan) lscan the source table
    - Determine which group the tuple belongs in
    - Add tuple’s data to that group’s partial summary
  - (Rehash) for each group represented at the site, rehash the summary tuple with hash key based on group-by attribute
  - (Combine) use newData to get partial summaries, combine and produce final result after specified time, number of partial results, or rate of input
- Hierarchical Aggregation: Can add multiple layers of rehash/combine to reduce fan-in.
  - Subdivide groups into subgroups by randomly appending a number to the group’s key
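The scan and combine steps can be sketched with partial (count, sum) summaries. This single-process version assumes the rehash step has already delivered every site's partial summaries to one place; in the scheme above each group's summaries would instead arrive at the site chosen by hashing the group-by attribute. Function names and sample data are hypothetical.

```python
from collections import defaultdict


def local_partials(rows, group_attr, value_attr):
    """(Scan) Fold local tuples into one (count, sum) summary per group."""
    partial = defaultdict(lambda: (0, 0))
    for row in rows:
        count, total = partial[row[group_attr]]
        partial[row[group_attr]] = (count + 1, total + row[value_attr])
    return dict(partial)


def combine(partials):
    """(Combine) Merge the partial summaries received from every site."""
    final = defaultdict(lambda: (0, 0))
    for partial in partials:
        for group, (count, total) in partial.items():
            c, t = final[group]
            final[group] = (c + count, t + total)
    return dict(final)


# Hypothetical data held at two sites.
site_a = [{"port": 22, "bytes": 10}, {"port": 80, "bytes": 5}]
site_b = [{"port": 22, "bytes": 30}]
totals = combine([
    local_partials(site_a, "port", "bytes"),
    local_partials(site_b, "port", "bytes"),
])
```

Sending one summary per group instead of every tuple is what keeps the rehash traffic small; averages fall out of the (count, sum) pair at the end.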

23

Naïve Group-By/Aggregation
[Figure: Application and Overlay views, each showing Sources sending to a Root.]
- Each message may take multiple hops
- Each level: fewer nodes participate

24

Smarter Aggregation
- Naïve method has multiple hops in overlay network between each level
- Idea: Aggregate along the overlay path from the source to the root
  - Every node with data is a leaf
  - Aggregate local data and route towards a predetermined node
  - Nodes along the path (who may also be leaves) intercept these messages (via new API callback)
  - These nodes wait for a period of time, aggregating all the messages they see, then continue routing towards the root
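A small simulation can show the effect of combining along the overlay path. It assumes the routes toward the root form a tree given as a `next_hop` map, and it idealizes the waiting period: every node forwards only after hearing from everything routed through it. The function name and this model are assumptions, not the PIER design.

```python
def aggregate_along_paths(next_hop, local_value, root):
    """Sum local values at the root, combining messages at each hop.

    next_hop maps every non-root node to its next overlay hop toward root;
    local_value maps each node holding data to its local partial sum.
    Returns (total at root, number of overlay messages sent).
    """
    def depth(node):
        d = 0
        while node != root:
            node, d = next_hop[node], d + 1
        return d

    pending = dict(local_value)            # partial sum waiting at each node
    messages = 0
    # Deepest nodes send first, so each node forwards one combined message
    # after hearing from everything routed through it.
    for node in sorted(next_hop, key=depth, reverse=True):
        if node in pending:
            parent = next_hop[node]
            pending[parent] = pending.get(parent, 0) + pending.pop(node)
            messages += 1
    return pending.get(root, 0), messages
```

For the routes `a -> c`, `b -> c`, `c -> root`, `d -> root` with data at `a`, `b`, and `d`, this sends 4 messages, whereas sending each source's value separately to the root would cross 5 overlay hops.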

25

Smarter Aggregation
[Figure: Application and Overlay views, each showing Sources sending to a Root.]
- Send message to root
- Along the overlay route, combine messages

26

Smarter Aggregation
- Sounds like TAG? Are there ideas we can borrow?
- Details
  - How to choose the root
  - When do nodes send their data
    - Long wait: Fewer messages, Slow response
    - Short wait: More messages, Fast response
  - How do the different DHTs perform
    - Chord: Binomial Tree
    - CAN: k-ary Tree
  - For reliability, what happens when we choose multiple roots, are the trees diverse enough?
- Lots of questions… few answers right now

27


28

Going from a DHT Query Processor -> Application
- Correlation, Intersection -> Joins
- Summarize, Aggregation, Compress -> Group-By/Aggregation
- Probably not as efficient as custom designed solution for a single particular problem
- Common infrastructure for fast application development/deployment

29

Network Monitoring (Nick, Ryan, Timothy, Brent)
- Lots of data, naturally distributed, almost always summarized -> aggregation
- Intrusion Detection usually involves correlating information from multiple sites -> join
- Data comes from many sources
  - nmap, snort, ganglia, firewalls, web logs, etc.
- PlanetLab is our natural test bed
- Current Status
  - Connecting PIER to the various data sources
  - Determining what are the interesting queries

30

Enhanced File Searching (Boon)
- First step: Take over Gnutella
  - Just make PlanetLab look like an ultraPeer(s) on the outside, but run PIER on the inside
  - Queries that intersect document lists -> Join
  - For efficiency use Bloom Filters -> Bloom Join
- Objectives
  - Generate (positive?) publicity for PlanetLab
  - Generate interesting workloads that will stress PIER (and DHTs)
  - Study the effectiveness of ultrapeers on recall and bandwidth reduction
  - Determine whether a PIER/Gnutella hybrid network would scale while maintaining sufficiently high recall of results

31

Enhanced File Searching (Boon)
- Long term: Value added services
  - Better searching, utilize all of the MP3 ID tags
  - Reputations
  - Combine with network monitoring data to better estimate download times
- Current status
  - Currently able to run Gnutella traces using PIER over Millennium and PlanetLab.
  - Next step is to integrate with LimeWire, the popular open-source Gnutella client.
  - Expect a "live" system by end of summer

32

Network Services
- i3-style Network services
  - Mobility and Multicast
    - Sender is a publisher
    - Receiver(s) issue a continuous query looking for new data
  - Service Composition
    - Services issue a continuous query for data looking to be processed
    - After processing data, they publish it back into the network
- Looking forward:
  - More complex services
  - Composition of e-service providers, e.g. via SOAP
  - Dataflow programs in the wide area

33


34

Codebase
- Approximately 17,600 lines of NCSS Java Code
- Same code (overlay components/pier) run on the simulator or over a real network without changes
- Runs simple simulations with up to 10k nodes
  - Limiting factor: 2GB addressable memory for the JVM (in Linux)
- Runs on Millennium and PlanetLab up to 64 nodes
  - Limiting factor: Available/working nodes & setup time
- Code:
  - Basic implementations of Chord and CAN
  - Selection, projection, joins (4 methods), and naïve aggregation.
  - Non-continuous queries

35

Seems to scale
Simulations of 1 SHJ Join
[Figure: simulation results; labels: Warehousing, Full Parallelization.]

36

Some real-world results
1 SHJ Join on Millennium Cluster

37

Codebase (The Other Story)
- It’s Java!
  - Network is slow
    - C program can get 340Mbps between two Millennium nodes
    - Java (nio library) gets about 100Mbps without serialization
    - With Serialization about 5Mbps!
  - Memory Management
    - Use object pools to save on object creation/garbage collection
  - Allowed fast prototyping, “lessons” were postponed
- Code was written to generate graphs for papers
- To make it work for REAL people
  - Error Handling
  - User interfaces
  - Use real data, real queries

38


39

Future Research
- Most of the current work has been developing the infrastructure
- Enables the juicy research:
  - Smart Hierarchical Aggregations
  - Query Optimization (adaptive?) and catalogs
  - Improved resilience (answer quality)
  - Routing, Storage and Layering
  - Range Predicates
  - Continuous Queries over Streams
  - Semi-structured Data (XML)
- Applications, Applications, Applications…

40


41

Conclusion
- Distributed data needs a distributed query processor
- DHTs too simple, databases too complex
  - PIER occupies a point in the middle of this new design space
- Infrastructure for real applications
- Current algorithms are simple, let’s see how far that goes
  - For real efficiency and reliability more research needed into algorithms
- Still some work needed to get things running in the real world
