
Approved for Public Release, Distribution Unlimited

QuickSilver: Middleware for Scalable Self-Regenerative Systems

Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels

Raytheon Corporation: Lou DiPalma, Paul Work

Our topic

Computing systems are growing larger and more complex, and we hope to use them in an increasingly "unattended" manner.
But the technology for managing growth and complexity is lagging.

Our goal

Build a new platform in support of massively scalable, self-regenerative applications

Demonstrate it by offering a specific military application interface

Work with Raytheon to apply in other military settings

Representative scenarios

Massive data centers maintained by the military (or by companies like Amazon)

Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)

Deployments of large numbers of lightweight sensors

New network architectures to control autonomous vehicles over media shared with other “mundane” applications

How to approach the problem?

The Web Services architecture has emerged as a likely standard for large systems

But WS is "document oriented" and lacks:
High availability (or any kind of quick-response guarantee)
A convincing scalability story
Self-monitoring/adaptation features

Signs of trouble?

Most technologies are pushed far beyond their normal scalability limits in this kind of center: we are "good" at small clusters but not huge ones

Pub-sub was a big hit. No longer…
Curious side-bar: it is used heavily for point-to-point communication! (Why?)
Extremely hard to diagnose problems

We lack the right tools!

Today, our applications navigate in the dark:
They lack a way to find things
They lack a way to sense system state
There are no rules for adaptation, if/when needed

In effect: we are starting to build very big systems, yet doing so in the usual client-server manner

This denies applications any information about system state, configuration, loads, etc.

QuickSilver

QuickSilver: a platform to help developers build these massive new systems

It has four major components:
Astrolabe: a novel kind of "virtual database"
Bimodal Multicast: for faster "few to many" data transfer patterns
Kelips: a fast "lookup" mechanism
Group replication technologies based on virtual synchrony or other similar models

QuickSilver Architecture

[Architecture diagram: pub-sub (JMS, JBI) and native APIs; massively scalable group communication; composable microprotocol stacks; monitoring and indexing; a message repository; overlay networks; distributed query and event detection. Astrolabe's role is to collect and report system state, which is used for many purposes, including self-configuration and repair.]

What does Astrolabe do?

Astrolabe's role is to track information residing at a vast number of sources
Structured to look like a database
Approach: "peer-to-peer gossip"

Basically, each machine has a piece of a jigsaw puzzle. Assemble it on the fly.

Astrolabe in a single domain

Name      Load  Weblogic?  SMTP?  Word Version
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

A row can have many columns
Total size should be kilobytes, not megabytes
The configuration certificate determines what data is pulled into the table (and can change)

So how does it work?

Each computer has:
Its own row
Replicas of some objects (configuration certificate, other rows, etc.)

Periodically, at a fixed rate, each machine picks a friend "pseudo-randomly" and the two exchange states efficiently (bounding the size of data exchanged)
States converge exponentially rapidly
Loads are low and constant, and the protocol is robust against all sorts of disruptions!

State Merge: Core of Astrolabe epidemic

Two replicas of the region's table disagree; each machine's own row is freshest in its local copy.

Replica at cardinal.cs.cornell.edu:

Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2003  0.67  0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0

Replica at swift.cs.cornell.edu:

Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2004  4.5   1          0      6.0

State Merge: Core of Astrolabe epidemic

(Same two replicas as above.) During a gossip exchange, each side sends the rows for which it has the freshest timestamp:
swift 2011 2.0 (from swift.cs.cornell.edu)
cardinal 2201 3.5 (from cardinal.cs.cornell.edu)

State Merge: Core of Astrolabe epidemic

After the exchange, both replicas have adopted the fresher swift and cardinal rows (the falcon rows were not exchanged, so they still differ):

Replica at cardinal.cs.cornell.edu:

Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0

Replica at swift.cs.cornell.edu:

Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          0      6.0
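The merge rule in these three slides fits in a few lines of code. Below is a minimal sketch (the function name, dictionary layout, and example data are illustrative, not the real Astrolabe implementation): during a gossip exchange, each replica keeps, for every row, whichever copy carries the larger timestamp.

```python
# Minimal sketch of the Astrolabe state merge (illustrative names and layout).
# A replica is modeled as: row name -> (timestamp, attribute dict).

def merge_states(mine: dict, theirs: dict) -> dict:
    """Keep, for each row, whichever copy has the fresher timestamp."""
    merged = dict(mine)
    for name, (their_time, their_attrs) in theirs.items():
        my_time = merged.get(name, (-1, None))[0]
        if their_time > my_time:
            merged[name] = (their_time, their_attrs)
    return merged

# The swift/cardinal exchange from the slides:
swift_view = {
    "swift":    (2011, {"Load": 2.0,  "Weblogic?": 0, "SMTP?": 1, "Word Version": 6.2}),
    "falcon":   (1971, {"Load": 1.5,  "Weblogic?": 1, "SMTP?": 0, "Word Version": 4.1}),
    "cardinal": (2004, {"Load": 4.5,  "Weblogic?": 1, "SMTP?": 0, "Word Version": 6.0}),
}
cardinal_view = {
    "swift":    (2003, {"Load": 0.67, "Weblogic?": 0, "SMTP?": 1, "Word Version": 6.2}),
    "falcon":   (1976, {"Load": 2.7,  "Weblogic?": 1, "SMTP?": 0, "Word Version": 4.1}),
    "cardinal": (2201, {"Load": 3.5,  "Weblogic?": 1, "SMTP?": 1, "Word Version": 6.0}),
}

swift_view = merge_states(swift_view, cardinal_view)
print(swift_view["swift"][0], swift_view["cardinal"][0])   # 2011 2201
```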

Observations

The merge protocol has constant cost:
One message sent and one received (on average) per unit time
The data changes slowly, so there is no need to run the protocol quickly; we usually run it every five seconds or so

Information spreads in O(log N) time
But this assumes bounded region size
In Astrolabe, we limit regions to 50-100 rows
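The O(log N) claim is the standard epidemic-spreading argument. As a rough sketch (idealized push-style gossip, one exchange per member per round, ignoring duplicate contacts):

```latex
% The number of replicas holding a fresh value roughly doubles each round
% until the region saturates:
I_{t+1} \approx \min\left(2\,I_t,\; N\right), \qquad I_0 = 1
\quad\Longrightarrow\quad
\text{rounds to reach all } N \text{ members} \approx \log_2 N .
```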

Scaling up… and up…

With a stack of domains, we don't want every system to "see" every domain
The cost would be huge

So instead, we'll see a summary

[Figure: the same leaf table (swift, falcon, cardinal rows with Time, Load, Weblogic?, SMTP?, Word Version) replicated on many machines, such as cardinal.cs.cornell.edu.]

July 21, 2004 19Approved for Public Release, Distribution Unlimited

Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers

Name Load Weblogic? SMTP? Word Version

swift 2.0 0 1 6.2

falcon 1.5 1 0 4.1

cardinal 4.5 1 0 6.0

Name Load Weblogic? SMTP? Word Version

gazelle 1.7 0 0 4.5

zebra 3.2 0 1 6.2

gnu .5 1 0 6.2

Name Avg Load

WL contact SMTP contact

SF 2.6 123.45.61.3

123.45.61.17

NJ 1.8 127.16.77.6

127.16.77.11

Paris 3.1 14.66.71.8 14.66.71.12

San Francisco

New Jersey

SQL query “summarizes”

data

Dynamically changing query output is visible system-wide
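As an illustration of what such a summary computes, here is a hypothetical leaf-level aggregation in Python (the real mechanism is an SQL aggregation query installed through the configuration certificate; the machine addresses and the helper function below are invented for the example):

```python
# Hypothetical sketch of a leaf-domain aggregation (not the real Astrolabe API).

def summarize(domain: str, rows: list[dict]) -> dict:
    """Condense one leaf domain's rows into a single summary row."""
    avg_load = round(sum(r["Load"] for r in rows) / len(rows), 1)
    wl_contact = next((r["Addr"] for r in rows if r["Weblogic?"] == 1), None)
    smtp_contact = next((r["Addr"] for r in rows if r["SMTP?"] == 1), None)
    return {"Name": domain, "Avg Load": avg_load,
            "WL contact": wl_contact, "SMTP contact": smtp_contact}

# Illustrative addresses; only the Load/Weblogic?/SMTP? values come from the slide.
sf_rows = [
    {"Name": "swift",    "Addr": "123.45.61.17", "Load": 2.0, "Weblogic?": 0, "SMTP?": 1},
    {"Name": "falcon",   "Addr": "123.45.61.3",  "Load": 1.5, "Weblogic?": 1, "SMTP?": 0},
    {"Name": "cardinal", "Addr": "123.45.61.9",  "Load": 4.5, "Weblogic?": 1, "SMTP?": 0},
]
print(summarize("SF", sf_rows))
# {'Name': 'SF', 'Avg Load': 2.7, 'WL contact': '123.45.61.3', 'SMTP contact': '123.45.61.17'}
```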

(1) The query goes out… (2) Each domain computes locally… (3) Results flow to the top level of the hierarchy

[Figure: the same San Francisco, New Jersey, and summary tables as above, annotated with the three steps: the query fans out (1), each leaf domain computes its part locally (2), and the results flow up to the top of the hierarchy (3).]

Hierarchy is virtual… data is replicated

[Figure: the same San Francisco, New Jersey, and summary tables. The hierarchy exists only virtually: no server holds the inner table; its rows are computed and replicated among the machines below it.]


The key to self-* properties!

A flexible, reprogrammable mechanism:

Which clustered services are experiencing timeouts, and what were they waiting for when they happened?

Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly

Which machines have inventory for warehouse 9?

Where's the cheapest gasoline in the area?

Think of aggregation functions as small agents that look for information.

What about security?

Astrolabe requires:
Read permissions to see the database
Write permissions to contribute data
Administrative permissions to change aggregation or configuration certificates

Users decide what data Astrolabe can see
A VPN setup can be used to hide Astrolabe's internal messages from intruders
New: Byzantine Agreement based on threshold cryptography is used to secure aggregation functions

Data Mining

Quite a hot area, usually done by collecting information at a centralized node and then "querying" within that node.

Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
This is incredibly parallel, hence faster
And more robust against disruption too!

Cool Astrolabe Properties

Parallel. Everyone does a tiny bit of work, so we accomplish huge tasks in seconds.

Flexible. Decentralized query evaluation, in seconds

One aggregate can answer lots of questions. E.g. “where’s the nearest supply shed?” – the hierarchy encodes many answers in one tree!

Aggregation and Hierarchy

Nearby information:
Maintained in more detail; can be queried directly
Changes are seen sooner

Remote information is summarized:
High-quality aggregated data
This also changes as the information evolves

Astrolabe summary

Scalable: could support millions of machines
Flexible: can easily extend the domain hierarchy, define new columns, or eliminate old ones; adapts as conditions evolve
Secure: uses keys for authentication and can even encrypt
Handles firewalls gracefully, including issues of IP address re-use behind firewalls
Performs well: updates propagate in seconds
Cheap to run: tiny load, small memory impact

Bimodal Multicast

A quick glimpse of scalable multicast

Think about really large Internet configurations:
A data center as the data source
A typical "publication" might be going to thousands of client systems

Swiss Stock Exchange

Problem: virtually synchronous multicast is "fragile"

[Figure: a multicast group in which most members are healthy… but one is slow.]

Performance degrades as the system scales up

[Figure: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members versus perturb rate (0 to 0.9) for group sizes 32, 64, and 96. Throughput falls sharply as the perturb rate and group size increase.]

Why doesn't multicast scale?

With weak semantics…
Faulty behavior may occur more often as system size increases (think "the Internet")

With stronger reliability semantics…
We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
That cost can be triggered more often as a function of scale (more failures, more network "events", bigger latencies)

A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)

But none of this is inevitable

Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
It also gives very steady throughput
And it can take advantage of hardware support for multicast, if available

Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.

Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
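A minimal sketch of that digest/solicit/retransmit loop, as an in-process toy model (the class and method names are invented; real Bimodal Multicast also bounds the buffering and retransmission work done per round):

```python
import random

# Toy, in-process model of the gossip repair loop (illustrative names only).

class Endpoint:
    def __init__(self, name):
        self.name = name
        self.peers = []            # other group members
        self.delivered = {}        # msg_id -> payload

    def receive_multicast(self, msg_id, payload):
        # Step 1: the initial, unreliable IP multicast may or may not arrive.
        self.delivered[msg_id] = payload

    def gossip_round(self):
        # Step 2: periodically send a digest (message ids only) to a random peer.
        peer = random.choice(self.peers)
        peer.on_digest(self, set(self.delivered))

    def on_digest(self, sender, their_ids):
        # Step 3: compare the digest with our own history and solicit
        # copies of anything we are missing from the sender.
        missing = their_ids - set(self.delivered)
        if missing:
            sender.on_solicit(self, missing)

    def on_solicit(self, requester, msg_ids):
        # Step 4: retransmit the requested messages.
        for mid in msg_ids:
            if mid in self.delivered:
                requester.receive_multicast(mid, self.delivered[mid])

# Example: b misses the original multicast and repairs it in one gossip round.
a, b = Endpoint("a"), Endpoint("b")
a.peers, b.peers = [b], [a]
a.receive_multicast(1, "quote")
a.gossip_round()
assert b.delivered[1] == "quote"
```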

[Figure: low-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput versus perturb rate for traditional multicast and for pbcast, each measured at the perturbed host.]

[Figure: high-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput versus perturb rate for traditional multicast and for pbcast, measured at both perturbed and unperturbed hosts.]

This solves our problem!
Bimodal Multicast rides out disturbances!

Bimodal Multicast Summary

An extremely scalable technology
Remains steady and reliable:

Even with high rates of message loss (in our tests as high as 20%)

Even with large numbers of perturbed processes (we tested with up to 25%)

Even with router failures Even when IP multicast fails

And we’ve secured it using digital signatures

Kelips

Third in our set of tools
A P2P "index":
Put("name", value)
Get("name")

Kelips can do lookups with one RPC, is self-stabilizing after disruption

Unlike Astrolabe, nodes can put varying amounts of data out there.

Kelips

[Figure: take a collection of "nodes", e.g. nodes 30, 110, 230, and 202.]

Kelips

[Figure: map nodes to affinity groups 0, 1, 2, …, √N − 1. Peer membership is determined by a consistent hash, giving roughly √N members per affinity group.]

Kelips

[Figure: affinity group pointers. Node 110 knows about the other members of its affinity group (230, 30, …) and keeps an affinity group view:]

id   hbeat  rtt
30   234    90ms
230  322    30ms

Kelips

[Figure: contact pointers. Besides its own affinity group view, node 110 keeps a small table of contacts in the other groups; 202 is a "contact" for 110 in group 2:]

group  contactNode
…      …
2      202

Kelips

[Figure: a gossip protocol replicates data cheaply. "dot.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about dot.com to it, and the members of group 2 store the resource tuple:]

resource  info
…         …
dot.com   110

Kelips

[Figure: to look up "dot.com", just ask some contact in group 2. It returns "110" (or forwards your request).]

Kelips summary

Split the system into √N subgroups
Map (key, value) pairs to a subgroup by hashing the key
Replicate within that subgroup

Each node tracks:
Its own group membership
k members of each of the other groups

To look up a key, hash it and ask one or more of your contacts whether they know the value (see the sketch below)
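A rough sketch of that hash-and-ask flow (the hash choice, data layout, and function names are simplified stand-ins, not the Kelips wire protocol):

```python
import hashlib
import math

# Simplified stand-in for Kelips' consistent hash and replicated tuple store.

def affinity_group(key: str, num_groups: int) -> int:
    """Hash a key (or a node id) onto one of the affinity groups."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_groups

N = 100                               # total number of nodes
num_groups = int(math.sqrt(N))        # about sqrt(N) affinity groups

# Toy in-memory stand-in: one replicated tuple store per affinity group.
group_store = {g: {} for g in range(num_groups)}

def put(name: str, value) -> None:
    group_store[affinity_group(name, num_groups)][name] = value

def get(name: str):
    # In the real system we would ask one of our contacts in that group;
    # here the group's replicated store is modeled as a local dict.
    return group_store[affinity_group(name, num_groups)].get(name)

put("dot.com", "node 110")
print(get("dot.com"))                 # -> node 110
```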

Kelips summary

O(√N) storage overhead, which is higher than for other DHTs
Roughly the same space overhead for the member list, the contact list, and the replicated data itself
A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn

This buys us O(1) lookup cost
And the background overhead is constant
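A back-of-the-envelope justification for the O(√N) figure, as a sketch under the usual Kelips assumptions (ignoring the replicated tuples themselves):

```latex
% With g affinity groups, each group has about N/g members. A node stores its
% own group's view plus k contacts for each of the other g - 1 groups:
S(g) \approx \frac{N}{g} + k\,(g-1)
% which is minimized near g \approx \sqrt{N/k}, giving
S\!\left(\sqrt{N/k}\right) \approx 2\sqrt{kN} = O\!\left(\sqrt{N}\right).
```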

Virtual Synchrony

Last piece of the puzzle
Outcome of a decade of DARPA-funded work; the technology is at the core of:
AEGIS "integrated" console
New York and Swiss Stock Exchanges
French Air Traffic Control System
Florida Electric Power and Light system

Virtual Synchrony Model

[Figure: a timeline for processes p, q, r, s, and t moving through views G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}. r and s request to join and are added with a state transfer; p fails and is dropped; t requests to join and is added with a state transfer.]
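To make the figure concrete, here is a toy rendering of that view sequence (illustrative only; a real virtual synchrony toolkit delivers these views, and the messages sent between them, in the same order at every surviving member):

```python
# Toy rendering of the slide's view sequence (not a group communication API).
views = [
    ("G0", {"p", "q"}),
    ("G1", {"p", "q", "r", "s"}),   # r, s request to join; state transferred to them
    ("G2", {"q", "r", "s"}),        # p fails and is dropped from the view
    ("G3", {"q", "r", "s", "t"}),   # t requests to join; state transferred to it
]

previous = set()
for name, members in views:
    joiners = sorted(members - previous) if previous else []  # founders need no transfer
    print(f"{name}: members={sorted(members)}, state transfer to {joiners}")
    previous = members
```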

Roles in QuickSilver?

Provides a way for groups of components to:
Replicate data and synchronize
Perform tasks in parallel (like parallel database lookups, for improved speed)
Detect failures and reconfigure to compensate by regenerating lost functionality

Replication: Key to understanding QuickSilver

[Figure: a composite of the four technologies. Virtual Synchrony: the p, q, r, s, t view timeline shown above. Astrolabe: the San Francisco and New Jersey leaf tables and their summary table. Kelips: nodes 30, 110, 230, 202 mapped by a gossip protocol into affinity groups, with queries routed to contacts and data replicated cheaply. Bimodal Multicast completes the set.]

Metrics

We plan to look at several:

Robustness to externally imposed stress, overload: expect to demonstrate significant improvements

Scalability: Graph performance/overheads as function of scale, load, etc

End-user power: Implement JBI, sensor networks, data-center mgt. platform

Total cost: With Raytheon, explore impact on real military applications

Under DURIP funding we have acquired a clustered evaluation platform.

Our plan

Integrate these core components, then:
Build a JBI layer over the system
Integrate Johannes Gehrke's data mining technology into the platform
Support scalable overlay multicast (Francis)

Raytheon: teaming with us to tackle military applications, notably for the Navy


More information?

www.cs.cornell.edu/Info/Projects/QuickSilver