TRANSCRIPT
Approved for Public Release, Distribution Unlimited
QuickSilver: Middleware for Scalable Self-Regenerative Systems
Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
Raytheon Corporation: Lou DiPalma, Paul Work
Our topic

Computing systems are growing larger and more complex, and we are hoping to use them in a more and more "unattended" manner
But the technology for managing growth and complexity is lagging
Our goal

Build a new platform in support of massively scalable, self-regenerative applications
Demonstrate it by offering a specific military application interface
Work with Raytheon to apply it in other military settings
Representative scenarios

Massive data centers maintained by the military (or by companies like Amazon)
Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
Deployments of large numbers of lightweight sensors
New network architectures to control autonomous vehicles over media shared with other "mundane" applications
How to approach the problem?

The Web Services architecture has emerged as a likely standard for large systems
But WS is "document oriented" and lacks: high availability (or any kind of quick-response guarantee), a convincing scalability story, and self-monitoring/adaptation features
Signs of trouble?

Most technologies are way beyond their normal scalability limits in this kind of center: we are "good" at small clusters but not huge ones
Pub-sub was a big hit. No longer…
Curious side-bar: pub-sub is used heavily for point-to-point communication! (Why?)
Extremely hard to diagnose problems
We lack the right tools!

Today, our applications navigate in the dark: they lack a way to find things, they lack a way to sense system state, and there are no rules for adaptation, if and when it is needed
In effect, we are starting to build very big systems, yet doing so in the usual client-server manner
This denies applications any information about system state, configuration, loads, etc.
QuickSilver

QuickSilver: a platform to help developers build these massive new systems
It has four major components:
Astrolabe: a novel kind of "virtual database"
Bimodal Multicast: for faster "few to many" data transfer patterns
Kelips: a fast "lookup" mechanism
Group replication technologies based on virtual synchrony or other similar models
QuickSilver Architecture

[Architecture diagram: applications use pub-sub (JMS, JBI) and native APIs layered over massively scalable group communication, composable microprotocol stacks, monitoring and indexing, a message repository, overlay networks, and distributed query / event detection.]
ASTROLABE

Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
What does Astrolabe do?

Astrolabe's role is to track information residing at a vast number of sources
Structured to look like a database
Approach: "peer-to-peer gossip". Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
Astrolabe in a single domain

Name      Load  Weblogic?  SMTP?  Word Version  …
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

A row can have many columns
Total size should be kilobytes, not megabytes
A configuration certificate determines what data is pulled into the table (and can change)
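To make the table concrete, here is a minimal sketch in Python of how one node's row and its configuration certificate might be represented. The column names follow the example above, but the probe functions and the `collect_row` helper are hypothetical illustrations, not Astrolabe's actual API.

```python
import time

def read_load_average() -> float:
    """Placeholder probe; a real deployment would read /proc/loadavg."""
    return 2.0

def is_running(service: str) -> bool:
    """Placeholder probe; a real deployment would inspect the process table."""
    return service == "smtpd"

# Hypothetical configuration certificate: names the columns every node should
# report and how to compute each one locally. When the certificate changes,
# the schema of the table changes with it.
CONFIG_CERTIFICATE = {
    "Load": read_load_average,
    "Weblogic?": lambda: int(is_running("weblogic")),
    "SMTP?": lambda: int(is_running("smtpd")),
}

def collect_row(node_name: str) -> dict:
    """Build this node's row by evaluating the certificate's probes."""
    row = {"Name": node_name, "Time": int(time.time())}
    for column, probe in CONFIG_CERTIFICATE.items():
        row[column] = probe()
    return row

print(collect_row("swift"))  # e.g. {'Name': 'swift', 'Time': ..., 'Load': 2.0, ...}
```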
So how does it work?

Each computer has its own row, plus replicas of some objects (the configuration certificate, other rows, etc.)
Periodically, but at a fixed rate, each machine picks a "friend" pseudo-randomly and the two exchange states efficiently (the size of the data exchanged is bounded)
States converge exponentially rapidly
Loads are low and constant, and the protocol is robust against all sorts of disruptions!
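A minimal sketch of that fixed-rate gossip step in Python; the peer selection and transport here are deliberately simplified assumptions, not Astrolabe's actual implementation.

```python
import random
import time

def gossip_round(local_state: dict, peers: list, send_state) -> None:
    """One gossip step: pick a friend pseudo-randomly and exchange state.
    In practice the amount of data exchanged per round is bounded."""
    friend = random.choice(peers)
    send_state(friend, local_state)

def run_gossip(local_state: dict, peers: list, send_state, period_s: float = 5.0):
    """Fixed-rate loop: constant, low background load regardless of system size."""
    while True:
        gossip_round(local_state, peers, send_state)
        time.sleep(period_s)
```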
State Merge: Core of Astrolabe epidemic

swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2004  4.5   1          0      6.0

cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2003  0.67  0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
State Merge: Core of Astrolabe epidemic

swift.cs.cornell.edu gossips with cardinal.cs.cornell.edu. Each side learns the rows for which the other holds a newer timestamp: swift's own row (Time 2011, Load 2.0) flows to cardinal, and cardinal's own row (Time 2201, Load 3.5) flows to swift.
State Merge: Core of Astrolabe epidemic

After the exchange, both machines hold the freshest copy of each exchanged row (each keeps its own copy of falcon, which was not involved in this exchange):

swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          1      6.0

cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
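The merge rule illustrated above (for each row, keep whichever copy carries the newer timestamp) can be written in a few lines. This is an illustrative sketch using the numbers from the example, not Astrolabe's actual code.

```python
def merge_states(mine: dict, theirs: dict) -> dict:
    """Merge two Astrolabe-style tables keyed by node name:
    for each row, keep the copy with the larger Time field."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

# After swift and cardinal gossip, both end up with swift's row at time 2011
# and cardinal's row at time 2201, while keeping their own copies of falcon.
swift_view = {
    "swift":    {"Time": 2011, "Load": 2.0},
    "falcon":   {"Time": 1971, "Load": 1.5},
    "cardinal": {"Time": 2004, "Load": 4.5},
}
cardinal_view = {
    "swift":    {"Time": 2003, "Load": 0.67},
    "falcon":   {"Time": 1976, "Load": 2.7},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
print(merge_states(swift_view, cardinal_view)["cardinal"]["Time"])  # 2201
```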
Observations

The merge protocol has constant cost: one message sent and one received (on average) per unit time
The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so
Information spreads in O(log N) time, but this assumes bounded region size
In Astrolabe, we limit regions to 50-100 rows
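As a rough worked example (our own arithmetic, not a figure from the talk): with a 64-row region and one gossip round every 5 seconds, information reaches every member in roughly log2(64) = 6 rounds, i.e. about 30 seconds.

```python
import math

region_rows = 64        # Astrolabe limits regions to 50-100 rows
round_period_s = 5.0    # one gossip exchange every ~5 seconds

rounds = math.ceil(math.log2(region_rows))     # O(log N) dissemination
print(rounds, rounds * round_period_s)          # 6 rounds, ~30 seconds
```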
Scaling up… and up…

With a stack of domains, we don't want every system to "see" every domain; the cost would be huge
So instead, we'll see a summary

[Diagram: many copies of the per-domain table (swift, falcon, cardinal with Time, Load, Weblogic?, SMTP?, Word Version columns), one per leaf domain.]
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers

San Francisco:
Name      Load  Weblogic?  SMTP?  Word Version  …
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

New Jersey:
Name      Load  Weblogic?  SMTP?  Word Version  …
gazelle   1.7   0          0      4.5
zebra     3.2   0          1      6.2
gnu       0.5   1          0      6.2

Aggregated (root) level:
Name   Avg Load  WL contact    SMTP contact
SF     2.6       123.45.61.3   123.45.61.17
NJ     1.8       127.16.77.6   127.16.77.11
Paris  3.1       14.66.71.8    14.66.71.12

An SQL query "summarizes" the data
Dynamically changing query output is visible system-wide
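A minimal sketch of how a parent-level row could be computed from a child domain's table. The column names follow the example above, but the code is only an illustration of the idea (a small SQL-style aggregation evaluated at each level), not Astrolabe's actual query engine, and it reports host names rather than the IP contacts shown in the slide.

```python
def summarize_domain(domain_name: str, rows: list) -> dict:
    """Compute one parent-level row from a leaf domain's table, roughly
    equivalent to: SELECT AVG(Load), ... FROM domain."""
    avg_load = sum(r["Load"] for r in rows) / len(rows)
    # Pick a representative contact: e.g. the least-loaded machine running SMTP.
    smtp_hosts = [r for r in rows if r["SMTP?"]]
    smtp_contact = min(smtp_hosts, key=lambda r: r["Load"])["Name"] if smtp_hosts else None
    return {"Name": domain_name, "Avg Load": round(avg_load, 1), "SMTP contact": smtp_contact}

sf = [
    {"Name": "swift",    "Load": 2.0, "SMTP?": 1},
    {"Name": "falcon",   "Load": 1.5, "SMTP?": 0},
    {"Name": "cardinal", "Load": 4.5, "SMTP?": 0},
]
print(summarize_domain("SF", sf))  # {'Name': 'SF', 'Avg Load': 2.7, 'SMTP contact': 'swift'}
```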
(1) The query goes out… (2) each domain computes locally… (3) results flow to the top level of the hierarchy

[Diagram: the same San Francisco / New Jersey / root tables as above, with arrows showing the query flowing down the hierarchy and the computed summaries flowing back up.]
Hierarchy is virtual… data is replicated

[Diagram: the same hierarchy as above; the aggregated tables are not stored on any single server, but are replicated among the participants.]
The key to self-* properties!

A flexible, reprogrammable mechanism:
Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly
Which machines have inventory for warehouse 9?
Where's the cheapest gasoline in the area?
Think of aggregation functions as small agents that look for information (see the sketch below)
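As a hypothetical example of such an "agent": an aggregation function, installed via a new aggregation certificate, that counts idle machines holding a particular package in each domain. The idle threshold and the "Packages" column are invented for this illustration.

```python
IDLE_THRESHOLD = 0.5   # hypothetical definition of "idle"

def count_idle_with_package(rows: list, package: str = "NMR-3D") -> int:
    """Per-domain aggregate: how many idle machines hold `package`?
    Parent domains simply sum their children's counts."""
    return sum(
        1 for r in rows
        if r["Load"] < IDLE_THRESHOLD and package in r.get("Packages", ())
    )

def combine(child_counts: list) -> int:
    return sum(child_counts)
```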
What about security?

Astrolabe requires:
Read permissions to see the database
Write permissions to contribute data
Administrative permission to change aggregation or configuration certificates
Users decide what data Astrolabe can see
A VPN setup can be used to hide Astrolabe's internal messages from intruders
New: Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
Data Mining

Data mining is quite a hot area, usually done by collecting information to a centralized node, then "querying" within that node
Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
This is incredibly parallel, hence faster, and more robust against disruption too!
Cool Astrolabe Properties

Parallel: everyone does a tiny bit of work, so we accomplish huge tasks in seconds
Flexible: decentralized query evaluation, in seconds
One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?" – the hierarchy encodes many answers in one tree!
Aggregation and Hierarchy

Nearby information is maintained in more detail; you can query it directly, and changes are seen sooner
Remote information is summarized into high-quality aggregated data, which also changes as the underlying information evolves
Astrolabe summary

Scalable: could support millions of machines
Flexible: can easily extend the domain hierarchy, define new columns or eliminate old ones; adapts as conditions evolve
Secure: uses keys for authentication and can even encrypt; handles firewalls gracefully, including issues of IP address re-use behind firewalls
Performs well: updates propagate in seconds
Cheap to run: tiny load, small memory impact
Bimodal Multicast

A quick glimpse of scalable multicast
Think about really large Internet configurations: a data center as the data source, with a typical "publication" going to thousands of client systems
Swiss Stock Exchange problem: virtually synchronous multicast is "fragile"

[Diagram: a multicast group in which most members are healthy, but one is slow.]
Performance degrades as the system scales up

[Graph: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members (0-250) versus perturb rate (0-0.9), for group sizes 32, 64, and 96.]
Why doesn't multicast scale?

With weak semantics, faulty behavior may occur more often as system size increases (think "the Internet")
With stronger reliability semantics, we encounter a system-wide cost (e.g. membership reconfiguration, congestion control) that can be triggered more often as a function of scale (more failures, more network "events", bigger latencies)
A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
But none of this is inevitable

Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
It also gives very steady throughput, and can take advantage of hardware support for multicast, if available

The Bimodal Multicast (pbcast) protocol:
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
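A minimal sketch of the digest / solicit / retransmit round described above, in Python. The message identifiers, transport, and data structures are simplified assumptions for illustration, not the actual pbcast implementation.

```python
import random

class PbcastNode:
    """Simplified gossip-repair loop for one group member."""

    def __init__(self, name, members, send):
        self.name = name
        self.members = members          # other group members
        self.delivered = {}             # msg_id -> message body
        self.send = send                # send(dest, kind, payload)

    def gossip_round(self):
        """Every ~100ms: describe what we have to one random member."""
        peer = random.choice(self.members)
        self.send(peer, "digest", set(self.delivered))   # ids only, not bodies

    def on_digest(self, sender, their_ids):
        """Ask the sender for anything in its digest that we are missing."""
        missing = set(their_ids) - set(self.delivered)
        if missing:
            self.send(sender, "solicit", missing)

    def on_solicit(self, sender, wanted_ids):
        """Retransmit requested messages that we still hold."""
        for msg_id in set(wanted_ids) & set(self.delivered):
            self.send(sender, "retransmit", (msg_id, self.delivered[msg_id]))
```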
[Graph: low-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput (0-200) versus perturb rate (0.1-0.9), for the traditional protocol with 1 perturbed member and pbcast with 1 perturbed member, each measured at the perturbed host.]
[Graph: high-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput (0-200) versus perturb rate (0.1-0.9), for the traditional protocol and pbcast, each measured at both an unperturbed host and the perturbed host.]
This solves our problem! Bimodal Multicast rides out disturbances!
Bimodal Multicast Summary

An extremely scalable technology
Remains steady and reliable:
Even with high rates of message loss (in our tests, as high as 20%)
Even with large numbers of perturbed processes (we tested with up to 25%)
Even with router failures
Even when IP multicast fails
And we've secured it using digital signatures
Kelips

The third in our set of tools: a P2P "index" with two operations, Put("name", value) and Get("name")
Kelips can do lookups with one RPC and is self-stabilizing after disruption
Unlike Astrolabe, nodes can put varying amounts of data out there
Kelips

[Diagram: take a collection of "nodes", e.g. nodes with IDs 30, 110, 230, and 202.]
Kelips

[Diagram: map nodes to affinity groups, numbered 0, 1, 2, … √N−1, by applying a consistent hash to each node's ID; roughly √N members end up in each affinity group.]
Kelips

[Diagram: each node keeps an affinity group view with pointers to other members of its own affinity group. For example, node 110 knows about other members of its group, 230 and 30:

id   hbeat  rtt
30   234    90ms
230  322    30ms]
Kelips

[Diagram: each node also keeps contact pointers, a small set of contacts in each of the other affinity groups. For example, node 202 is a "contact" for node 110 in group 2:

group  contactNode
…      …
2      202]
Kelips

[Diagram: a gossip protocol replicates data cheaply. Each node also stores resource tuples for resources mapped to its own group:

resource  info
…         …
dot.com   110

"dot.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about dot.com to it.]
Kelips

[Diagram: to look up "dot.com", just ask some contact in group 2. It returns "110" (or forwards your request).]
Kelips summary

Split the system into √N subgroups
Map (key, value) pairs to some subgroup by hashing the key
Replicate within that subgroup
Each node tracks its own group's membership and k members of each of the other groups
To look up a key, hash it and ask one or more of your contacts in that group whether they know the value (see the sketch below)
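A minimal sketch of the idea in Python, assuming √N affinity groups and a simple hash. Gossip-based replication within the group, contact refresh, and the real routing logic of Kelips are omitted; the class and method names are invented for this illustration.

```python
import hashlib
import math

class KelipsNode:
    """Toy model of one Kelips node's soft state and lookup path."""

    def __init__(self, node_id: int, n_nodes: int):
        self.num_groups = max(1, round(math.sqrt(n_nodes)))  # ~sqrt(N) affinity groups
        self.group = node_id % self.num_groups
        self.tuples = {}       # resource tuples held in our own affinity group
        self.contacts = {}     # group number -> a known contact node in that group

    def group_of(self, key: str) -> int:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.num_groups

    def put(self, key: str, value):
        """Publish: hand the tuple to the key's affinity group (which, in the
        real system, then replicates it within that group by gossip)."""
        target = self.group_of(key)
        if target == self.group:
            self.tuples[key] = value
        else:
            self.contacts[target].put(key, value)   # one hop to a contact

    def get(self, key: str):
        """Lookup: one RPC to a contact in the key's affinity group."""
        target = self.group_of(key)
        node = self if target == self.group else self.contacts[target]
        return node.tuples.get(key)
```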
Kelips summary

O(√N) storage overhead, which is higher than for other DHTs
Same space overhead for the member list, the contact list, and the replicated data itself
A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn
This buys us O(1) lookup cost
And the background overhead is constant
Virtual Synchrony

The last piece of the puzzle
The outcome of a decade of DARPA-funded work; the technology is at the core of:
the AEGIS "integrated" console
the New York and Swiss Stock Exchanges
the French Air Traffic Control System
the Florida Electric Power and Light System
Virtual Synchrony Model

[Diagram: processes p, q, r, s, t move through an agreed-upon sequence of group views. G0={p,q}; r and s request to join and are added with a state transfer, giving G1={p,q,r,s}; p fails (crash), giving G2={q,r,s}; t requests to join and is added with a state transfer, giving G3={q,r,s,t}.]
Roles in QuickSilver?

Virtual synchrony provides a way for groups of components to:
Replicate data and synchronize
Perform tasks in parallel (like parallel database lookups, for improved speed)
Detect failures and reconfigure to compensate, regenerating lost functionality
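As an illustration of the kind of programming model this enables, here is a hypothetical Python-style group-member skeleton (the names are invented for this sketch, not QuickSilver's actual API): a joining member receives a state transfer, all members see multicasts in the same order within the same view, and view-change notifications let the survivors regenerate lost functionality.

```python
class GroupMember:
    """Skeleton of a virtually synchronous group member (illustrative only)."""

    def __init__(self):
        self.state = {}          # replicated state, kept identical at all members

    def on_state_transfer(self, snapshot: dict):
        """Called when we join: initialize our replica from an existing member."""
        self.state = dict(snapshot)

    def on_multicast(self, update):
        """Delivered in the same order, in the same view, at every member."""
        key, value = update
        self.state[key] = value

    def on_view_change(self, view: set, departed: set):
        """Membership changed (join or failure): members can now repartition
        work or regenerate services the departed members were running."""
        print("new view:", sorted(view), "departed:", sorted(departed))
```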
Replication: Key to understanding QuickSilver

[Composite figure: the four components shown side by side - the virtual synchrony view sequence (G0={p,q} through G3={q,r,s,t}), the Astrolabe domain tables and their aggregated summary (San Francisco, New Jersey, Paris), the Kelips affinity-group diagram with gossip-replicated membership and data, and Bimodal Multicast - each of which replicates data or state across members.]
Metrics

We plan to look at several:
Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
Scalability: graph performance and overheads as a function of scale, load, etc.
End-user power: implement JBI, sensor networks, and a data-center management platform
Total cost: with Raytheon, explore the impact on real military applications
Under DURIP funding we have acquired a clustered evaluation platform.
Our plan

Integrate these core components
Then:
Build a JBI layer over the system
Integrate Johannes Gehrke's data mining technology into the platform
Support scalable overlay multicast (Francis)
Raytheon: teaming with us to tackle military applications, notably for the Navy