TRANSCRIPT
Approved for Public Release, Distribution Unlimited
QuickSilver: Middleware for Scalable Self-Regenerative Systems
Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
Raytheon Corporation: Lou DiPalma, Paul Work
Our topic

Computing systems are growing larger and more complex, and we are hoping to use them in a more and more "unattended" manner
But the technology for managing growth and complexity is lagging
Our goal

Build a new platform in support of massively scalable, self-regenerative applications
Demonstrate it by offering a specific military application interface
Work with Raytheon to apply it in other military settings
Representative scenarios

Massive data centers maintained by the military (or by companies like Amazon)
Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
Deployments of large numbers of lightweight sensors
New network architectures to control autonomous vehicles over media shared with other "mundane" applications
How to approach the problem?

The Web Services architecture has emerged as a likely standard for large systems
But WS is "document oriented" and lacks: high availability (or any kind of quick-response guarantee), a convincing scalability story, and self-monitoring/adaptation features
Signs of trouble?

Most technologies are way beyond their normal scalability limits in this kind of center: we are "good" at small clusters but not huge ones
Pub-sub was a big hit. No longer…
Curious side-bar: pub-sub is used heavily for point-to-point communication! (Why?)
Extremely hard to diagnose problems
We lack the right tools!

Today, our applications navigate in the dark: they lack a way to find things, they lack a way to sense system state, and there are no rules for adaptation, if and when it is needed
In effect, we are starting to build very big systems, yet doing so in the usual client-server manner
This denies applications any information about system state, configuration, loads, etc.
QuickSilver

QuickSilver: a platform to help developers build these massive new systems
It has four major components:
Astrolabe: a novel kind of "virtual database"
Bimodal Multicast: for faster "few to many" data transfer patterns
Kelips: a fast "lookup" mechanism
Group replication technologies based on virtual synchrony or other similar models
QuickSilver Architecture

[Architecture diagram: applications use pub-sub (JMS, JBI) and native APIs layered over massively scalable group communication, composable microprotocol stacks, monitoring and indexing, a message repository, overlay networks, and distributed query / event detection.]
ASTROLABE

Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
What does Astrolabe do?

Astrolabe's role is to track information residing at a vast number of sources
Structured to look like a database
Approach: "peer-to-peer gossip". Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
Astrolabe in a single domain

Name      Load  Weblogic?  SMTP?  Word Version  …
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

A row can have many columns
Total size should be kilobytes, not megabytes
A configuration certificate determines what data is pulled into the table (and can change)
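To make the table concrete, here is a minimal sketch in Python of how one node's row and its configuration certificate might be represented. The column names follow the example above, but the probe functions and the `collect_row` helper are hypothetical illustrations, not Astrolabe's actual API.

```python
import time

def read_load_average() -> float:
    """Placeholder probe; a real deployment would read /proc/loadavg."""
    return 2.0

def is_running(service: str) -> bool:
    """Placeholder probe; a real deployment would inspect the process table."""
    return service == "smtpd"

# Hypothetical configuration certificate: names the columns every node should
# report and how to compute each one locally. When the certificate changes,
# the schema of the table changes with it.
CONFIG_CERTIFICATE = {
    "Load": read_load_average,
    "Weblogic?": lambda: int(is_running("weblogic")),
    "SMTP?": lambda: int(is_running("smtpd")),
}

def collect_row(node_name: str) -> dict:
    """Build this node's row by evaluating the certificate's probes."""
    row = {"Name": node_name, "Time": int(time.time())}
    for column, probe in CONFIG_CERTIFICATE.items():
        row[column] = probe()
    return row

print(collect_row("swift"))  # e.g. {'Name': 'swift', 'Time': ..., 'Load': 2.0, ...}
```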
So how does it work?

Each computer has its own row, plus replicas of some objects (the configuration certificate, other rows, etc.)
Periodically, but at a fixed rate, each machine picks a "friend" pseudo-randomly and the two exchange states efficiently (the size of the data exchanged is bounded)
States converge exponentially rapidly
Loads are low and constant, and the protocol is robust against all sorts of disruptions!
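A minimal sketch of that fixed-rate gossip step in Python; the peer selection and transport here are deliberately simplified assumptions, not Astrolabe's actual implementation.

```python
import random
import time

def gossip_round(local_state: dict, peers: list, send_state) -> None:
    """One gossip step: pick a friend pseudo-randomly and exchange state.
    In practice the amount of data exchanged per round is bounded."""
    friend = random.choice(peers)
    send_state(friend, local_state)

def run_gossip(local_state: dict, peers: list, send_state, period_s: float = 5.0):
    """Fixed-rate loop: constant, low background load regardless of system size."""
    while True:
        gossip_round(local_state, peers, send_state)
        time.sleep(period_s)
```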
State Merge: Core of Astrolabe epidemic

swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2004  4.5   1          0      6.0

cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2003  0.67  0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
State Merge: Core of Astrolabe epidemic

swift.cs.cornell.edu gossips with cardinal.cs.cornell.edu. Each side learns the rows for which the other holds a newer timestamp: swift's own row (Time 2011, Load 2.0) flows to cardinal, and cardinal's own row (Time 2201, Load 3.5) flows to swift.
State Merge: Core of Astrolabe epidemic

After the exchange, both machines hold the freshest copy of each exchanged row (each keeps its own copy of falcon, which was not involved in this exchange):

swift.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1971  1.5   1          0      4.1
cardinal  2201  3.5   1          1      6.0

cardinal.cs.cornell.edu:
Name      Time  Load  Weblogic?  SMTP?  Word Version
swift     2011  2.0   0          1      6.2
falcon    1976  2.7   1          0      4.1
cardinal  2201  3.5   1          1      6.0
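The merge rule illustrated above (for each row, keep whichever copy carries the newer timestamp) can be written in a few lines. This is an illustrative sketch using the numbers from the example, not Astrolabe's actual code.

```python
def merge_states(mine: dict, theirs: dict) -> dict:
    """Merge two Astrolabe-style tables keyed by node name:
    for each row, keep the copy with the larger Time field."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

# After swift and cardinal gossip, both end up with swift's row at time 2011
# and cardinal's row at time 2201, while keeping their own copies of falcon.
swift_view = {
    "swift":    {"Time": 2011, "Load": 2.0},
    "falcon":   {"Time": 1971, "Load": 1.5},
    "cardinal": {"Time": 2004, "Load": 4.5},
}
cardinal_view = {
    "swift":    {"Time": 2003, "Load": 0.67},
    "falcon":   {"Time": 1976, "Load": 2.7},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
print(merge_states(swift_view, cardinal_view)["cardinal"]["Time"])  # 2201
```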
Observations

The merge protocol has constant cost: one message sent and one received (on average) per unit time
The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so
Information spreads in O(log N) time, but this assumes bounded region size
In Astrolabe, we limit regions to 50-100 rows
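As a rough worked example (our own arithmetic, not a figure from the talk): with a 64-row region and one gossip round every 5 seconds, information reaches every member in roughly log2(64) = 6 rounds, i.e. about 30 seconds.

```python
import math

region_rows = 64        # Astrolabe limits regions to 50-100 rows
round_period_s = 5.0    # one gossip exchange every ~5 seconds

rounds = math.ceil(math.log2(region_rows))     # O(log N) dissemination
print(rounds, rounds * round_period_s)          # 6 rounds, ~30 seconds
```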
Scaling up… and up…

With a stack of domains, we don't want every system to "see" every domain; the cost would be huge
So instead, we'll see a summary

[Diagram: many copies of the per-domain table (swift, falcon, cardinal with Time, Load, Weblogic?, SMTP?, Word Version columns), one per leaf domain.]
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers

San Francisco:
Name      Load  Weblogic?  SMTP?  Word Version  …
swift     2.0   0          1      6.2
falcon    1.5   1          0      4.1
cardinal  4.5   1          0      6.0

New Jersey:
Name      Load  Weblogic?  SMTP?  Word Version  …
gazelle   1.7   0          0      4.5
zebra     3.2   0          1      6.2
gnu       0.5   1          0      6.2

Aggregated (root) level:
Name   Avg Load  WL contact    SMTP contact
SF     2.6       123.45.61.3   123.45.61.17
NJ     1.8       127.16.77.6   127.16.77.11
Paris  3.1       14.66.71.8    14.66.71.12

An SQL query "summarizes" the data
Dynamically changing query output is visible system-wide
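A minimal sketch of how a parent-level row could be computed from a child domain's table. The column names follow the example above, but the code is only an illustration of the idea (a small SQL-style aggregation evaluated at each level), not Astrolabe's actual query engine, and it reports host names rather than the IP contacts shown in the slide.

```python
def summarize_domain(domain_name: str, rows: list) -> dict:
    """Compute one parent-level row from a leaf domain's table, roughly
    equivalent to: SELECT AVG(Load), ... FROM domain."""
    avg_load = sum(r["Load"] for r in rows) / len(rows)
    # Pick a representative contact: e.g. the least-loaded machine running SMTP.
    smtp_hosts = [r for r in rows if r["SMTP?"]]
    smtp_contact = min(smtp_hosts, key=lambda r: r["Load"])["Name"] if smtp_hosts else None
    return {"Name": domain_name, "Avg Load": round(avg_load, 1), "SMTP contact": smtp_contact}

sf = [
    {"Name": "swift",    "Load": 2.0, "SMTP?": 1},
    {"Name": "falcon",   "Load": 1.5, "SMTP?": 0},
    {"Name": "cardinal", "Load": 4.5, "SMTP?": 0},
]
print(summarize_domain("SF", sf))  # {'Name': 'SF', 'Avg Load': 2.7, 'SMTP contact': 'swift'}
```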
(1) The query goes out… (2) each domain computes locally… (3) results flow to the top level of the hierarchy

[Diagram: the same San Francisco / New Jersey / root tables as above, with arrows showing the query flowing down the hierarchy and the computed summaries flowing back up.]
Hierarchy is virtual… data is replicated

[Diagram: the same hierarchy as above; the aggregated tables are not stored on any single server, but are replicated among the participants.]
The key to self-* properties!

A flexible, reprogrammable mechanism:
Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly
Which machines have inventory for warehouse 9?
Where's the cheapest gasoline in the area?
Think of aggregation functions as small agents that look for information (see the sketch below)
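As a hypothetical example of such an "agent": an aggregation function, installed via a new aggregation certificate, that counts idle machines holding a particular package in each domain. The idle threshold and the "Packages" column are invented for this illustration.

```python
IDLE_THRESHOLD = 0.5   # hypothetical definition of "idle"

def count_idle_with_package(rows: list, package: str = "NMR-3D") -> int:
    """Per-domain aggregate: how many idle machines hold `package`?
    Parent domains simply sum their children's counts."""
    return sum(
        1 for r in rows
        if r["Load"] < IDLE_THRESHOLD and package in r.get("Packages", ())
    )

def combine(child_counts: list) -> int:
    return sum(child_counts)
```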
What about security?

Astrolabe requires:
Read permissions to see the database
Write permissions to contribute data
Administrative permission to change aggregation or configuration certificates
Users decide what data Astrolabe can see
A VPN setup can be used to hide Astrolabe's internal messages from intruders
New: Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
Data Mining

Data mining is quite a hot area, usually done by collecting information to a centralized node, then "querying" within that node
Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
This is incredibly parallel, hence faster, and more robust against disruption too!
Cool Astrolabe Properties

Parallel: everyone does a tiny bit of work, so we accomplish huge tasks in seconds
Flexible: decentralized query evaluation, in seconds
One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?" – the hierarchy encodes many answers in one tree!
Aggregation and Hierarchy

Nearby information is maintained in more detail; you can query it directly, and changes are seen sooner
Remote information is summarized into high-quality aggregated data, which also changes as the underlying information evolves
Astrolabe summary

Scalable: could support millions of machines
Flexible: can easily extend the domain hierarchy, define new columns or eliminate old ones; adapts as conditions evolve
Secure: uses keys for authentication and can even encrypt; handles firewalls gracefully, including issues of IP address re-use behind firewalls
Performs well: updates propagate in seconds
Cheap to run: tiny load, small memory impact
Bimodal Multicast

A quick glimpse of scalable multicast
Think about really large Internet configurations: a data center as the data source, with a typical "publication" going to thousands of client systems
Swiss Stock Exchange problem: virtually synchronous multicast is "fragile"

[Diagram: a multicast group in which most members are healthy, but one is slow.]
Performance degrades as the system scales up

[Graph: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members (0-250) versus perturb rate (0-0.9), for group sizes 32, 64, and 96.]
Why doesn't multicast scale?

With weak semantics, faulty behavior may occur more often as system size increases (think "the Internet")
With stronger reliability semantics, we encounter a system-wide cost (e.g. membership reconfiguration, congestion control) that can be triggered more often as a function of scale (more failures, more network "events", bigger latencies)
A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
But none of this is inevitable

Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
It also gives very steady throughput, and can take advantage of hardware support for multicast, if available

The Bimodal Multicast (pbcast) protocol:
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
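A minimal sketch of the digest / solicit / retransmit round described above, in Python. The message identifiers, transport, and data structures are simplified assumptions for illustration, not the actual pbcast implementation.

```python
import random

class PbcastNode:
    """Simplified gossip-repair loop for one group member."""

    def __init__(self, name, members, send):
        self.name = name
        self.members = members          # other group members
        self.delivered = {}             # msg_id -> message body
        self.send = send                # send(dest, kind, payload)

    def gossip_round(self):
        """Every ~100ms: describe what we have to one random member."""
        peer = random.choice(self.members)
        self.send(peer, "digest", set(self.delivered))   # ids only, not bodies

    def on_digest(self, sender, their_ids):
        """Ask the sender for anything in its digest that we are missing."""
        missing = set(their_ids) - set(self.delivered)
        if missing:
            self.send(sender, "solicit", missing)

    def on_solicit(self, sender, wanted_ids):
        """Retransmit requested messages that we still hold."""
        for msg_id in set(wanted_ids) & set(self.delivered):
            self.send(sender, "retransmit", (msg_id, self.delivered[msg_id]))
```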
[Graph: low-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput (0-200) versus perturb rate (0.1-0.9), for the traditional protocol with 1 perturbed member and pbcast with 1 perturbed member, each measured at the perturbed host.]
[Graph: high-bandwidth comparison of pbcast performance at faulty and correct hosts; average throughput (0-200) versus perturb rate (0.1-0.9), for the traditional protocol and pbcast, each measured at both an unperturbed host and the perturbed host.]
This solves our problem! Bimodal Multicast rides out disturbances!
Bimodal Multicast Summary

An extremely scalable technology
Remains steady and reliable:
Even with high rates of message loss (in our tests, as high as 20%)
Even with large numbers of perturbed processes (we tested with up to 25%)
Even with router failures
Even when IP multicast fails
And we've secured it using digital signatures
Kelips

The third in our set of tools: a P2P "index" with two operations, Put("name", value) and Get("name")
Kelips can do lookups with one RPC and is self-stabilizing after disruption
Unlike Astrolabe, nodes can put varying amounts of data out there
Kelips

[Diagram: take a collection of "nodes", e.g. nodes with IDs 30, 110, 230, and 202.]
Kelips

[Diagram: map nodes to affinity groups, numbered 0, 1, 2, … √N−1, by applying a consistent hash to each node's ID; roughly √N members end up in each affinity group.]
Kelips

[Diagram: each node keeps an affinity group view with pointers to other members of its own affinity group. For example, node 110 knows about other members of its group, 230 and 30:

id   hbeat  rtt
30   234    90ms
230  322    30ms]
Kelips

[Diagram: each node also keeps contact pointers, a small set of contacts in each of the other affinity groups. For example, node 202 is a "contact" for node 110 in group 2:

group  contactNode
…      …
2      202]
Kelips

[Diagram: a gossip protocol replicates data cheaply. Each node also stores resource tuples for resources mapped to its own group:

resource  info
…         …
dot.com   110

"dot.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about dot.com to it.]
Kelips

[Diagram: to look up "dot.com", just ask some contact in group 2. It returns "110" (or forwards your request).]
Kelips summary

Split the system into √N subgroups
Map (key, value) pairs to some subgroup by hashing the key
Replicate within that subgroup
Each node tracks its own group's membership and k members of each of the other groups
To look up a key, hash it and ask one or more of your contacts in that group whether they know the value (see the sketch below)
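A minimal sketch of the idea in Python, assuming √N affinity groups and a simple hash. Gossip-based replication within the group, contact refresh, and the real routing logic of Kelips are omitted; the class and method names are invented for this illustration.

```python
import hashlib
import math

class KelipsNode:
    """Toy model of one Kelips node's soft state and lookup path."""

    def __init__(self, node_id: int, n_nodes: int):
        self.num_groups = max(1, round(math.sqrt(n_nodes)))  # ~sqrt(N) affinity groups
        self.group = node_id % self.num_groups
        self.tuples = {}       # resource tuples held in our own affinity group
        self.contacts = {}     # group number -> a known contact node in that group

    def group_of(self, key: str) -> int:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.num_groups

    def put(self, key: str, value):
        """Publish: hand the tuple to the key's affinity group (which, in the
        real system, then replicates it within that group by gossip)."""
        target = self.group_of(key)
        if target == self.group:
            self.tuples[key] = value
        else:
            self.contacts[target].put(key, value)   # one hop to a contact

    def get(self, key: str):
        """Lookup: one RPC to a contact in the key's affinity group."""
        target = self.group_of(key)
        node = self if target == self.group else self.contacts[target]
        return node.tuples.get(key)
```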
Kelips summary

O(√N) storage overhead, which is higher than for other DHTs
Same space overhead for the member list, the contact list, and the replicated data itself
A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn
This buys us O(1) lookup cost
And the background overhead is constant
Virtual Synchrony

The last piece of the puzzle
The outcome of a decade of DARPA-funded work; the technology is at the core of:
the AEGIS "integrated" console
the New York and Swiss Stock Exchanges
the French Air Traffic Control System
the Florida Electric Power and Light System
Virtual Synchrony Model

[Diagram: processes p, q, r, s, t move through an agreed-upon sequence of group views. G0={p,q}; r and s request to join and are added with a state transfer, giving G1={p,q,r,s}; p fails (crash), giving G2={q,r,s}; t requests to join and is added with a state transfer, giving G3={q,r,s,t}.]
Roles in QuickSilver?

Virtual synchrony provides a way for groups of components to:
Replicate data and synchronize
Perform tasks in parallel (like parallel database lookups, for improved speed)
Detect failures and reconfigure to compensate, regenerating lost functionality
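As an illustration of the kind of programming model this enables, here is a hypothetical Python-style group-member skeleton (the names are invented for this sketch, not QuickSilver's actual API): a joining member receives a state transfer, all members see multicasts in the same order within the same view, and view-change notifications let the survivors regenerate lost functionality.

```python
class GroupMember:
    """Skeleton of a virtually synchronous group member (illustrative only)."""

    def __init__(self):
        self.state = {}          # replicated state, kept identical at all members

    def on_state_transfer(self, snapshot: dict):
        """Called when we join: initialize our replica from an existing member."""
        self.state = dict(snapshot)

    def on_multicast(self, update):
        """Delivered in the same order, in the same view, at every member."""
        key, value = update
        self.state[key] = value

    def on_view_change(self, view: set, departed: set):
        """Membership changed (join or failure): members can now repartition
        work or regenerate services the departed members were running."""
        print("new view:", sorted(view), "departed:", sorted(departed))
```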
Replication: Key to understanding QuickSilver

[Composite figure: the four components shown side by side - the virtual synchrony view sequence (G0={p,q} through G3={q,r,s,t}), the Astrolabe domain tables and their aggregated summary (San Francisco, New Jersey, Paris), the Kelips affinity-group diagram with gossip-replicated membership and data, and Bimodal Multicast - each of which replicates data or state across members.]
Metrics

We plan to look at several:
Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
Scalability: graph performance and overheads as a function of scale, load, etc.
End-user power: implement JBI, sensor networks, and a data-center management platform
Total cost: with Raytheon, explore the impact on real military applications
Under DURIP funding we have acquired a clustered evaluation platform.
Our plan

Integrate these core components
Then:
Build a JBI layer over the system
Integrate Johannes Gehrke's data mining technology into the platform
Support scalable overlay multicast (Francis)
Raytheon: teaming with us to tackle military applications, notably for the Navy