
Page 1: Tackling Challenges of Scale in Highly Available Computing Systems

Tackling Challenges of Scale in Highly Available Computing Systems

Ken Birman, Dept. of Computer Science

Cornell University

Page 2: Tackling Challenges of Scale in Highly Available Computing Systems

Members of the group

Ken Birman
Robbert van Renesse
Einar Vollset
Krzysztof Ostrowski
Mahesh Balakrishnan
Maya Haridasan
Amar Phanishayee

Page 3: Tackling Challenges of Scale in Highly Available Computing Systems

Our topic

Computing systems are growing larger and more complex, and we hope to use them in an increasingly "unattended" manner.

We peek under the covers of the toughest, most powerful systems that exist, then ask: can we discern a research agenda?

Page 4: Tackling Challenges of Scale in Highly Available Computing Systems

Some "factoids"

Companies like Amazon, Google, and eBay run data centers with tens of thousands of machines; credit card companies, banks, brokerages, and insurance companies are close behind.

The rate of growth is staggering. Meanwhile, a new rollout of wireless sensor networks is poised to take off.

Page 5: Tackling Challenges of Scale in Highly Available Computing Systems

How are big systems structured?

Typically, a "data center" of web servers handles a mix of human-generated traffic and automatic traffic from web-services clients.

The front-end servers are connected to a pool of clustered back-end application "services."

All of this is load-balanced and multi-ported, with extensive use of caching for improved performance and scalability. Publish-subscribe is very popular.

Page 6: Tackling Challenges of Scale in Highly Available Computing Systems

A glimpse inside eStuff.com

Pub-sub is combined with point-to-point communication technologies like TCP.

[Diagram: "front-end applications" feed a set of back-end services, each reached through its own load balancer (LB).]

Page 7: Tackling Challenges of Scale in Highly Available Computing Systems

Hierarchy of sets

A set of data centers, each having
  a set of services, each structured as
  a set of partitions, each consisting of
  a set of programs running in a clustered manner on
  a set of machines

… raising the obvious question: how well do platforms support hierarchies of sets? (A minimal sketch of this nesting appears below.)
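To make the nesting concrete, here is a minimal sketch of the hierarchy as plain data classes. All of the class names are illustrative assumptions, not anything defined in the talk.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative names only; the talk does not define a concrete API.

@dataclass
class Machine:
    host: str

@dataclass
class Program:
    name: str
    machine: Machine            # runs, in a clustered manner, on a machine

@dataclass
class Partition:
    key_range: str
    replicas: List[Program] = field(default_factory=list)

@dataclass
class Service:
    name: str
    partitions: List[Partition] = field(default_factory=list)

@dataclass
class DataCenter:
    name: str
    services: List[Service] = field(default_factory=list)
```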

Page 8: Tackling Challenges of Scale in Highly Available Computing Systems

A RAPS of RACS (Jim Gray)

RAPS: a reliable array of partitioned subservices
RACS: a reliable array of cloned server processes

[Diagram: the RAPS is a set of RACS. A request such as "Ken Birman searching for 'digital camera'" is routed via the partition map; pmap("B-C") = {x, y, z}, a set of equivalent replicas. Here, y gets picked, perhaps based on load.]
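As a hedged illustration of the routing step just described, the sketch below hashes a request key to a partition, looks up that partition's RACS of equivalent replicas, and picks the least-loaded one. The names, the load metric, and the map contents are assumptions for illustration only.

```python
import hashlib

# Partition map: partition index -> RACS, i.e. a list of
# (replica_name, current_load) pairs. Values are illustrative only.
pmap = {
    0: [("a", 0.4), ("b", 0.9)],
    1: [("x", 0.7), ("y", 0.2), ("z", 0.5)],
}

def partition_of(key: str) -> int:
    """Map a request key (e.g. a search term) onto a partition."""
    h = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    return h % len(pmap)

def pick_replica(key: str) -> str:
    """Route to the least-loaded replica in the key's RACS."""
    racs = pmap[partition_of(key)]
    name, _load = min(racs, key=lambda pair: pair[1])
    return name

if __name__ == "__main__":
    print(pick_replica("digital camera"))   # picks whichever replica is least loaded
```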

Page 9: Tackling Challenges of Scale in Highly Available Computing Systems

RAPS of RACS in Data Centers

[Diagram: query and update sources reach services hosted at data centers A and B; the services are accessible system-wide. A pmap gives the logical partitioning of each service, and an l2P (logical-to-physical) map takes the logical services onto a physical server pool, perhaps many to one.]

Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.

Page 10: Tackling Challenges of Scale in Highly Available Computing Systems

Technology needs?

Programs will need a way to:
  Find the "members" of the service
  Apply the partitioning function to find contacts within a desired partition
  Perform dynamic resource management, adapting RACS size and the mapping to hardware
  Detect faults

Within a RACS we also need to:
  Replicate data for scalability and fault tolerance
  Load-balance or parallelize tasks

Page 11: Tackling Challenges of Scale in Highly Available Computing Systems

Scalability makes this hard!

Membership: within a RACS, of the service, of the services in data centers
Communication: point-to-point, multicast
Resource management: pool of machines, set of services, subdivision into RACS
Fault-tolerance
Consistency

Page 12: Tackling Challenges of Scale in Highly Available Computing Systems

… hard in what sense?

Sustainable workload often drops at least linearly in system size, because overheads grow worse than linearly (quadratic is common).

The reasons vary… but share a pattern: the frequency of "disruptive" events rises with scale, and the protocols have the property that the whole system is impacted when these events occur.

Page 13: Tackling Challenges of Scale in Highly Available Computing Systems

QuickSilver project

We have been building a scalable infrastructure addressing these needs. It consists of:
  Some existing technologies, notably Astrolabe and gossip "repair" protocols
  Some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications

Page 14: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip 101

Suppose that I know something. I'm sitting next to Fred, and I tell him; now 2 of us "know." Later, he tells Mimi and I tell Anne; now 4.

This is an example of a push epidemic. Push-pull occurs if we exchange data.

Page 15: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip scales very nicely

Participants' loads are independent of system size.
Network load is linear in system size.
Information spreads in log(system size) time.

[Plot: fraction of participants infected (0.0 to 1.0) versus time; the curve is S-shaped.]

A small simulation of this spreading behavior appears below.
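As a hedged sketch of why the spread takes roughly logarithmic time, the toy push-gossip simulation below lets every informed node tell one uniformly random peer per round; the round counts it prints grow logarithmically in N. It illustrates the general idea, not the talk's specific protocols.

```python
import math
import random

def push_gossip_rounds(n: int, seed: int = 0) -> int:
    """Rounds until all n nodes are informed, with each informed node
    pushing the rumor to one uniformly random peer per round."""
    rng = random.Random(seed)
    informed = {0}                      # node 0 starts with the rumor
    rounds = 0
    while len(informed) < n:
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        r = push_gossip_rounds(n)
        print(f"n={n:6d}  rounds={r:3d}  log2(n) ~ {math.log2(n):.1f}")
```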

Page 16: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip in distributed systems

We can gossip about membership: we need a bootstrap mechanism, but after that we gossip about failures and new members.

We can gossip to repair faults in replicated data: "I have 6 updates from Charlie."

If we aren't in a hurry, we can gossip to replicate data too.

Page 17: Tackling Challenges of Scale in Highly Available Computing Systems

Bimodal Multicast

ACM TOCS 1999

Send multicasts to report events. Some messages don't get through. Periodically, but not synchronously, gossip about messages.

[Diagram, paraphrasing the gossip exchange: "My gossip source has a message from Mimi that I'm missing, and he seems to be missing two messages from Charlie that I have." "Here are some messages from Charlie that might interest you." "Could you send me a copy of Mimi's 7th message?" "Mimi's 7th message was: 'The meeting of our Q exam study group will start late on Wednesday…'"]

A minimal sketch of this repair loop follows.
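The sketch below illustrates the repair step under simple assumptions: each process keeps the multicasts it has received per sender, periodically exchanges digests (sender, highest sequence seen) with a random peer, and pulls whatever it is missing. It is a toy model of the anti-entropy idea, not the published Bimodal Multicast protocol.

```python
from collections import defaultdict

class Process:
    def __init__(self, name: str):
        self.name = name
        # messages[sender] = {seq: payload} for multicasts received so far
        self.messages = defaultdict(dict)

    def deliver(self, sender: str, seq: int, payload: str) -> None:
        self.messages[sender][seq] = payload

    def digest(self) -> dict:
        """Highest sequence number seen from each sender."""
        return {s: max(seqs) for s, seqs in self.messages.items() if seqs}

    def gossip_with(self, peer: "Process") -> None:
        """Push-pull anti-entropy: compare digests, exchange what's missing."""
        for a, b in ((self, peer), (peer, self)):
            for sender, top in a.digest().items():
                for seq in range(1, top + 1):
                    if seq in a.messages[sender] and seq not in b.messages[sender]:
                        b.deliver(sender, seq, a.messages[sender][seq])

# Usage: p missed Mimi's 7th multicast; one gossip round repairs the gap.
p, q = Process("p"), Process("q")
for seq in range(1, 8):
    q.deliver("Mimi", seq, f"Mimi's message {seq}")
    if seq != 7:
        p.deliver("Mimi", seq, f"Mimi's message {seq}")
p.gossip_with(q)
assert 7 in p.messages["Mimi"]
```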

Page 18: Tackling Challenges of Scale in Highly Available Computing Systems

Stock Exchange Problem: reliable multicast is too "fragile"

Most members are healthy… but one is slow.

Page 19: Tackling Challenges of Scale in Highly Available Computing Systems

The problem gets worse as the system scales up

[Plot: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members versus perturb rate (0 to 0.9), for group sizes 32, 64, and 96. Throughput degrades as the perturb rate rises, and more severely in larger groups.]

Page 20: Tackling Challenges of Scale in Highly Available Computing Systems

[Plots: low-bandwidth and high-bandwidth comparisons of pbcast (Bimodal Multicast) performance at faulty and correct hosts; average throughput versus perturb rate (0.1 to 0.9) for traditional multicast and pbcast, measured at both perturbed and unperturbed hosts.]

Bimodal multicast with perturbed processes: bimodal multicast scales well, while traditional multicast throughput collapses under stress.

Page 21: Tackling Challenges of Scale in Highly Available Computing Systems

Bimodal Multicast imposes a constant overhead on participants.

Many optimizations and tricks are needed, but nothing that isn't practical to implement. The hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links.

Reliability is easy to analyze mathematically using epidemic theory: we use the theory to derive optimal parameter settings, and the theory also lets us predict behavior. Despite the simplified model, the predictions work! (A sketch of the basic epidemic recurrence appears below.)
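As a hedged sketch of the kind of epidemic analysis meant here (the model in the TOCS paper may differ in its details), consider a push epidemic over n processes in which each infected process gossips to one peer chosen uniformly at random per round. Writing x_t for the expected fraction infected after round t:

```latex
% Each uninfected process avoids infection with probability
% (1 - 1/n)^{x_t n} \approx e^{-x_t}, so
x_{t+1} \;=\; x_t + (1 - x_t)\bigl(1 - e^{-x_t}\bigr), \qquad x_0 = 1/n .
```

Starting from x_0 = 1/n the infected fraction roughly doubles in the early rounds, which is where the log(system size) spreading time and the S-shaped infection curve on the earlier slide come from; parameter choices such as gossip rate and fan-out can then be tuned against this kind of recurrence.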

Page 22: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

A distributed "index":
  Put("name", value)
  Get("name")

Kelips can do lookups with one RPC, and is self-stabilizing after disruption.

Page 23: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

[Diagram: take a collection of "nodes," e.g. nodes 30, 110, 230, and 202.]

Page 24: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Affinity Groups: peer membership through consistent hashing. Nodes are mapped to affinity groups; with N nodes there are roughly √N groups of roughly √N members each.

[Diagram: nodes 30, 110, 230, and 202 mapped into affinity groups 0, 1, 2, ….]

Page 25: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Affinity group pointers: each node knows about the other members of its own affinity group; here, 110 knows about 230, 30, ….

Affinity group view at node 110:

  id   hbeat  rtt
  30   234    90ms
  230  322    30ms

Page 26: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Contact pointers: in addition to its affinity group view, each node keeps contacts in the other affinity groups; here, 202 is a "contact" for 110 in group 2.

Affinity group view at node 110:

  id   hbeat  rtt
  30   234    90ms
  230  322    30ms

Contacts at node 110:

  group  contactNode
  …      …
  2      202

Page 27: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

A gossip protocol replicates data cheaply. "cnn.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about cnn.com to it.

Resource tuples (replicated within group 2):

  resource  info
  …         …
  cnn.com   110

Page 28: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

To look up "cnn.com", just ask some contact in group 2. It returns "110" (or forwards your request).

IP2P, ACM TOIS (submitted)

Page 29: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Per-participant loads are constant, and the space required grows as O(√N). Kelips finds an object in "one hop," while most other DHTs need log(N) hops. And it isn't disrupted by churn, either; most other DHTs are seriously disrupted when churn occurs and might even "fail." (A toy version of the lookup path appears below.)
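To make the one-hop lookup concrete, here is a toy Kelips-style index under simple assumptions: names and nodes hash into roughly √N affinity groups, every member of a group stores that group's full set of resource tuples, and a lookup asks one contact in the name's group. The structure and names are illustrative, not the published implementation.

```python
import hashlib
import math
import random

def _hash(s: str) -> int:
    return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

class ToyKelips:
    def __init__(self, node_ids):
        self.k = max(1, round(math.sqrt(len(node_ids))))     # ~sqrt(N) groups
        # group -> member node ids (membership via hashing the node id)
        self.groups = {g: [] for g in range(self.k)}
        for nid in node_ids:
            self.groups[_hash(str(nid)) % self.k].append(nid)
        # group -> resource tuples replicated at that group's members
        self.tuples = {g: {} for g in range(self.k)}

    def put(self, name: str, home_node: int) -> None:
        """Insert: the name's affinity group learns which node holds it."""
        self.tuples[_hash(name) % self.k][name] = home_node

    def get(self, name: str):
        """Lookup: ask one contact in the name's affinity group -- one hop."""
        g = _hash(name) % self.k
        contact = random.choice(self.groups[g])   # any member of group g will do,
        # since every member of group g replicates the group's resource tuples
        return self.tuples[g].get(name)

index = ToyKelips(node_ids=list(range(100)))
index.put("cnn.com", home_node=110)
print(index.get("cnn.com"))   # -> 110
```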

Page 30: Tackling Challenges of Scale in Highly Available Computing Systems

Astrolabe: Distributed Monitoring

  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

A row can have many columns; the total size should be kilobytes, not megabytes. A configuration certificate determines what data is pulled into the table (and can change).

ACM TOCS 2003

Page 31: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

Two replicas' copies of the table before the merge:

cardinal.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  .67   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

swift.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

Page 32: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

The copies are unchanged, but during the gossip exchange swift.cs.cornell.edu sends its fresher row (swift, Time 2011, Load 2.0) to cardinal, and cardinal.cs.cornell.edu sends its fresher row (cardinal, Time 2201, Load 3.5) to swift.

Page 33: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

After the merge, each copy has adopted the fresher (larger Time) version of the rows that were exchanged:

cardinal.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

swift.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2201  3.5   1          0      6.0
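A hedged sketch of the merge rule illustrated above: each row carries a logical timestamp, and when two replicas gossip, each keeps the version of a row with the larger Time. The field names follow the slides; the code structure is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    name: str          # machine name, e.g. "swift"
    time: int          # logical timestamp of this version of the row
    load: float
    weblogic: int
    smtp: int
    word_version: str

def merge(mine: dict, theirs: dict) -> dict:
    """State merge: for each machine, keep the row with the larger Time."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row.time > merged[name].time:
            merged[name] = row
    return merged

swift_view = {
    "swift":    Row("swift", 2011, 2.0, 0, 1, "6.2"),
    "cardinal": Row("cardinal", 2004, 4.5, 1, 0, "6.0"),
}
cardinal_view = {
    "swift":    Row("swift", 2003, 0.67, 0, 1, "6.2"),
    "cardinal": Row("cardinal", 2201, 3.5, 1, 1, "6.0"),
}
print(merge(swift_view, cardinal_view))   # keeps swift@2011 and cardinal@2201
```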

Page 34: Tackling Challenges of Scale in Highly Available Computing Systems

Scaling up… and up…

With a stack of domains, we don't want every system to "see" every domain; the cost would be huge. So instead, we'll see a summary.

[Diagram: the same leaf table (swift, falcon, cardinal rows with Time, Load, Weblogic?, SMTP?, Word Version) repeated once per domain in a stack, suggesting that replicating every domain's table everywhere would be prohibitively expensive.]

Page 35: Tackling Challenges of Scale in Highly Available Computing Systems

Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers

San Francisco leaf domain:

  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey leaf domain:

  Name     Load  Weblogic?  SMTP?  Word Version
  gazelle  1.7   0          0      4.5
  zebra    3.2   0          1      6.2
  gnu      .5    1          0      6.2

Inner (aggregated) domain:

  Name   Avg Load  WL contact    SMTP contact
  SF     2.6       123.45.61.3   123.45.61.17
  NJ     1.8       127.16.77.6   127.16.77.11
  Paris  3.1       14.66.71.8    14.66.71.12

An SQL query "summarizes" the data; the dynamically changing query output is visible system-wide. (A sketch of such an aggregation query appears below.)
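As a hedged sketch of the aggregation step (the actual Astrolabe configuration-certificate syntax may differ), the snippet below runs an SQL-style summary over an in-memory leaf table using Python's built-in sqlite3, producing one aggregate row per domain.

```python
import sqlite3

# Leaf rows tagged with their domain; values mirror the example tables above.
rows = [
    ("SF", "swift",    2.0, 0, 1), ("SF", "falcon",   1.5, 1, 0),
    ("SF", "cardinal", 4.5, 1, 0),
    ("NJ", "gazelle",  1.7, 0, 0), ("NJ", "zebra",    3.2, 0, 1),
    ("NJ", "gnu",      0.5, 1, 0),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE leaf (domain TEXT, name TEXT, load REAL, wl INT, smtp INT)")
db.executemany("INSERT INTO leaf VALUES (?, ?, ?, ?, ?)", rows)

# Summary query: average load per domain, plus a representative contact
# for each capability (here simply the name of some host offering it).
summary = db.execute("""
    SELECT domain,
           ROUND(AVG(load), 1)                    AS avg_load,
           MAX(CASE WHEN wl   = 1 THEN name END)  AS wl_contact,
           MAX(CASE WHEN smtp = 1 THEN name END)  AS smtp_contact
    FROM leaf
    GROUP BY domain
""").fetchall()
print(summary)   # e.g. [('NJ', 1.8, 'gnu', 'zebra'), ('SF', 2.7, 'falcon', 'swift')]
```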

Page 36: Tackling Challenges of Scale in Highly Available Computing Systems

(1) Query goes out… (2) Compute locally… (3) results flow to top level of the hierarchy

[Diagram: the same San Francisco, New Jersey, and aggregated tables as before, annotated with the three steps: (1) the query goes out to the leaf domains, (2) each computes locally, (3) the results flow to the top level of the hierarchy.]

Page 37: Tackling Challenges of Scale in Highly Available Computing Systems

Hierarchy is virtual… data is replicated

[Diagram: the same San Francisco, New Jersey, and aggregated tables as before, replicated at the participants; the hierarchy itself is virtual.]

ACM TOCS 2003

Page 38: Tackling Challenges of Scale in Highly Available Computing Systems

Astrolabe

Load on participants, in the worst case, grows as log_rsize(N), i.e. logarithmically in N with the base set by the region size; most participants see a constant, low load.

Incredibly robust and self-repairing. Information becomes visible in log time, and we can reconfigure or change the aggregation query in log time, too. Well matched to data mining.

Page 39: Tackling Challenges of Scale in Highly Available Computing Systems

QuickSilver: Current work

One goal is to offer scalable support for:
  Publish("topic", data)
  Subscribe("topic", handler)

Each topic is associated with a protocol stack and properties. Many topics… hence many protocol stacks (communication groups). QuickSilver scalable multicast is running now and demonstrates this capability in a web services framework. (A toy sketch of the publish/subscribe interface appears after this slide.)

The primary developer is Krzys Ostrowski.
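A toy, in-process illustration of the Publish/Subscribe interface named on the slide. QuickSilver itself binds each topic to its own multicast protocol stack, which this local-dispatch sketch does not attempt to model.

```python
from collections import defaultdict
from typing import Any, Callable

# topic -> list of handlers; in QuickSilver each topic would instead be
# bound to its own multicast protocol stack (a communication group).
_subscribers = defaultdict(list)

def subscribe(topic: str, handler: Callable[[Any], None]) -> None:
    _subscribers[topic].append(handler)

def publish(topic: str, data: Any) -> None:
    for handler in _subscribers[topic]:
        handler(data)

subscribe("quotes/IBM", lambda data: print("got quote:", data))
publish("quotes/IBM", {"bid": 101.2, "ask": 101.3})
```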

Page 40: Tackling Challenges of Scale in Highly Available Computing Systems

Tempest

This project seeks to automate a new drag-and-drop style of clustered application development. The emphasis is on time-critical response.

You start with a relatively standard web service application having good timing properties (inheriting from our data class). Tempest automatically clones services, places them, load-balances, and repairs faults. It uses the Ricochet protocol for time-critical multicast.

Page 41: Tackling Challenges of Scale in Highly Available Computing Systems

Ricochet

The core protocol underlying Tempest. It delivers a multicast with:
  Probabilistically strong timing properties (three orders of magnitude faster than the prior record!)
  Probability-one reliability, if desired

The key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups. Available for download from Cornell as a library (coded in Java). (A sketch of the XOR-repair idea behind FEC follows.)
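As a hedged illustration of the FEC idea only (Ricochet's actual encoding, which spans overlapping groups, is more sophisticated), the sketch below XORs a window of data packets into one repair packet, letting a receiver recover any single lost packet in the window without waiting for a retransmission.

```python
from functools import reduce

def xor_repair(packets: list) -> bytes:
    """One repair packet: the byte-wise XOR of a window of equal-length packets."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover(received: list, repair: bytes) -> list:
    """Recover a single missing packet by XORing the rest with the repair packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    assert len(missing) == 1, "simple XOR FEC repairs exactly one loss per window"
    present = [p for p in received if p is not None]
    received[missing[0]] = xor_repair(present + [repair])
    return received

window = [b"pkt-1===", b"pkt-2===", b"pkt-3==="]
repair = xor_repair(window)
damaged = [window[0], None, window[2]]     # packet 2 was lost
print(recover(damaged, repair)[1])         # -> b'pkt-2==='
```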

Page 42: Tackling Challenges of Scale in Highly Available Computing Systems

Our system will be used in…

Massive data centers
Distributed data mining
Sensor networks
Grid computing
Air Force "Services Infosphere"

Page 43: Tackling Challenges of Scale in Highly Available Computing Systems

Our platform in a datacenter

[Diagram: as in the earlier "RAPS of RACS in Data Centers" picture, query and update sources reach services hosted at data centers A and B, with a pmap giving the logical partitioning of services and an l2P map taking the logical services onto the physical server pool, perhaps many to one.]

One application can be a source of both queries and updates.

Two Astrolabe hierarchies monitor the system: one tracks the logical services, and the other tracks the physical server pool.

Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.

Page 44: Tackling Challenges of Scale in Highly Available Computing Systems

Next major project?

We're starting a completely new effort. The goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication.

Stefan Pleisch has worked on the mobile query problem. Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…

Page 45: Tackling Challenges of Scale in Highly Available Computing Systems

Summary

Our project builds software, software that real people will end up running. But we tell users when it works, and we prove it!

The focus lately is on scalability and QoS: theory, engineering, experiments, and simulation. For scalability, we set probabilistic goals and use epidemic protocols. But the outcome will be real systems that we believe will be widely used.