
Page 1: Tackling Challenges of Scale in Highly Available Computing Systems

Tackling Challenges of Scale in Highly Available Computing Systems

Ken Birman, Dept. of Computer Science

Cornell University

Page 2: Tackling Challenges of Scale in Highly Available Computing Systems

Members of the group

Ken Birman
Robbert van Renesse
Einar Vollset
Krzysztof Ostrowski
Mahesh Balakrishnan
Maya Haridasan
Amar Phanishayee

Page 3: Tackling Challenges of Scale in Highly Available Computing Systems

Our topic

Computing systems are growing larger and more complex, and we hope to use them in an increasingly "unattended" manner.

We peek under the covers of the toughest, most powerful systems that exist, then ask: can we discern a research agenda?

Page 4: Tackling Challenges of Scale in Highly Available Computing Systems

Some "factoids"

Companies like Amazon, Google, and eBay run data centers with tens of thousands of machines; credit card companies, banks, brokerages, and insurance companies are close behind.

The rate of growth is staggering. Meanwhile, a new rollout of wireless sensor networks is poised to take off.

Page 5: Tackling Challenges of Scale in Highly Available Computing Systems

How are big systems structured?

Typically, a "data center" of web servers handles a mix of human-generated traffic and automatic traffic from web-services clients.

The front-end servers are connected to a pool of clustered back-end application "services."

All of this is load-balanced and multi-ported, with extensive use of caching for improved performance and scalability. Publish-subscribe is very popular.

Page 6: Tackling Challenges of Scale in Highly Available Computing Systems

A glimpse inside eStuff.com

Pub-sub is combined with point-to-point communication technologies like TCP.

[Diagram: "front-end applications" feed a set of back-end services, each reached through its own load balancer (LB).]

Page 7: Tackling Challenges of Scale in Highly Available Computing Systems

Hierarchy of sets

A set of data centers, each having
  a set of services, each structured as
  a set of partitions, each consisting of
  a set of programs running in a clustered manner on
  a set of machines

… raising the obvious question: how well do platforms support hierarchies of sets? (A minimal sketch of this nesting appears below.)
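To make the nesting concrete, here is a minimal sketch of the hierarchy as plain data classes. All of the class names are illustrative assumptions, not anything defined in the talk.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative names only; the talk does not define a concrete API.

@dataclass
class Machine:
    host: str

@dataclass
class Program:
    name: str
    machine: Machine            # runs, in a clustered manner, on a machine

@dataclass
class Partition:
    key_range: str
    replicas: List[Program] = field(default_factory=list)

@dataclass
class Service:
    name: str
    partitions: List[Partition] = field(default_factory=list)

@dataclass
class DataCenter:
    name: str
    services: List[Service] = field(default_factory=list)
```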

Page 8: Tackling Challenges of Scale in Highly Available Computing Systems

A RAPS of RACS (Jim Gray)

RAPS: a reliable array of partitioned subservices
RACS: a reliable array of cloned server processes

[Diagram: the RAPS is a set of RACS. A request such as "Ken Birman searching for 'digital camera'" is routed via the partition map; pmap("B-C") = {x, y, z}, a set of equivalent replicas. Here, y gets picked, perhaps based on load.]
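As a hedged illustration of the routing step just described, the sketch below hashes a request key to a partition, looks up that partition's RACS of equivalent replicas, and picks the least-loaded one. The names, the load metric, and the map contents are assumptions for illustration only.

```python
import hashlib

# Partition map: partition index -> RACS, i.e. a list of
# (replica_name, current_load) pairs. Values are illustrative only.
pmap = {
    0: [("a", 0.4), ("b", 0.9)],
    1: [("x", 0.7), ("y", 0.2), ("z", 0.5)],
}

def partition_of(key: str) -> int:
    """Map a request key (e.g. a search term) onto a partition."""
    h = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    return h % len(pmap)

def pick_replica(key: str) -> str:
    """Route to the least-loaded replica in the key's RACS."""
    racs = pmap[partition_of(key)]
    name, _load = min(racs, key=lambda pair: pair[1])
    return name

if __name__ == "__main__":
    print(pick_replica("digital camera"))   # picks whichever replica is least loaded
```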

Page 9: Tackling Challenges of Scale in Highly Available Computing Systems

RAPS of RACS in Data Centers

[Diagram: query and update sources reach services hosted at data centers A and B; the services are accessible system-wide. A pmap gives the logical partitioning of each service, and an l2P (logical-to-physical) map takes the logical services onto a physical server pool, perhaps many to one.]

Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.

Page 10: Tackling Challenges of Scale in Highly Available Computing Systems

Technology needs?

Programs will need a way to:
  Find the "members" of the service
  Apply the partitioning function to find contacts within a desired partition
  Perform dynamic resource management, adapting RACS size and the mapping to hardware
  Detect faults

Within a RACS we also need to:
  Replicate data for scalability and fault tolerance
  Load-balance or parallelize tasks

Page 11: Tackling Challenges of Scale in Highly Available Computing Systems

Scalability makes this hard!

Membership: within a RACS, of the service, of the services in data centers
Communication: point-to-point, multicast
Resource management: pool of machines, set of services, subdivision into RACS
Fault-tolerance
Consistency

Page 12: Tackling Challenges of Scale in Highly Available Computing Systems

… hard in what sense?

Sustainable workload often drops at least linearly in system size, because overheads grow worse than linearly (quadratic is common).

The reasons vary… but share a pattern: the frequency of "disruptive" events rises with scale, and the protocols have the property that the whole system is impacted when these events occur.

Page 13: Tackling Challenges of Scale in Highly Available Computing Systems

QuickSilver project

We have been building a scalable infrastructure addressing these needs. It consists of:
  Some existing technologies, notably Astrolabe and gossip "repair" protocols
  Some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications

Page 14: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip 101

Suppose that I know something. I'm sitting next to Fred, and I tell him; now 2 of us "know." Later, he tells Mimi and I tell Anne; now 4.

This is an example of a push epidemic. Push-pull occurs if we exchange data.

Page 15: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip scales very nicely

Participants' loads are independent of system size.
Network load is linear in system size.
Information spreads in log(system size) time.

[Plot: fraction of participants infected (0.0 to 1.0) versus time; the curve is S-shaped.]

A small simulation of this spreading behavior appears below.
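As a hedged sketch of why the spread takes roughly logarithmic time, the toy push-gossip simulation below lets every informed node tell one uniformly random peer per round; the round counts it prints grow logarithmically in N. It illustrates the general idea, not the talk's specific protocols.

```python
import math
import random

def push_gossip_rounds(n: int, seed: int = 0) -> int:
    """Rounds until all n nodes are informed, with each informed node
    pushing the rumor to one uniformly random peer per round."""
    rng = random.Random(seed)
    informed = {0}                      # node 0 starts with the rumor
    rounds = 0
    while len(informed) < n:
        targets = {rng.randrange(n) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        r = push_gossip_rounds(n)
        print(f"n={n:6d}  rounds={r:3d}  log2(n) ~ {math.log2(n):.1f}")
```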

Page 16: Tackling Challenges of Scale in Highly Available Computing Systems

Gossip in distributed systems

We can gossip about membership: we need a bootstrap mechanism, but after that we gossip about failures and new members.

We can gossip to repair faults in replicated data: "I have 6 updates from Charlie."

If we aren't in a hurry, we can gossip to replicate data too.

Page 17: Tackling Challenges of Scale in Highly Available Computing Systems

Bimodal Multicast

ACM TOCS 1999

Send multicasts to report events. Some messages don't get through. Periodically, but not synchronously, gossip about messages.

[Diagram, paraphrasing the gossip exchange: "My gossip source has a message from Mimi that I'm missing, and he seems to be missing two messages from Charlie that I have." "Here are some messages from Charlie that might interest you." "Could you send me a copy of Mimi's 7th message?" "Mimi's 7th message was: 'The meeting of our Q exam study group will start late on Wednesday…'"]

A minimal sketch of this repair loop follows.
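The sketch below illustrates the repair step under simple assumptions: each process keeps the multicasts it has received per sender, periodically exchanges digests (sender, highest sequence seen) with a random peer, and pulls whatever it is missing. It is a toy model of the anti-entropy idea, not the published Bimodal Multicast protocol.

```python
from collections import defaultdict

class Process:
    def __init__(self, name: str):
        self.name = name
        # messages[sender] = {seq: payload} for multicasts received so far
        self.messages = defaultdict(dict)

    def deliver(self, sender: str, seq: int, payload: str) -> None:
        self.messages[sender][seq] = payload

    def digest(self) -> dict:
        """Highest sequence number seen from each sender."""
        return {s: max(seqs) for s, seqs in self.messages.items() if seqs}

    def gossip_with(self, peer: "Process") -> None:
        """Push-pull anti-entropy: compare digests, exchange what's missing."""
        for a, b in ((self, peer), (peer, self)):
            for sender, top in a.digest().items():
                for seq in range(1, top + 1):
                    if seq in a.messages[sender] and seq not in b.messages[sender]:
                        b.deliver(sender, seq, a.messages[sender][seq])

# Usage: p missed Mimi's 7th multicast; one gossip round repairs the gap.
p, q = Process("p"), Process("q")
for seq in range(1, 8):
    q.deliver("Mimi", seq, f"Mimi's message {seq}")
    if seq != 7:
        p.deliver("Mimi", seq, f"Mimi's message {seq}")
p.gossip_with(q)
assert 7 in p.messages["Mimi"]
```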

Page 18: Tackling Challenges of Scale in Highly Available Computing Systems

Stock Exchange Problem: reliable multicast is too "fragile"

Most members are healthy… but one is slow.

Page 19: Tackling Challenges of Scale in Highly Available Computing Systems

The problem gets worse as the system scales up

[Plot: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members versus perturb rate (0 to 0.9), for group sizes 32, 64, and 96. Throughput degrades as the perturb rate rises, and more severely in larger groups.]

Page 20: Tackling Challenges of Scale in Highly Available Computing Systems

[Plots: low-bandwidth and high-bandwidth comparisons of pbcast (Bimodal Multicast) performance at faulty and correct hosts; average throughput versus perturb rate (0.1 to 0.9) for traditional multicast and pbcast, measured at both perturbed and unperturbed hosts.]

Bimodal multicast with perturbed processes: bimodal multicast scales well, while traditional multicast throughput collapses under stress.

Page 21: Tackling Challenges of Scale in Highly Available Computing Systems

Bimodal Multicast imposes a constant overhead on participants.

Many optimizations and tricks are needed, but nothing that isn't practical to implement. The hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links.

Reliability is easy to analyze mathematically using epidemic theory: we use the theory to derive optimal parameter settings, and the theory also lets us predict behavior. Despite the simplified model, the predictions work! (A sketch of the basic epidemic recurrence appears below.)
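As a hedged sketch of the kind of epidemic analysis meant here (the model in the TOCS paper may differ in its details), consider a push epidemic over n processes in which each infected process gossips to one peer chosen uniformly at random per round. Writing x_t for the expected fraction infected after round t:

```latex
% Each uninfected process avoids infection with probability
% (1 - 1/n)^{x_t n} \approx e^{-x_t}, so
x_{t+1} \;=\; x_t + (1 - x_t)\bigl(1 - e^{-x_t}\bigr), \qquad x_0 = 1/n .
```

Starting from x_0 = 1/n the infected fraction roughly doubles in the early rounds, which is where the log(system size) spreading time and the S-shaped infection curve on the earlier slide come from; parameter choices such as gossip rate and fan-out can then be tuned against this kind of recurrence.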

Page 22: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

A distributed "index":
  Put("name", value)
  Get("name")

Kelips can do lookups with one RPC, and is self-stabilizing after disruption.

Page 23: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

[Diagram: take a collection of "nodes," e.g. nodes 30, 110, 230, and 202.]

Page 24: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Affinity Groups: peer membership through consistent hashing. Nodes are mapped to affinity groups; with N nodes there are roughly √N groups of roughly √N members each.

[Diagram: nodes 30, 110, 230, and 202 mapped into affinity groups 0, 1, 2, ….]

Page 25: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Affinity group pointers: each node knows about the other members of its own affinity group; here, 110 knows about 230, 30, ….

Affinity group view at node 110:

  id   hbeat  rtt
  30   234    90ms
  230  322    30ms

Page 26: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Contact pointers: in addition to its affinity group view, each node keeps contacts in the other affinity groups; here, 202 is a "contact" for 110 in group 2.

Affinity group view at node 110:

  id   hbeat  rtt
  30   234    90ms
  230  322    30ms

Contacts at node 110:

  group  contactNode
  …      …
  2      202

Page 27: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

A gossip protocol replicates data cheaply. "cnn.com" maps to group 2, so node 110 tells group 2 to "route" inquiries about cnn.com to it.

Resource tuples (replicated within group 2):

  resource  info
  …         …
  cnn.com   110

Page 28: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

To look up "cnn.com", just ask some contact in group 2. It returns "110" (or forwards your request).

IP2P, ACM TOIS (submitted)

Page 29: Tackling Challenges of Scale in Highly Available Computing Systems

Kelips

Per-participant loads are constant, and the space required grows as O(√N). Kelips finds an object in "one hop," while most other DHTs need log(N) hops. And it isn't disrupted by churn, either; most other DHTs are seriously disrupted when churn occurs and might even "fail." (A toy version of the lookup path appears below.)
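To make the one-hop lookup concrete, here is a toy Kelips-style index under simple assumptions: names and nodes hash into roughly √N affinity groups, every member of a group stores that group's full set of resource tuples, and a lookup asks one contact in the name's group. The structure and names are illustrative, not the published implementation.

```python
import hashlib
import math
import random

def _hash(s: str) -> int:
    return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

class ToyKelips:
    def __init__(self, node_ids):
        self.k = max(1, round(math.sqrt(len(node_ids))))     # ~sqrt(N) groups
        # group -> member node ids (membership via hashing the node id)
        self.groups = {g: [] for g in range(self.k)}
        for nid in node_ids:
            self.groups[_hash(str(nid)) % self.k].append(nid)
        # group -> resource tuples replicated at that group's members
        self.tuples = {g: {} for g in range(self.k)}

    def put(self, name: str, home_node: int) -> None:
        """Insert: the name's affinity group learns which node holds it."""
        self.tuples[_hash(name) % self.k][name] = home_node

    def get(self, name: str):
        """Lookup: ask one contact in the name's affinity group -- one hop."""
        g = _hash(name) % self.k
        contact = random.choice(self.groups[g])   # any member of group g will do,
        # since every member of group g replicates the group's resource tuples
        return self.tuples[g].get(name)

index = ToyKelips(node_ids=list(range(100)))
index.put("cnn.com", home_node=110)
print(index.get("cnn.com"))   # -> 110
```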

Page 30: Tackling Challenges of Scale in Highly Available Computing Systems

Astrolabe: Distributed Monitoring

  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

A row can have many columns; the total size should be kilobytes, not megabytes. A configuration certificate determines what data is pulled into the table (and can change).

ACM TOCS 2003

Page 31: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

Two replicas' copies of the table before the merge:

cardinal.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  .67   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

swift.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

Page 32: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

The copies are unchanged, but during the gossip exchange swift.cs.cornell.edu sends its fresher row (swift, Time 2011, Load 2.0) to cardinal, and cardinal.cs.cornell.edu sends its fresher row (cardinal, Time 2201, Load 3.5) to swift.

Page 33: Tackling Challenges of Scale in Highly Available Computing Systems

State Merge: Core of Astrolabe epidemic

After the merge, each copy has adopted the fresher (larger Time) version of the rows that were exchanged:

cardinal.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

swift.cs.cornell.edu's copy:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2201  3.5   1          0      6.0
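A hedged sketch of the merge rule illustrated above: each row carries a logical timestamp, and when two replicas gossip, each keeps the version of a row with the larger Time. The field names follow the slides; the code structure is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    name: str          # machine name, e.g. "swift"
    time: int          # logical timestamp of this version of the row
    load: float
    weblogic: int
    smtp: int
    word_version: str

def merge(mine: dict, theirs: dict) -> dict:
    """State merge: for each machine, keep the row with the larger Time."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row.time > merged[name].time:
            merged[name] = row
    return merged

swift_view = {
    "swift":    Row("swift", 2011, 2.0, 0, 1, "6.2"),
    "cardinal": Row("cardinal", 2004, 4.5, 1, 0, "6.0"),
}
cardinal_view = {
    "swift":    Row("swift", 2003, 0.67, 0, 1, "6.2"),
    "cardinal": Row("cardinal", 2201, 3.5, 1, 1, "6.0"),
}
print(merge(swift_view, cardinal_view))   # keeps swift@2011 and cardinal@2201
```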

Page 34: Tackling Challenges of Scale in Highly Available Computing Systems

Scaling up… and up…

With a stack of domains, we don't want every system to "see" every domain; the cost would be huge. So instead, we'll see a summary.

[Diagram: the same leaf table (swift, falcon, cardinal rows with Time, Load, Weblogic?, SMTP?, Word Version) repeated once per domain in a stack, suggesting that replicating every domain's table everywhere would be prohibitively expensive.]

Page 35: Tackling Challenges of Scale in Highly Available Computing Systems

Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers

San Francisco leaf domain:

  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey leaf domain:

  Name     Load  Weblogic?  SMTP?  Word Version
  gazelle  1.7   0          0      4.5
  zebra    3.2   0          1      6.2
  gnu      .5    1          0      6.2

Inner (aggregated) domain:

  Name   Avg Load  WL contact    SMTP contact
  SF     2.6       123.45.61.3   123.45.61.17
  NJ     1.8       127.16.77.6   127.16.77.11
  Paris  3.1       14.66.71.8    14.66.71.12

An SQL query "summarizes" the data; the dynamically changing query output is visible system-wide. (A sketch of such an aggregation query appears below.)
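As a hedged sketch of the aggregation step (the actual Astrolabe configuration-certificate syntax may differ), the snippet below runs an SQL-style summary over an in-memory leaf table using Python's built-in sqlite3, producing one aggregate row per domain.

```python
import sqlite3

# Leaf rows tagged with their domain; values mirror the example tables above.
rows = [
    ("SF", "swift",    2.0, 0, 1), ("SF", "falcon",   1.5, 1, 0),
    ("SF", "cardinal", 4.5, 1, 0),
    ("NJ", "gazelle",  1.7, 0, 0), ("NJ", "zebra",    3.2, 0, 1),
    ("NJ", "gnu",      0.5, 1, 0),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE leaf (domain TEXT, name TEXT, load REAL, wl INT, smtp INT)")
db.executemany("INSERT INTO leaf VALUES (?, ?, ?, ?, ?)", rows)

# Summary query: average load per domain, plus a representative contact
# for each capability (here simply the name of some host offering it).
summary = db.execute("""
    SELECT domain,
           ROUND(AVG(load), 1)                    AS avg_load,
           MAX(CASE WHEN wl   = 1 THEN name END)  AS wl_contact,
           MAX(CASE WHEN smtp = 1 THEN name END)  AS smtp_contact
    FROM leaf
    GROUP BY domain
""").fetchall()
print(summary)   # e.g. [('NJ', 1.8, 'gnu', 'zebra'), ('SF', 2.7, 'falcon', 'swift')]
```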

Page 36: Tackling Challenges of Scale in Highly Available Computing Systems

(1) Query goes out… (2) Compute locally… (3) results flow to top level of the hierarchy

[Diagram: the same San Francisco, New Jersey, and aggregated tables as before, annotated with the three steps: (1) the query goes out to the leaf domains, (2) each computes locally, (3) the results flow to the top level of the hierarchy.]

Page 37: Tackling Challenges of Scale in Highly Available Computing Systems

Hierarchy is virtual… data is replicated

[Diagram: the same San Francisco, New Jersey, and aggregated tables as before, replicated at the participants; the hierarchy itself is virtual.]

ACM TOCS 2003

Page 38: Tackling Challenges of Scale in Highly Available Computing Systems

Astrolabe

Load on participants, in the worst case, grows as log_rsize(N), i.e. logarithmically in N with the base set by the region size; most participants see a constant, low load.

Incredibly robust and self-repairing. Information becomes visible in log time, and we can reconfigure or change the aggregation query in log time, too. Well matched to data mining.

Page 39: Tackling Challenges of Scale in Highly Available Computing Systems

QuickSilver: Current work

One goal is to offer scalable support for:
  Publish("topic", data)
  Subscribe("topic", handler)

Each topic is associated with a protocol stack and properties. Many topics… hence many protocol stacks (communication groups). QuickSilver scalable multicast is running now and demonstrates this capability in a web services framework. (A toy sketch of the publish/subscribe interface appears after this slide.)

The primary developer is Krzys Ostrowski.
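A toy, in-process illustration of the Publish/Subscribe interface named on the slide. QuickSilver itself binds each topic to its own multicast protocol stack, which this local-dispatch sketch does not attempt to model.

```python
from collections import defaultdict
from typing import Any, Callable

# topic -> list of handlers; in QuickSilver each topic would instead be
# bound to its own multicast protocol stack (a communication group).
_subscribers = defaultdict(list)

def subscribe(topic: str, handler: Callable[[Any], None]) -> None:
    _subscribers[topic].append(handler)

def publish(topic: str, data: Any) -> None:
    for handler in _subscribers[topic]:
        handler(data)

subscribe("quotes/IBM", lambda data: print("got quote:", data))
publish("quotes/IBM", {"bid": 101.2, "ask": 101.3})
```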

Page 40: Tackling Challenges of Scale in Highly Available Computing Systems

Tempest

This project seeks to automate a new drag-and-drop style of clustered application development. The emphasis is on time-critical response.

You start with a relatively standard web service application having good timing properties (inheriting from our data class). Tempest automatically clones services, places them, load-balances, and repairs faults. It uses the Ricochet protocol for time-critical multicast.

Page 41: Tackling Challenges of Scale in Highly Available Computing Systems

Ricochet

The core protocol underlying Tempest. It delivers a multicast with:
  Probabilistically strong timing properties (three orders of magnitude faster than the prior record!)
  Probability-one reliability, if desired

The key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups. Available for download from Cornell as a library (coded in Java). (A sketch of the XOR-repair idea behind FEC follows.)
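As a hedged illustration of the FEC idea only (Ricochet's actual encoding, which spans overlapping groups, is more sophisticated), the sketch below XORs a window of data packets into one repair packet, letting a receiver recover any single lost packet in the window without waiting for a retransmission.

```python
from functools import reduce

def xor_repair(packets: list) -> bytes:
    """One repair packet: the byte-wise XOR of a window of equal-length packets."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover(received: list, repair: bytes) -> list:
    """Recover a single missing packet by XORing the rest with the repair packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    assert len(missing) == 1, "simple XOR FEC repairs exactly one loss per window"
    present = [p for p in received if p is not None]
    received[missing[0]] = xor_repair(present + [repair])
    return received

window = [b"pkt-1===", b"pkt-2===", b"pkt-3==="]
repair = xor_repair(window)
damaged = [window[0], None, window[2]]     # packet 2 was lost
print(recover(damaged, repair)[1])         # -> b'pkt-2==='
```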

Page 42: Tackling Challenges of Scale in Highly Available Computing Systems

Our system will be used in…

Massive data centers
Distributed data mining
Sensor networks
Grid computing
Air Force "Services Infosphere"

Page 43: Tackling Challenges of Scale in Highly Available Computing Systems

Our platform in a datacenter

[Diagram: as in the earlier "RAPS of RACS in Data Centers" picture, query and update sources reach services hosted at data centers A and B, with a pmap giving the logical partitioning of services and an l2P map taking the logical services onto the physical server pool, perhaps many to one.]

One application can be a source of both queries and updates.

Two Astrolabe hierarchies monitor the system: one tracks the logical services, and the other tracks the physical server pool.

Operators can control the pmap, the l2P map, and other parameters. Large-scale multicast is used to disseminate updates.

Page 44: Tackling Challenges of Scale in Highly Available Computing Systems

Next major project?

We're starting a completely new effort. The goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication.

Stefan Pleisch has worked on the mobile query problem. Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…

Page 45: Tackling Challenges of Scale in Highly Available Computing Systems

Summary

Our project builds software, software that real people will end up running. But we tell users when it works, and we prove it!

The focus lately is on scalability and QoS: theory, engineering, experiments, and simulation. For scalability, we set probabilistic goals and use epidemic protocols. But the outcome will be real systems that we believe will be widely used.