an economic approach for scalable and highly-available distributed applications nicolas bonvin,...

1

An economic approach for scalable and highly-available distributed applications

Nicolas Bonvin, Thanasis G. Papaioannou and Karl AbererSchool of Computer and Communication SciencesEcole Polytechnique F´ed´erale de Lausanne (EPFL)

1015 Lausanne, [email protected]

2010 IEEE 3rd International Conference on Cloud Computing

2

Outline

• INTRODUCTION• MOTIVATION - RUNNING EXAMPLE• SCARCE: THE QUEST OF AUTONOMIC APPLICATIONS

• EVALUATION • CONCLUSIONS• COMMENT

3

INTRODUCTION• A successful online application should

– be able to handle traffic spikes and flash crowds efficiently– be resilient to all kinds of failures (software, hardware, rack or even datacenter failures)

• A naive solution against load variations would be static over-provisioning of resources– result into resource underutilization for most of the time

• As the size of the cloud increases its administrative overhead becomes unmanageable– The cloud resources for an application should be self-managed and

adaptive to load variations or failures

4

INTRODUCTION• We propose a middleware (“Scattered Autonomic Resources”, referred to as Scarce)

– to avoid underutilized computational resources – dynamically adapts to changing conditions(failures or load

variations)

• Simplifies the development of online applications composed by multiple independent components (e.g. web services) – following the Service Oriented Architecture (SOA) principles

5

INTRODUCTION• Components are treated as individually rational entities that

rent computational resources from servers

• Components migrate, replicate or stop according to their economic fitness– fitness expresses the difference between the utility offered by a

specific application component and the cost for retaining it in the cloud

• Components of a certain application are dynamically replicated to geographically-diverse servers according to the availability requirements of the application

6

INTRODUCTION

• Our approach combines the following unique characteristics:• Adaptive component replication for accommodating load variations• Geographically-diverse placement of clone component instances• Cost-effective placement of service components • Decentralized self-management of the cloud resources for the

application

7

RUNNING EXAMPLE• Building an application that both provides robust guarantees against

failures (hardware, network, etc.) and handles dynamically load spikes is a non-trivial task

• a simple web application for selling e-tickets composed by 4 independent components

l

the entry point of theapplication and serves the HTML pages to end users

A user manager for managing the profiles of the customers

A ticket manager for managing the amount of availabletickets of an event

An e-ticket generator that produces e-tickets in PDFformat

8

RUNNING EXAMPLE

• Each component can be regarded as a standalone and self-contained web service

• This application – is highly sensitive to traffic spikes– is business-critical (deployed on different geographical regions)

A token (or a session ID) is assigned to each customer’s browser by the web front-end and is passed to each component along with the requests

This token is used as a key in the key-value database to store the detailsof the client’s shopping cart(number of tickets ordered)

9

SCARCE

• THE QUEST OF AUTONOMIC APPLICATIONS:– A. The approach– B. Server agent– C. Routing table– D. Economic model– E. Maintaining high-availability

10

The approach

• We consider applications formed by many independent components – interact together to provide a service to the end user

• A component– is self-managing, self-healing and is hosted by a

server(allowed to host many different components) – stop, migrate or replicate to a new server according to its

load or availability

11

Server agent• The server agent is a special component that resides at each

server – responsible for starting and stopping the components of applications – checking the health of the services 1. verifying if the service process is still running 2. firing a test request ,checking that the corresponding reply is correct

• The agent knows the properties of every service that composes the application– the path of the service executable

• This knowledge is acquired when the agent starts, by contacting another agent (referred to as bootstrap agent)

12

Server agent

• During the startup phase, the agent also retrieve the current routing table from the bootstrap agent

• Assume that a server belongs to a rack, a room, a datacenter, a city, a country and a continent.

• A label of the form “continent-country-city-datacenter-room-rack-server”– For example, a possible label for a server located in a data center in London

could be “EU-UK-LON-D1-C03-R11-S07”

13

Routing table

• Keeps locally a mapping between components and servers

• It is maintained by a gossiping algorithm – where each agent contacts a random subset ( where is the total

number of servers) of remote agents – exchanges information about the services running on server

14

Choosing the replica • we consider 4 different policies that a server may use for

choosing the replica of a component– proximity-based policy: the geographically nearest replica is chosen– rent-based policy: the least loaded server is chosen based on the rent price of the servers– random-based policy: a random replica is chosen– a net benefit-based policy: the geographically closest and least loaded replica

15

Economic model

• Service replication should be highly adaptive to the processing load and to failures– the server agent as an individual optimizer to each component 1. to ascertain the pre-specified availability guarantees 2. to balance its economic fitness

• At each epoch, a service : pays a virtual rent to the servers where it is running

– virtual rent corresponds to the usage of the server resources(CPU, memory, network, disk)

may be replicated , migrated or stopped by the server agent– based on the service demand, the renting cost and the maintenance

of high availability

16

Economic model• The actions performed by the server agent are directly related

to the economic viability, which is given by:

• The utility of a component corresponds to the value

• is a factor computed using the utilization of the server resources by the component

• is a certain threshold that determines when a component should be considered fit enough in order to replicate(currently, this is set to 25% of server usage)

• Some components may be more business critical than others

17

Economic model

• The virtual rent paid by the componentto the serveris given by:

• is a subjective estimation of the server quality and reliability• is a factor that expresses the resource utilization of the server

• Other utility and rent function could be used as long as they were both increasing to the resource usage and result in comparable values

18

Migrate or Stop

• At the beginning of a new epoch, a component may:• migrate or stop: if it has negative balance for the last epochs

– If the availability is satisfactory, the component stops– Otherwise, it tries to find a less expensive server (migration)

• To avoid oscillations of a replica among servers, the migration is only allowed if the following migration conditions:– The minimum availability is still satisfied using the new server– the absolute price difference between the current and the new server– the of the current server is above a soft limit

19

Replicate

• replicate: if it has positive balance for the last epochs• For replication, a component has also to verify that it can

afford the replication by having a positive balance for consecutive epochs:

– Where is the current virtual rent of the candidate server for replication

– the factor accounts for a increase at this rent price in the next epoch of the candidate server

(an upper bound of can typically be assumed)

20

Availability

• The availability of a component should be always kept above a required minimum level – estimating the probability of each server to fail necessitates access to

an large set of historical data and private information of the server

• Express the availability of a serviceby the geographical diversity of the servers that host its replicas

• is the set of servers hosting replicas of the service• are the confidence levels of servers

21

Availability

• The diversity function returns a number calculated based on the geographical distance among each server pairs– This distance can be represented as a - bit number (continent, country, city, data center, room, rack, server)

• When two servers have the same location, their corresponding proximity bit is set to 1, otherwise to 0

• A binary NOT operation is then applied to the proximity to get the diversity value

• having more replicas in distinct servers located even in the same location always results in increased availability

22

Candidate server

• When the availability of a component falls below a new service instance should be started (i.e. replicated) – maximize the net benefit between the diversity of the resulting set of

replica locations for the service and the virtual rent of the new server

– where is the virtual rent of candidate server– is a weight related to the proximity of the server location to the

geographical distribution of the client requests for the service (cf.[3])( tend to replicate closer to the components that heavily rely on the services of the former)

[3]N. Bonvin, T. G. Papaioannou, and K. Aberer, “Cost-efficient and differentiated data availability guarantees in data clouds,” in Proc. of the ICDE, Long Beach, CA, USA, 2010.

23

Candidate server

• The components rank servers according – net benefit (6) – randomly choose the target for replication among the top-k ones( avoid overloading the same destination server at an epoch )

• the availability tends to be increased as much as possible at the minimum cost, while the network latency for the query reply also decreases

• that the same approach according to (6) is used for choosing the candidate server for component migration

24

Experimental Setup

• We employ two different testbed settings:– a single application setup consisting of 7 servers – a multi-application setup consisting of 15 servers

• The hardware specification of each server is Intel Core i7 920 @ 2.67 GHz, 8GB Ram, Linux 2.6.32-trunk-amd64

• We run two databases (MySQL 5.1 and Cassandra 0.5.0)

• One generator of client requests for each application (FunkLoad 1.10, http://funkload.nuxeo.org/) on their own dedicated servers

25

Experimental Setup• Performing the following actions:

– 1) request the main page that contains the list of entertainment events– 2) request the details of an event A– 3) request the details of an event B– 4) request again the details of the event A– 5) login into the application and view user account – 6) update some personal information– 7) buy a ticket for the event A– 8) download the corresponding ticket in PDF

• A client continuously performs this list of actions over a period of 1 minute

26

Experimental Setup

• An epoch is set to 15 seconds• An agent sends gossip messages every 5 seconds

• We consider two different placements of the components:– A static approach each component is assigned to a server by the system administrator– A dynamic approach all components are started on a single server and dynamically migrate / replicate / stop according to the load or the hardware failures

27

Dynamic vs Static Replica Placementthe response time is lower bounded by that of the slowest component (in our case, for generating PDF tickets)

28

Scalabilitythe multi-application experimental setupAssume that all 10 servers reside at 1 datacenter:Increase the number of concurrent users from 150 to 1500randomly routed among the replicas of a component

29

High-availability

A single application setup consisting of 7 servers

Assume that each component has 2 replicas that reside at separate servers

10 concurrent clients continuously send requests for 1 minute

After 30 seconds, one random server between those hosting the replicas of a component fails

30

Adaptation to New Cloud ResourcesWe employ the single-application experimental setup, but the number of available servers in the cloud ranges from 1 to 10

31

Evaluation of Routing PoliciesIn this case, we employ the single-application setupThe 4 servers of the cloud are located in 2 datacenters (2 servers per datacenter). The round-trip time between the datacenters is 50 ms The minimum availability (i.e. number of replica per component) is set to 2

32

CONCLUSIONS

• Proposed an economic approach for dynamic accommodation of load spikes for composite web services deployed in clouds

• Server agents act as individual optimizers and autonomously replicate, migrate or stop based on their economic fitness

• This approach also offers high availability guarantees by maintaining a certain number of the various components in geographically diverse

33

COMMENT

• Rents v.s. Resources– Using rents is a more flexible approach

• Distributed v.s. Centralized

an economic approach for scalable and highly-available distributed applications nicolas bonvin,...

Documents

server agent

components of applications

introduction components

cloud components

specific application

bootstrap agent

certain application

servers components