cloud engineering

Post on 29-Nov-2014

Category:

Technology

DESCRIPTION

An introduction to datacenters and cloud computing for master's students in engineering

TRANSCRIPT

Cloud Engineering
Software in Datacenters
Gwendal Simon
Department of Computer Science
Institut Telecom
2009

Gartner Hype Cycle 2009

2 / 43 Gwendal Simon Cloud Engineering

Literature

A book:
- “The Datacenter as a Computer”, Luiz André Barroso and Urs Hölzle

and scientific publications, including:
- ACM Sigops (SOSP) and Usenix OSDI
- HPDC, EuroPar, OOPSLA, etc.
- blogs (e.g. http://perspectives.mvdirona.com/)

Disclaimer

No discussion about the impact of cloud computing:
- net neutrality
- interactions between CDNs and ISPs
- privacy and electronic human rights
- ...

Focus here on how cloud computing works

Introduction

In a nutshell

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

NIST: http://csrc.nist.gov/groups/SNS/cloud-computing/index.html

Why Cloud Computing

The *aaS paradigm:

SaaS: Software as a Service
- applications for end-users (salesforce.com, Google, etc.)
- email, office suite, photo sharing, video storage

PaaS: Platform as a Service
- services for web app developers (Azure, Google, etc.)
- workflow facilities and various basic services (http, database)

IaaS: Infrastructure as a Service
- resources for developers (Amazon, Joyent, etc.)
- servers, network equipment, memory, CPU

An Engineer's Vision

Benefits: Outsourcing Infrastructure
- reduce run time and response time
- minimize infrastructure risk
- ease deployment and upgrading

Challenge: Warehouse-Scale Computers
- thousands of individual computing nodes
- costly equipment (power, conditioning, cooling)
- large buildings with engineering teams

Few Pictures

Datacenter is Different

Datacenter vs. Desktop Software:
- inherent parallelism
- software control
- platform homogeneity
- fault-free requirement

Datacenter vs. High-Performance Computing:
- unpredictable input
- high volume of data
- not only computing

Basic Elements of a Datacenter

- Computing Architecture
- Energy
- Dealing with Failures

Basic Elements of a Datacenter: Computing Architecture

Architecture Basics

Main Elements

Storage: distributed file system (e.g. GFS) or NAS?
- GFS is cheaper and faster for read operations

Network: 1-Gbps switch with 48 ports in a rack
- 1 port per server, 8 ports toward the cluster-level switch

→ Oversubscription factor of 5 or more (40 server ports share 8 uplinks)
- scarce cluster-level bandwidth
- attractive rack-level networking
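The oversubscription figure follows from simple port arithmetic. The sketch below is an illustration only: it assumes every port runs at the same 1 Gbps speed and that each server uses exactly one port, and the function name is hypothetical. With all 8 uplinks in use the ratio is exactly 5; it grows beyond 5 as soon as fewer uplinks are wired.

```python
def oversubscription_factor(total_ports: int, uplink_ports: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth for one rack switch.

    Assumes all ports have the same speed and one port per server.
    """
    if not 0 < uplink_ports < total_ports:
        raise ValueError("uplink_ports must be between 1 and total_ports - 1")
    server_ports = total_ports - uplink_ports
    return server_ports / uplink_ports


# Figures from the slide: a 48-port switch with 8 ports used as uplinks.
factor = oversubscription_factor(total_ports=48, uplink_ports=8)
print(factor)  # 40 server ports / 8 uplinks = 5.0

# Worst-case share of cluster-level bandwidth per server, in Mbps,
# if all 40 servers send traffic out of the rack at the same time.
per_server_mbps = 1000 / factor
print(per_server_mbps)  # 200.0

# With only 4 uplinks, 44 servers share them and the factor rises to 11.
print(oversubscription_factor(total_ports=48, uplink_ports=4))  # 11.0
```

This is why traffic inside a rack is attractive (a full 1 Gbps per server) while traffic across racks is the scarce resource.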

System Overview (in 2009)

Basic Elements of a Datacenter: Energy

Main Electric Components

Cooling Challenge

Evaluating Energy Efficiency

$\text{efficiency} = \dfrac{\text{computation}}{\text{total energy}}$

Power Usage Effectiveness: $\mathrm{PUE} = \dfrac{\text{total energy}}{\text{energy in equipment}}$
- first-generation datacenter PUE was poor (often ≥ 3.0)
- trend toward a PUE around 1.2

Server PUE: $\mathrm{SPUE} = \dfrac{\text{total server power}}{\text{critical component power}}$
- basic SPUE is 1.7
- state-of-the-art servers reach 1.2

and computing efficiency
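To see how the two ratios combine, the book cited in the Literature slide factors overall efficiency as (1/PUE) × (1/SPUE) × computing efficiency. The sketch below only evaluates the first two factors, i.e. the share of facility energy that actually reaches the critical components. Pairing a PUE of 3.0 with an SPUE of 1.7 (and 1.2 with 1.2) is an assumption made for illustration, and the function name is hypothetical.

```python
def infrastructure_efficiency(pue: float, spue: float) -> float:
    """Fraction of facility energy delivered to the critical components."""
    if pue < 1.0 or spue < 1.0:
        raise ValueError("PUE and SPUE are overhead ratios and cannot be below 1.0")
    return 1.0 / (pue * spue)


# Assumed pairings of the figures quoted on the slide.
scenarios = {
    "first generation (PUE 3.0, SPUE 1.7)": (3.0, 1.7),
    "state of the art (PUE 1.2, SPUE 1.2)": (1.2, 1.2),
}

for label, (pue, spue) in scenarios.items():
    share = infrastructure_efficiency(pue, spue)
    print(f"{label}: {share:.1%} of the energy reaches the electronics")
```

Under these assumptions, roughly one fifth of the energy reaches the electronics in the first case and roughly two thirds in the second.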

Computing Efficiency

Benchmarking cluster-level efficiency: ongoing work

Benchmarking an individual computer is easier

Toward Energy-Proportional Servers

Basic Elements of a Datacenter: Dealing with Failures

Fault-Tolerant Application-Level

Fault-tolerant software infrastructure layer:
- masks failures of lower layers
- reduces hardware cost
- eases operational procedures (e.g., upgrades)

But the application level still experiences failures:
- service is in degraded mode
- service is unreachable
- service is corrupted (loss of data)

Origin of Impacting Failures

Cause            % of events
software         33 %
configuration    28 %
human            13 %
network          12 %
hardware         11 %
other             3 %

hardware faults are masked by fault-tolerant software

Failures and Crash

- Average machine availability is ≈ 99.9%
- 95% of machines restart less than once a month
- 80% of restart events last less than 10 minutes

Software most frequent faults (in one year):
- DRAM soft-errors: 1% experience uncorrectable errors
- disk soft-errors: 3% of drives see corrupted sectors
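A quick calculation helps read the 99.9% figure. The sketch below converts an availability ratio into downtime per year and into the expected number of unavailable machines at any instant. The cluster size of 10,000 machines is a hypothetical value chosen for illustration, and the helper names are not from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year during which one machine is unavailable."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_machines_down(n_machines: int, availability: float) -> float:
    """Average number of machines unavailable at a given instant."""
    return n_machines * (1.0 - availability)


# 99.9% availability per machine, as quoted on the slide.
print(round(downtime_hours_per_year(0.999), 2))         # 8.76 hours per year
# Hypothetical cluster of 10,000 machines.
print(round(expected_machines_down(10_000, 0.999), 1))  # 10.0 machines
```

Even with highly available machines, a large cluster should expect some machines to be down at all times, which is why the software layer has to tolerate failures.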

Software Infrastructure

- Fundamentals
- Cluster-Level: MapReduce
- Application-Level: Web Search

Software Infrastructure: Fundamentals

Three Layers

Infrastructure-level software:
- kernel, operating systems, networking libraries

Cluster-level software (middleware):
- specific software operating a pool of servers

Application-level software:
- implementation of the Internet services

Main Software Components

- replication
- partitioning
- load-balancing
- health checking
- integrity check
- compression
- weak consistency

- MapReduce
- Dynamo
- BigTable
- Hadoop
- Sawzall
- Chubby
- Dryad

OS at a Cluster-Level Scale

Resource Management: mapping tasks to resources
- should optimize energy usage

Hardware Abstraction: handling hardware elements
- should optimize performance

Deployment Maintenance: upgrading and monitoring
- should reduce manual tasks

Programming Frameworks: easing implementation
- should increase programmer productivity

Software Infrastructure, Cluster-Level: MapReduce

Motivation

Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Functional Programming

Two fundamental functions:

map: apply a function to a list of elements

      map f [] = []
    | map f (x::xs) = (f x) :: (map f xs)

    map square [1,2,5] → [1,4,25]

reduce: build a value from a function and a list

      reduce f a [] = a
    | reduce f a (x::xs) = reduce f (f x a) xs

    reduce add 0 [1,3,6] → 10
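For readers less familiar with ML-style notation, the same two definitions can be sketched in Python. This is a direct recursive transcription for illustration, not production code; the names my_map and my_reduce are chosen here to avoid shadowing Python's built-ins.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def my_map(f: Callable[[A], B], xs: List[A]) -> List[B]:
    """Apply f to every element: empty list in, empty list out."""
    if not xs:
        return []
    # Head of the list is transformed, the tail is handled recursively.
    return [f(xs[0])] + my_map(f, xs[1:])


def my_reduce(f: Callable[[A, B], B], acc: B, xs: List[A]) -> B:
    """Fold the list into one value, starting from the accumulator acc."""
    if not xs:
        return acc
    # Same argument order as on the slide: f takes the element, then the accumulator.
    return my_reduce(f, f(xs[0], acc), xs[1:])


print(my_map(lambda x: x * x, [1, 2, 5]))           # [1, 4, 25]
print(my_reduce(lambda x, a: x + a, 0, [1, 3, 6]))  # 10
```

Each call to f inside my_map is independent of the others, which is the property that lets a map phase run in parallel on many machines.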

MapReduce

Implementing two functions w.r.t. data (key, val)

map: smaller sub-problems distributed to nodes

    map (inKey, inVal) → list (outKey, v)

- produces intermediate values with an output key

reduce: combines results of sub-problems

    reduce (outKey, list v) → outVal

- produces an output value from intermediate values

Example: Word Count

    map(filename, content):
        for each w in content:
            emitInt(w, 1)

    reduce(word, partCount):
        int result = 0
        for pc in partCount:
            result += pc
        emit(result)

map(file1, “hello me, goodbye me”) → <hello,1><me,1><goodbye,1><me,1>

map(file2, “hello you, bye you”) → <hello,1><you,1><bye,1><you,1>

a given key is allocated to a given server

reduce(hello, <1,1>) → 2
...

<hello,2><me,2><you,2><goodbye,1><bye,1>
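The word-count example can be run end to end on one machine with a small simulation. The sketch below is an in-memory illustration under stated assumptions, not how a real MapReduce runtime is implemented: the "servers" are plain dictionaries, the tokenizer (lowercase alphabetic words) is a choice made here, and keys are assigned to reducers with a CRC32 hash, which is one possible way to make sure a given key always lands on the same server.

```python
import re
import zlib
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_fn(filename: str, content: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word found in the file content."""
    for word in re.findall(r"[a-z]+", content.lower()):
        yield word, 1


def reduce_fn(word: str, part_counts: List[int]) -> int:
    """Sum the partial counts collected for one word."""
    result = 0
    for pc in part_counts:
        result += pc
    return result


def partition(key: str, n_reducers: int) -> int:
    """Pick the reducer for a key; CRC32 is stable across runs and machines."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def word_count(files: Dict[str, str], n_reducers: int = 2) -> Dict[str, int]:
    # Shuffle: one bucket per simulated reducer, grouping values by key.
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(n_reducers)]
    for filename, content in files.items():
        for word, one in map_fn(filename, content):
            buckets[partition(word, n_reducers)][word].append(one)

    # Reduce: each simulated reducer only sees the keys allocated to it.
    counts: Dict[str, int] = {}
    for bucket in buckets:
        for word, ones in bucket.items():
            counts[word] = reduce_fn(word, ones)
    return counts


files = {"file1": "hello me, goodbye me", "file2": "hello you, bye you"}
print(sorted(word_count(files).items()))
# [('bye', 1), ('goodbye', 1), ('hello', 2), ('me', 2), ('you', 2)]
```

Because all values for "hello" end up in the same bucket, the reducer handling that key can compute the final count without talking to any other reducer.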

MapReduce Implemented

Software Infrastructure, Application-Level: Web Search

Basics

Input:
- the Web ≈ 100 billion files ≈ 400 terabytes
- the pagerank algorithm
- a query “w1 AND w2 AND ··· AND wn”

Output:
- a list of files containing all words wi, i ∈ [1, n], sorted by the pagerank algorithm
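The input/output contract above can be illustrated on a toy corpus. The sketch below assumes the index is a plain dictionary from word to the set of files containing it, and that page ranks are already available as precomputed scores; the file names and score values are made up for the example, and computing pagerank itself is out of scope here.

```python
from typing import Dict, List, Set


def search(query_words: List[str],
           index: Dict[str, Set[str]],
           rank: Dict[str, float]) -> List[str]:
    """Return the files containing every query word, best-ranked first."""
    if not query_words:
        return []
    # One posting set per word; an unknown word matches no file.
    postings = [index.get(word, set()) for word in query_words]
    # "w1 AND w2 AND ... AND wn" is the intersection of the posting sets.
    matches = set.intersection(*postings)
    # Sort by decreasing score; ties are broken by file name for determinism.
    return sorted(matches, key=lambda doc: (-rank.get(doc, 0.0), doc))


# Hypothetical toy index and scores.
index = {
    "cloud": {"a.html", "b.html", "c.html"},
    "datacenter": {"b.html", "c.html"},
    "energy": {"c.html", "d.html"},
}
rank = {"a.html": 0.4, "b.html": 0.1, "c.html": 0.3, "d.html": 0.2}

print(search(["cloud", "datacenter"], index, rank))  # ['c.html', 'b.html']
print(search(["cloud", "energy"], index, rank))      # ['c.html']
```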

Implementation Overview

Offline task: index management
- based on keywords:
    - a word is associated with a table
    - a table contains all occurrences in the web
- distributed on thousands of machines
- multiple copies and weak consistency

Online task: query management
- from front-end Web servers to a subset of machines:
    - compute and rank their local results
    - all best results are combined
- from intermediate servers to servers holding file replicas:
    - from file pointers to a set of metadata
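The online part (local ranking on each machine, then combining the best results) is a scatter-gather pattern. The sketch below is a single-process illustration under assumptions that are not in the slides: each shard is a dictionary holding the index for its own subset of documents, scores are precomputed, and the front end simply loops over the shards instead of issuing parallel network requests.

```python
import heapq
from typing import Dict, List, Set, Tuple

Shard = Dict[str, Set[str]]  # word -> documents held by this shard
Hit = Tuple[float, str]      # (score, document)


def local_top_k(shard: Shard, words: List[str],
                scores: Dict[str, float], k: int) -> List[Hit]:
    """Work done on one index machine: match locally, keep the k best hits."""
    postings = [shard.get(word, set()) for word in words]
    docs = set.intersection(*postings) if postings else set()
    return heapq.nlargest(k, ((scores[doc], doc) for doc in docs))


def query(shards: List[Shard], words: List[str],
          scores: Dict[str, float], k: int) -> List[str]:
    """Work done by the front end: gather local results and merge them."""
    partial: List[Hit] = []
    for shard in shards:  # a real system would contact the shards in parallel
        partial.extend(local_top_k(shard, words, scores, k))
    return [doc for _, doc in heapq.nlargest(k, partial)]


# Hypothetical two-shard index over four documents.
shards = [
    {"cloud": {"a", "b"}, "energy": {"a"}},
    {"cloud": {"c", "d"}, "energy": {"c", "d"}},
]
scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}

print(query(shards, ["cloud", "energy"], scores, k=2))  # ['d', 'c']
```

Each shard returns at most k hits, so the data exchanged with the front end stays small regardless of how many documents a shard holds.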

Discussion

The user-perceived latency is less than one second:
- read-only operations
- high parallelism
- many thousands of queries per second
- traffic variations

Networking:
- tiny size of data exchanges
- possible packet loss around the front-end servers

Conclusion

Key Challenges

Time-scale:
- datacenters are expected to last 10 years
- Internet apps gain popularity in weeks

Hardware components:
- processors are faster and more energy efficient
- memory systems and networks are not

Server evolution:
- more cores mean more parallelism

Personal Thoughts

A new era in the Internet:
- the industrial era of applications
- computer science does really matter

About nano-datacenter
- using your always-on devices
