cloud engineering
Post on 29-Nov-2014
Cloud Engineering
Software in Datacenters
Gwendal Simon
Department of Computer Science
Institut Telecom
2009
Gartner Hype Cycle 2009
2 / 43 Gwendal Simon Cloud Engineering
Literature
A book: “The Datacenter as a Computer”, Luiz André Barroso and Urs Hölzle
and scientific publications including:
- ACM Sigops (SOSP) and Usenix OSDI
- HPDC, EuroPar, OOPSLA, etc.
- blogs (e.g. http://perspectives.mvdirona.com/)
Disclaimer
No discussion about the impact of cloud computing:
- net neutrality
- interactions between CDNs and ISPs
- privacy and electronic human rights
- ...
Focus here on how cloud computing works
Introduction
In a nutshell
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
NIST: http://csrc.nist.gov/groups/SNS/cloud-computing/index.html
Why Cloud Computing
The *aaS paradigm:
- SaaS: Software as a Service
  applications for end-users (salesforce.com, Google, etc.)
  - email, office suite, photo sharing, video storage
- PaaS: Platform as a Service
  services for web app developers (Azure, Google, etc.)
  - workflow facilities and various basic services (http, database)
- IaaS: Infrastructure as a Service
  resources for developers (Amazon, Joyent, etc.)
  - servers, network equipment, memory, CPU
An Engineer's Vision
Benefits: Outsourcing Infrastructure
- reduce run time and response time
- minimize infrastructure risk
- ease deployment and upgrading
Challenge: Warehouse-Scale Computers
- thousands of individual computing nodes
- costly equipment (power, conditioning, cooling)
- large buildings with engineering teams
A Few Pictures
Datacenter is Different
Datacenter vs. Desktop Software:
- inherent parallelism
- software control
- platform homogeneity
- fault-free requirement
Datacenter vs. High-Performance Computing:
- unpredictable input
- high volume of data
- not only computing
Basic Elements of a Datacenter
- Computing Architecture
- Energy
- Dealing with Failures
Architecture Basics
Main Elements
Storage: distributed file system (e.g. GFS) or NAS?
- GFS is cheaper and faster for read operations
Network: 1-Gbps switch with 48 ports in a rack
- 1 port per server, 8 ports as uplinks to the cluster-level switch
→ Oversubscription factor greater than 5
- scarce cluster-level bandwidth
- attractive rack-level networking
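The oversubscription figure follows from port arithmetic. The sketch below assumes that the 40 ports not used toward the cluster each connect one server; the slide only gives the 48-port total and the 8 cluster-facing ports, and the function name is hypothetical. Under that assumption the rack boundary alone yields a factor of 5, so the "greater than 5" on the slide presumably includes contention further up the network.

```python
def oversubscription(total_ports, uplink_ports, port_gbps=1.0):
    """Ratio of server-facing bandwidth to cluster-facing bandwidth in one rack."""
    # Assumption: every port that is not an uplink connects exactly one server.
    server_ports = total_ports - uplink_ports
    downlink_gbps = server_ports * port_gbps
    uplink_gbps = uplink_ports * port_gbps
    return downlink_gbps / uplink_gbps


factor = oversubscription(total_ports=48, uplink_ports=8)
# Worst-case share of cluster-level bandwidth per server when all servers
# in the rack talk outside the rack at once (in Gbps).
per_server_gbps = 1.0 / factor
print(factor, per_server_gbps)
```

Inside the rack each server keeps its full 1 Gbps, which is why rack-level networking is the attractive case.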
System Overview (in 2009)
Main Electrical Components
Cooling Challenge
Evaluating Energy Efficiency
$\text{efficiency} = \dfrac{\text{computation}}{\text{total energy}}$
$\text{Power Usage Effectiveness (PUE)} = \dfrac{\text{total energy}}{\text{energy in equipment}}$
- first-generation datacenter PUE was poor (often ≥ 3.0)
- trend toward PUE around 1.2
$\text{Server PUE (SPUE)} = \dfrac{\text{total server power}}{\text{critical component power}}$
- basic SPUE is 1.7
- state-of-the-art servers reach 1.2
and computing efficiency
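A small worked example may help relate the two ratios. It assumes that PUE and SPUE act as multiplicative overheads, with SPUE read as total server power over critical component power (the reading consistent with the quoted values of 1.7 and 1.2). The function name is hypothetical; the numbers are the ones quoted on the slide.

```python
def useful_energy_fraction(pue, spue):
    """Fraction of the facility's energy that reaches the critical components.

    Assumption: the building overhead (PUE) and the server overhead (SPUE)
    multiply, so 1 unit of useful energy costs pue * spue units in total.
    """
    return 1.0 / (pue * spue)


# First-generation datacenter with basic servers.
legacy = useful_energy_fraction(pue=3.0, spue=1.7)
# Efficient datacenter with state-of-the-art servers.
modern = useful_energy_fraction(pue=1.2, spue=1.2)
print(round(legacy, 3), round(modern, 3))  # 0.196 0.694
```

Under these assumptions roughly 20% of the energy reaches the computing components in the poor case and about 69% in the good case; the remaining factor, computing efficiency, describes how much computation that energy actually buys.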
Computing Efficiency
Benchmarking cluster-level efficiency: ongoing work
Benchmarking an individual computer is easier
Toward Energy-Proportional Servers
Fault Tolerance at the Application Level
Fault-tolerant software infrastructure layer:
- masks failures of lower layers
- reduces hardware cost
- eases operational procedures (e.g., upgrade)
But the application level still experiences failures:
- service is in degraded mode
- service is unreachable
- service is corrupted (loss of data)
Origin of Impacting Failures
Cause           % of events
software        33 %
configuration   28 %
human           13 %
network         12 %
hardware        11 %
other            3 %
Hardware faults are masked by fault-tolerant software.
Failures and Crashes
Average machine availability is ≈ 99.9%
- 95% of machines restart less than once a month
- 80% of restart events last less than 10 minutes
Most frequent faults for software (in one year):
- DRAM soft errors: 1% experience uncorrectable errors
- disk soft errors: 3% of drives see corrupted sectors
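To put the availability figure in perspective, the sketch below converts it into downtime per machine and into the number of machines expected to be down at any instant. The 2,000-machine cluster size is an assumption chosen for illustration (the deck only says "thousands of nodes"), and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours per year during which one machine is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_machines_down(n_machines, availability):
    """Average number of machines unavailable at a given instant."""
    return n_machines * (1.0 - availability)


per_machine = downtime_hours_per_year(0.999)
# Assumed cluster size, for illustration only.
down_now = expected_machines_down(2000, 0.999)
print(round(per_machine, 2), round(down_now, 1))  # 8.76 2.0
```

Even at 99.9% availability, a cluster of this assumed size has about two machines down at any time, which is why failures have to be handled in software rather than treated as exceptions.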
Software Infrastructure
- Fundamentals
- Cluster-Level: MapReduce
- Application-Level: Web Search
Three Layers
Infrastructure-level software:
- kernel, operating systems, networking libraries
Cluster-level software (middleware):
- specific software operating a pool of servers
Application-level software:
- implementation of the Internet services
Main Software Components
- replication
- partitioning
- load-balancing
- health checking
- integrity check
- compression
- weak consistency
MapReduce, Dynamo, BigTable, Hadoop, Sawzall, Chubby, Dryad
OS at a Cluster-Level Scale
Resource Management: mapping tasks to resources
- should optimize energy usage
Hardware Abstraction: handling hardware elements
- should optimize performance
Deployment Maintenance: upgrading and monitoring
- should reduce manual tasks
Programming Frameworks: easing implementation
- should increase programmer productivity
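As an illustration of the resource-management goal, the sketch below packs tasks onto as few servers as possible so that the remaining servers could stay idle. The slides do not name an algorithm; first-fit decreasing is a classic bin-packing heuristic swapped in here, and the loads (percent of one server's CPU) are made-up values.

```python
def first_fit_decreasing(task_loads, server_capacity=100):
    """Assign tasks to servers, opening a new server only when needed.

    Returns a list of servers, each a list of the task loads placed on it.
    """
    servers = []
    for load in sorted(task_loads, reverse=True):
        if load > server_capacity:
            raise ValueError(f"task of load {load} exceeds server capacity")
        for server in servers:
            if sum(server) + load <= server_capacity:
                server.append(load)
                break
        else:
            # No existing server has room: power on one more.
            servers.append([load])
    return servers


# Hypothetical task loads, in percent of one server.
placement = first_fit_decreasing([50, 70, 20, 40, 10, 30])
print(placement)  # [[70, 30], [50, 40, 10], [20]]
```

A real cluster scheduler would also weigh memory, network locality, and failure domains; the point here is only that the mapping of tasks to machines determines how many machines must be powered.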
Motivation
Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Functional Programming
Two fundamental functions:
map: apply a function to a list of elements
    map f [] = []
  | map f (x::xs) = (f x) :: (map f xs)
    map square [1,2,5] → [1,4,25]
reduce: build a value from a function and a list
    reduce f a [] = a
  | reduce f a (x::xs) = reduce f (f x a) xs
    reduce add 0 [1,3,6] → 10
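The two recursive definitions can be transcribed almost line for line into Python. The names `my_map`, `my_reduce`, `square`, and `add` are illustrative; the argument order `f(x, a)` follows the slide.

```python
def my_map(f, xs):
    """map f [] = [] ; map f (x::xs) = (f x) :: (map f xs)"""
    if not xs:
        return []
    return [f(xs[0])] + my_map(f, xs[1:])


def my_reduce(f, a, xs):
    """reduce f a [] = a ; reduce f a (x::xs) = reduce f (f x a) xs"""
    if not xs:
        return a
    return my_reduce(f, f(xs[0], a), xs[1:])


def square(x):
    return x * x


def add(x, a):
    return x + a


print(my_map(square, [1, 2, 5]))    # [1, 4, 25]
print(my_reduce(add, 0, [1, 3, 6]))  # 10
```

In everyday Python the built-in `map` and `functools.reduce` play these roles; note that `functools.reduce` calls its function as `f(accumulator, x)`, the opposite argument order from the slide.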
MapReduce
Implementing two functions w.r.t. data (key, val)
map: smaller sub-problems distributed to nodes
    map (inKey, inVal) → list (outKey, v)
    produces intermediate values with an output key
reduce: combines results of sub-problems
    reduce (outKey, list v) → outVal
    produces an output value from intermediate values
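The two signatures imply a grouping step between them: every intermediate value must be collected under its output key before reduce runs. The sketch below is a minimal single-process driver under that reading, not the distributed implementation; `run_mapreduce` and the parity example are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Apply mapper to each (inKey, inVal), group by outKey, then reduce.

    mapper(inKey, inVal)   -> iterable of (outKey, v)
    reducer(outKey, [v..]) -> outVal
    """
    groups = defaultdict(list)
    for in_key, in_val in inputs:
        for out_key, v in mapper(in_key, in_val):
            groups[out_key].append(v)
    return {out_key: reducer(out_key, vs) for out_key, vs in groups.items()}


# Illustrative job: largest even and largest odd number across all inputs.
def parity_mapper(name, numbers):
    return [("even" if n % 2 == 0 else "odd", n) for n in numbers]


def max_reducer(parity, values):
    return max(values)


result = run_mapreduce([("a", [3, 9, 4]), ("b", [7, 1])],
                       parity_mapper, max_reducer)
print(result)
```

In a real deployment the mapper calls run on many nodes and the grouping step becomes a network shuffle, but the contract between the two user-supplied functions is the same.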
Example: Word Count
map(filename, content):
    for each w in content:
        emitInt(w, 1)
reduce(word, partCount):
    int result = 0
    for pc in partCount:
        result += pc
    emit(result)
map(file1, “hello me, goodbye me”) → <hello,1><me,1><goodbye,1><me,1>
map(file2, “hello you, bye you”) → <hello,1><you,1><bye,1><you,1>
a given key is allocated to a given server
reduce(hello, <1,1>) → 2
...
<hello,2><me,2><you,2><goodbye,1><bye,1>
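The word-count example can be run end to end in a few lines. This is a single-process sketch: the "servers" are simulated as buckets, and the rule that a given key goes to a given server is modeled with a CRC32 hash modulo the number of reducers, which is an assumption (the slide does not say how keys are allocated). All function names are illustrative.

```python
import re
import zlib
from collections import defaultdict


def map_words(filename, content):
    """Emit (word, 1) for every word in the file content."""
    return [(w, 1) for w in re.findall(r"\w+", content.lower())]


def reduce_counts(word, part_counts):
    """Sum the partial counts collected for one word."""
    return sum(part_counts)


def partition(key, n_reducers):
    """Deterministically allocate a key to one reducer (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def word_count(files, n_reducers=2):
    # One bucket per simulated reducer: word -> list of partial counts.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for filename, content in files.items():
        for word, one in map_words(filename, content):
            buckets[partition(word, n_reducers)][word].append(one)
    result = {}
    for bucket in buckets:
        for word, counts in bucket.items():
            result[word] = reduce_counts(word, counts)
    return result


files = {"file1": "hello me, goodbye me", "file2": "hello you, bye you"}
counts = word_count(files)
print(counts)
```

Because every occurrence of a word hashes to the same bucket, each reducer sees the complete list of partial counts for its keys, which is what makes the per-key sum correct.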
MapReduce Implemented
Basics
Input:
- the Web ≈ 100 billion files ≈ 400 terabytes
- the PageRank algorithm
- a query “w1 AND w2 AND · · · AND wn”
Output:
- a list of files containing all words wi, i ∈ [1, n]
- sorted by the PageRank algorithm
Implementation Overview
Offline task: index management
- based on keywords:
  - a word is associated with a table
  - a table contains all occurrences in the web
- distributed on thousands of machines
- multiple copies and weak consistency
Online task: query management
- front-end Web servers forward the query to a subset of machines:
  - compute and rank their local results
  - all best results are combined
- intermediate servers forward to servers holding file replicas:
  - from file pointers to a set of metadata
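The offline and online tasks can be sketched on a toy scale. Each shard below holds an inverted index for its own documents, answers an AND query locally, and returns its best results; the front end then merges them. The documents and rank scores are made-up stand-ins (the scores play the role of PageRank values), and all names are hypothetical.

```python
import heapq


def build_index(docs):
    """Offline: map each word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def local_search(index, rank, words, k):
    """Online, on one shard: documents containing all words, best k by rank."""
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return heapq.nlargest(k, ((rank[d], d) for d in hits))


def search(shards, words, k=3):
    """Front end: query every shard and combine the best local results."""
    merged = []
    for index, rank in shards:
        merged.extend(local_search(index, rank, words, k))
    return [doc for _, doc in heapq.nlargest(k, merged)]


# Two toy shards with invented documents and rank scores.
docs1 = {"d1": "cloud computing basics", "d2": "cloud storage"}
docs2 = {"d3": "cloud computing at scale", "d4": "desktop computing"}
shards = [
    (build_index(docs1), {"d1": 0.4, "d2": 0.9}),
    (build_index(docs2), {"d3": 0.7, "d4": 0.5}),
]
print(search(shards, ["cloud", "computing"]))  # ['d3', 'd1']
```

Each shard only needs to return its top k results, so the data exchanged with the front end stays small regardless of how many documents a shard holds.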
Discussion
The user-perceived latency is less than one second:
- read-only operations
- high parallelism
- many thousands of queries per second
- traffic variations
Networking:
- tiny size of data exchanges
- possible packet loss around the front-end servers
Conclusion
Key Challenges
Time-scale:
- datacenters are expected to last 10 years
- Internet apps gain popularity in weeks
Hardware components:
- processors are faster and more energy efficient
- memory systems and networks are not
Server evolution:
- more cores mean more parallelism
Personal Thoughts
A new era in the Internet:
- the industrial era of applications
- computer science does really matter
About nano-datacenters:
- using your always-on devices