2012 in4392 lecture-5 cloudprogrammingmodels
TRANSCRIPT
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
1/95
IN4392 Cloud Computing
Cloud Programming Models
2012-2013
1
Cloud Computing (IN4392)
D..!. "pema and #. Iosup. $it% &ontri'utions rom Claudio Martella and ogdan *%it.
Parallel and Distri'uted *roup+, Delt
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
2/95
,erms or ,oda/s Dis&ussion
Programming model= language + libraries + runtime system that create
a model of computation (an abstract machine)
Examples message!passing "s shared memory# data! "stask!parallelism# $
#'stra&tion leel
= distance from physical machine
% What is the best abstraction le"el&
'1'!'1 '
Examples Assembly lo*!le"el "s Java is high le"elany design trade!offs performance# ease!of!use#common!task optimi,ation# programming paradigm# $
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
3/95
C%ara&teristi&s o a Cloud Programming Model
1- .ost model (Efficiency) = cost/performance# o"erheads# $
- ca a y- ault!tolerance
2- 0upport for specific ser"ices
3- .ontrol model# e-g-# fine!grained many!task scheduling4- 5ata model# including partitioning and placement# out!of!
6- 0ynchroni,ation model
'1'!'1
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
4/95
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
5/95
,oda/s C%allenges
e0cience
:he ourth 8aradigm :he 5ata 5eluge and 9ig 5ata
8ossibly others
'1'!'1 3
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
6/95
e&ien&e ,%e $%
0cience experiments already cost '3;3< budget $ and perhaps incur 63< of the delays
illions of lines of code *ith similar functionality
e co e reuse across pro ec s an app ca on oma ns $ but last t*o decades? science is "ery similar in structure
ost results difficult to share and reuse .ase!in!point 0loan 5igital 0ky 0ur"ey
digital map of '3< of the sky x spectra2:9+ sky sur"ey data
'+ astro!ob>ects (images)1+ ob>ects *ith spectrum (spectra)@o* to make it *ork for this andthe next generation of scientists&
'1'!'1 4
0ource Aim Bray and Clex 0,alay#e0cience !! C :ransformed 0cientific ethod#http//research-microsoft-com/en!us/um/people/gray/talks/D.!.0:9Fe0cience-ppt
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
7/95
e&ien&e (!o%n ,alor &i.,e&%. 1999)
# ne5 s&ientii& met%od .ombine science *ith 7:
ull scientific process control scientific instrument or produce data
# #results# "isuali,e results
ostly compute!intensi"e# e-g-# simulation of complex phenomena
I, support 7nfrastructure @. Brid# Gpen 0cience Brid# 5C0# DorduBrid# $
"6amples
H physics# 9ioinformatics# aterial science# Engineering# Comp&i'1'!'1 6
% Why is .omp0ci an example here&
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
8/95
,%e 7ourt% Paradigm ,%e $% (#n #ne&dotal "6ample)
,%e 8er5%elming *ro5t% o no5ledge
When 1' men founded theoyal 0ociety in 144# it *aspossible for an educated
1II6'1
1II1II6
Dumber of8ublications
scientific kno*ledge- J$K 7nthe last 3 years# such hasbeen the pace of scientificad"ance that e"en the bestscientists cannot keep up
outside their o*n field-:ony 9lair#8 0peech# ay '' 5ata Ling#:he scientific impact of nations#Dature?2-
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
9/95
,%e 7ourt% Paradigm ,%e $%at
:housand years agoscience *as empiri&al describing natural phenomena
ast fe* hundred years 22
.
4 cGa=
7rom pot%esis to Data
eore &a ranc using models# generali,ations
ast fe* decadesa &omputational branch simulating complex phenomena
:oday (t%e 7ourt% Paradigm)data e6ploration
aa
%1 What is the ourth8aradigm&
# #
5ata captured by instruments or generated by simulator 8rocessed by soft*are 7nformation/Lno*ledge stored in computer 0cientist analy,es results using data management and statistics
'1'!'1 I
0ource Aim Bray and :he ourth 8aradigm#http//research-microsoft-com/en!us/collaboration/fourthparadigm/
%' What are the dangersof the ourth 8aradigm&
% l :
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
10/95
,%e Data Deluge:,%e $%
ME"ery*here you look# the Nuantityof information in the *orld issoaring- Cccording to one
# rexabytes (billion gigabytes) ofdata in '3- :his year# it *illcreate 1#' exabytes- erely
keeping up *ith this flood# andstoring the bits that might beuseful# is difficult enough-
Cnalysing it# to spot patterns and
1
extract use u n ormat on# sharder still-:he 5ata 5eluge# :he Economist#'3 ebruary '1
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
11/95
Data Deluge: ,%e Personal Meme6 "6ample
Oanne"ar 9ush in the 1I2s record your life
7: edia aboratory :he @uman 0peechome8ro>ect/:otalecall# data mining/analysis/"isio
5eb oy and upal 8atel record practically e"ery
*aking moment of their son?s first three years('< pri"acy time$7s this e"en legal&P 0hould it be&P)
!
11
C75 + tapes# 1 computersR 'k hours audio!"ideo 5ata si,e 'B9/day# 1-389 total
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
12/95
Data Deluge: ,%e *aming #nalti&s "6ample
E% 77 ':9/year all logs
1'
a o - ser"e s a s cson player logs
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
13/95
Data Deluge: Datasets in Comp.&i.
5ataset0i,e
http//g*a-e*i-tudelft-nlhttp//g*a-e*i-tudelft-nl
8eer!to!8eer :race Crchi"e 1B9
1B9
1B9
1:9
1:9/yr
8'8:C
Bam:C:he ailure:race
Crchi"e
http//fta-inria-frhttp//fta-inria-fr
1
$ 8WC# 7:C# .CW5C5# $
1#s of scientists rom theory to practice
1
SearTI T1 T11T4
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
14/95
Data Deluge:
,%e Proessional $orld *ets Conne&ted
'1'!'1 12
eb '1'
0ource Oincen,o .osen,a# :he 0tate of inked7n#http//"incos-it/the!state!of!linkedin/
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
15/95
$%at is ig Data:;
Oery large# distributed aggregations of loosely structureddata# often incomplete and inaccessible
Easily exceeds the processing capacity of con"entionaldatabase systems
8rinciple of 9ig 5ata When you can# keep e"erythingP
:oo big# too fast# and doesn?t comply *ith the traditionaldatabase architectures
'11!'1' 13
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
16/95
,%e ,%ree
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
17/95
Data $are%ouse s. ig Data
'11!'1' 16
0ource http//*ikibon-org/
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
18/95
#genda
1- 7ntroduction
-
3. Programming Models or Compute-Intensie$or=loads
1. ags o ,as=s'- Workflo*s
- 8arallel 8rogramming odels
2- 8rogramming odels for 9ig 5ata3- 0ummary
'1'!'1 1Q
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
19/95
$%at is a ag o ,as=s (o,); # stem obs sent by a user$
:ime JunitsK Why 9ag of :asks& rom the perspecti"e
of the user# obs in set are ust tasks of a lar er ob
$that start at most Us afterthe first >ob % What is the user?s "ie*&
C single useful result from the complete 9o: esult can be combination of all tasks# or a selection
of the results of most or e"en a single task'1'!'1 1I
Iosup et al., The Characteristics andPerformance of Groups of Jobs in Grids,Euro-Par, LNCS, ol.!"!#, pp. $%&-$'$, '6.
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
20/95
#ppli&ations o t%e o, Programming Model
8arameter s*eeps #
Oery useful in engineering and simulation!based science
onte .arlo simulations 0imulation *ith random elements fixed time yet limited inaccuracy Oery useful in engineering and simulation!based science
any other types of batch processing 8eriodic computation# .ycle sca"enging
Oery useful to automate operations and reduce *aste
'1'!'1 '
, t% D i t P i
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
21/95
o,s e&ame t%e Dominant ProgrammingModel or *rid Computing
(V0) Brid
(.C) [email protected]:
(EV) EBEE
(V0) .ondor V-Wisc-
(V0) :eraBrid!' D.0C
' 2 4 Q 1
(D) 5C0!'
() Brid3
(DG#0E) DorduBrid
(VL) C
(V0) BGW
7rom >o's ?@A
(V0) Brid
(.C) [email protected]:
(EV) EBEE
(V0) .ondor V-Wisc-(V0) :eraBrid!' D.0C
'1'!'1 '1
' 2 4 Q 1
(D) 5C0!'
() Brid3
(DG#0E) DorduBrid(VL) C
(V0) BGW
7rom CP,ime ?@A
Iosup and Epema( Grid Computin) *or+loads.IEEE Internet Computin) #&( #'-&" &/##
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
22/95
Pra&ti&al #ppli&ations o t%e o, Programming Model
Parameter 5eeps in Condor ?1+4A
0ue the scientist *ants toind the "alue of x# #, for
1 "alues for x and y# and 4 "alues for ,
olution un a parameter s5eep# *ith
1 x 1 x 4 = 4 parameter "alues 8roblem of the solution
!
machine- 7t takes 4 hours- :hat?s 1B0 das uninterrupted computation on 0ue?s machineP
'1'!'1 ''
0ource .ondor :eam# .ondor Vser?s :utorial-http//cs-u*isc-edu/condor
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
23/95
Pra&ti&al #ppli&ations o t%e o, Programming Model
Parameter 5eeps in Condor ?2+4AUniverse = vanilla
Executable = sim.exe
Input = input.txt
u pu = ou pu . x
Error = error.txt
Log = sim.log
Requirements = OpSys == WINNT61 &&Arch == INTEL &&
(Disk >= DiskUsage) && ((Memory * 1024)>=ImageSize)
.omplex 0Cs canbe specified easily
InitialDir = run_$(Process)
Queue 600
'1'!'1 '
0ource .ondor :eam# .ondor Vser?s :utorial-http//cs-u*isc-edu/condor
Clso passed asparameter to sim.exe
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
24/95
Pra&ti&al #ppli&ations o t%e o, Programming Model
Parameter 5eeps in Condor ?3+4A
% condor_submit sim.submit
Submitting job(s)...............................................................................................................................................................................................................................................................
Logging submit event(s)................................................................................................................................................................................................
...............................................................
600 job(s) submitted to cluster 3.
'1'!'1 '2
0ource .ondor :eam# .ondor Vser?s :utorial-http//cs-u*isc-edu/condor
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
25/95
Pra&ti&al #ppli&ations o t%e o, Programming Model
Parameter 5eeps in Condor ?4+4A
% condor_q
-- Submitter: x.cs.wisc.edu : :x.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD3.0 frieda 4/20 12:08 0+00:00:05 R 0 9.8 sim.exe
3.1 frieda 4/20 12:08 0+00:00:03 I 0 9.8 sim.exe
3.2 frieda 4/20 12:08 0+00:00:01 I 0 9.8 sim.exe
3.3 frieda 4/20 12:08 0+00:00:00 I 0 9.8 sim.exe...
3.598 frieda 4/20 12:08 0+00:00:00 I 0 9.8 sim.exe
3.599 frieda 4 20 12:08 0+00:00:00 I 0 9.8 sim.exe
600 jobs; 599 idle, 1 running, 0 held
'1'!'1 '3
0ource .ondor :eam# .ondor Vser?s :utorial-http//cs-u*isc-edu/condor
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
26/95
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
27/95
$%at is a $o=lo5;
W = set of >obs *ith
(think 5irect Ccyclic Braph)
'1'!'1 '6
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
28/95
#ppli&ations o t%e $or=lo5 ProgrammingModel
.omplex applications .om lex filterin of data
.omplex analysis of instrument measurements
Cpplications created by non!.0 scientistsH
Workflo*s ha"e a natural correspondence in the real!*orld#as descriptions of a scientific procedure
Oisual model of a graph sometimes easier to program
8recursor of the apeduce 8rogramming odel(next slides)
'1'!'1 'Q
HCdapted from .arole Boble and 5a"id de oure# .hapter in :he ourth8aradigm# http//research-microsoft-com/en!us/collaboration/fourthparadigm/
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
29/95
$or=lo5s "6isted in *rids 'ut Did Note&ome a Dominant Programming Model
:races
0elected indings
Braph *ith !2 le"els C"erage W si,e is /22 >obs 63obs or less# I3< are si,ed ' >obs or less
'1'!'1 'I
0stermann et al., 0n the Characteristics of Grid*or+flo1s, CoreG2I3 Inte)rated 2esearch in GridComputin) CGI*, 'Q.
P ti l # li ti t% $7 P i M d l
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
30/95
Pra&ti&al #ppli&ations o t%e $7 Programming Model
ioinormati&s in ,aerna
'1'!'1
0ource .arole Boble and 5a"id de oure# .hapter in :he ourth 8aradigm#http//research-microsoft-com/en!us/collaboration/fourthparadigm/
# d
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
31/95
#genda
1- 7ntroduction
-
3. Programming Models or Compute-Intensie$or=loads
1- 9ags of :asks
'- Workflo*s
3. Parallel Programming Models
2- 8rogramming odels for 9ig 5ata3- 0ummary
'1'!'1 1
P ll l P i M d l
,D &ourse in4049,D &ourse in4049
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
32/95
Parallel Programming Models
Cbstract machines (5istributed) shared memory
,as=,as= (groups o B B minutes)(groups o B B minutes)dis&uss arallel ro rammin in &loudsdis&uss arallel ro rammin in &louds
,D &ourse in4049,D &ourse in4049Introdu&tion to PCIntrodu&tion to PC
str ute memory
.onceptual programming models aster/*orker
5i"ide and conNuer
5ata / :ask parallelism
,as=,as=(inter(inter--group dis&ussion)group dis&ussion)dis&uss.dis&uss.
0ystem!le"el programming models :hreads on B8Vs and other multi!cores
'1'!'1 '
4arbanescu et al.( To1ards an Effectie 5nifiedPro)rammin) 6odel for 6an7-Cores. IP3PS *S &/#&
# d
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
33/95
#genda
1- 7ntroduction
-
- 8rogramming odels for .ompute!7ntensi"e Workloads
4. Programming Models or ig Data
3- 0ummary
'1'!'1
"&osstems o ig-Data Programming Models
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
34/95
&osste s o g ata og a g ode s
0% @i"e 8igAC% 5ryad7D%0cope C%
@igh!e"el anguage
9ig%uerylume 0a*,alleteor
% Where does !on!demand fit&% Where does 8regel!on!B8Vs fit&
5remel
0er"ice:ree
Mapedu&e Model ClgebrixP#C,
87/
Erlang
Nep%ele @yracks5ryadadoop+
#N
@aloop
Pregel
C,ure
Engine
:era
5ataEngine
Execution Engine
lume
Engine
5ataflo*
*irap%
Csterix9!tree
'1'!'1 2
0
D7 .osmos0
Cdapted from 5agstuhl 0eminar on 7nformation anagement in the .loud#http//***-dagstuhl-de/program/calendar/partlist/&semnr=11'1X0VGB
C,ure5ata0tore
:era5ata0tore
0torage Engine
OoldemortB00
H 8lus Yookeeper# .5D# etc-
#genda
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
35/95
#genda
1- 7ntroduction
-
- 8rogramming odels for .ompute!7ntensi"e Workloads
4. Programming Models or ig Data
1. Mapedu&e'- Braph 8rocessing
- Gther 9i 5ata 8ro rammin odels
3- 0ummary
'1'!'1 3
Mapedu&e
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
36/95
Mapedu&e
odel for processing and generating large data sets
Enables a functional!like ro rammin model
0plits computations into independent parallel tasks
akes efficient use of large commodity clusters
@ides the details of paralleli,ation# fault!tolerance#data distribution# monitoring and load balancing
'11!'1' 4
Mapedu&e ,%e Programming Model
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
37/95
Mapedu&e ,%e Programming Model
C programmig model# not a programming languageP1- 7nput/Gutput
'- ap 8hase 8rocesses input key/"alue pair
8roduces set of intermediate pairs
map (in_key, in_value) -> list(out_key, interm_value)
3. Reduce Phase:
v u v y
Produces a set of merged output values
reduce(out_key, list(interm_value)) -> list(out_value)
'11!'1' 6
Mapedu&e in t%e Cloud
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
38/95
acebook use case 0ocial!net*orking ser"ices
Cnaly,e connections in the graph of friendships to recommend ne*
connections Boogle use case
Web!base email ser"ices# Boogle5ocs
Cnaly,e messages and user beha"ior to optimi,e ad selection andplacement
Soutube use case Oideo!sharing sites
Cnaly,e user preferences to gi"e better stream suggestions
'11!'1' Q
$ord&ount "6ample
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
39/95
$ord&ount "6ample
ile 1 :he big data is big- ile ' apeduce tames big data-
Map 8utput apper!1 (:he# 1)# (big# 1)# (data# 1)# (is# 1)# (big# 1)
apper!' (apeduce# 1)# (tames# 1)# (big# 1)# (data# 1)
edu&e Input educer!1 (:he# 1)
educer!' (big# 1)# (big# 1)# (big# 1)
edu&e 8utput educer!1 (:he# 1)
educer!' (big# )
'11!'1' I
educer! (data# 1)# (data# 1)
educer!2 (is# 1)
educer!3 (apeduce# 1)# (apeduce# 1)
educer!4 (tames# 1)
educer! (data# ')
educer!2 (is# 1)
educer!3 (apeduce# ')
educer!4 (tames# 1)
Colored Euare Counter
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
40/95
Colored Euare Counter
'11!'1' 2
Mapedu&e F *oogle "6ample 1
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
41/95
Mapedu&e F *oogle "6ample 1
Benerating language model statistics oun o mes e"ery !*or seNuence occurs n arge corpus
of documents (and keep all those *here count [= 2)
apeduce solution ap extract 3!*ord seNuences =[ count from document
educe combine counts# and keep if count large enough
'11!'1' 21
http//***-slideshare-net/>hammerb/mapreduce!pact4!keynote
Mapedu&e F *oogle "6ample 2
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
42/95
Mapedu&e F *oogle "6ample 2
Aoining *ith other data
Benerate per!doc summary# but include per!host summary(e-g-# Z of pages on host# important terms on host)
apeduce solution ap extract host name from V# lookup per!host info#
combine *ith per!doc data and emit
e uce en y unc on em ey "a ue rec y
'11!'1' 2'
http//***-slideshare-net/>hammerb/mapreduce!pact4!keynote
More "6amples o Mapedu&e #ppli&ations
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
43/95
More "6amples o Mapedu&e #ppli&ations
5istributed Brep
.ount of V Cccess reNuency
!
:erm!Oector per @ost
5istributed 0ort
7n"erted indices 8age ranking
'11!'1' 2
$
"6e&ution 8erie5 (1+2)
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
44/95
( + )
% What is the performance problem raised by this step&
'11!'1' 22
"6e&ution 8erie5 (2+2)
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
45/95
"6e&ution 8erie5 (2+2)
Gne master# many *orkers example 7nput data split into map tasks (42 / 1'Q 9) '# maps
educe phase partitioned into reduce tasks 2# reduces
5ynamically assign tasks to *orkers '# *orkers
aster assigns each map / reduce task to an idle *orker
ap *orker 5ata locality a*areness
ead input and generate local files *ith key/"alue pairs
educe *orker
ead intermediate output from mappers
0ort and reduce to produce the output
'11!'1' 23
7ailures and a&=-up ,as=s
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
46/95
p
'11!'1' 24
$%at is adoop; # Mapedu&e "6e&. "ngine
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
47/95
p p g
7nspired by Boogle# supported by SahooP
5esigned to perform fast and reliable analysis of the big data
arge expansion in many domains such as inance# technology# telecom# media# entertainment# go"ernment#
research institutions
'11!'1' 26
adoop F a%oo
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
48/95
When you "isityahoo# you areinteracting *ithdata rocessed*ith @adoopP
'11!'1' 2Q
https//open&irrus-org/system/files/8penCirrusadoop'I-ppt
adoop se Cases
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
49/95
p
1- 0earch Sahoo# Cma,on# Y"ents
'- o rocessin
acebook# Sahoo# Aoost# ast-fm- ecommendation 0ystems
acebook
2- 5ata Warehouse
acebook# CG
3- Oideo and 7mage Cnalysis De* Sork :imes# Eyealike
'11!'1' 2I
http//cloud-berkeley-edu/data/hdfs-pdf
adoop Distri'uted 7ile stem
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
50/95
Cssumptions and goals 5ata distributed across hundreds or thousands of machines
5etection of faults and automatic reco"ery 5esigned for batch processing "s- interacti"e use
@igh throughput of data access "s- lo* latency of data access
@igh aggregate band*idth and high scalability
Write!once!read!many file access model
o"ing computations is cheaper than mo"ing data minimi,e net*ork
'11!'1' 3
D7 #r&%ite&ture
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
51/95
aster/sla"e architecture
DameDode DD anages the file system namespace egulates access to files by clients
Gpen/close/rename files or directories
apping of blocks to 5ataDodes 5ataDode (5D)
Gne er node in the cluster
anages local storage of the node 9lock creation/deletion/replication initiated by DD
0er"e read/*rite reNuests from clients
'11!'1' 31
D7 Internals
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
52/95
eplica 8lacement 8olicy irst replica on one node in the local rack
0econd replica on a different node in the local rack
r rep ca on a eren no e n a eren rac
impro"ed *rite performance ('/ are on the local rack)
preser"es data reliability and read performance
.ommunication 8rotocols ayered on top of :.8/78
r \
5ataDode 8rotocol 5ataDodes \ DameDode DameDode responds to 8. reNuests issued by 5ataDodes / clients
'11!'1' 3'
adoop &%eduler
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
53/95
Aob di"ided into se"eral independent tasks executed in parallel :he input file is split into chunks of 42 / 1'Q 9 Each chunk is assigned to a map task
e uce as aggrega e e ou pu o e map as s
:he master assigns tasks to the *orkers in 7G order Aob:racker maintains a Nueue of running >obs# the states of the
:ask:rackers# the tasks assignments
:ask:rackers report their state "ia a heartbeat mechanism
5ata ocality execute tasks close to their data 0peculati"e execution re!launch slo* tasks
'11!'1' 3
Mapedu&e "olution ?1+BA
adoop is Maturing Important Contri'utors
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
54/95
adoop is Maturing Important Contri'utors
ches
Dum
berofpat
'1'!'1 32
0ource http//***-theregister-co-uk/'1'/Q/16/communityFhadoop/
Mapedu&e "olution ?2+BA
Gate &%eduler
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
55/95
Gate &%eduler
C:E 0cheduler ongest Cpproximate :ime to End
0peculati"ely execute the task that *ill finish farthest into the future
timeFleft = (1!progress0core) / progressate :asks make progress at a roughly constant rate
obust to node heterogeneity
E.' 0ort running times C:E "s- @adoop "s- Do spec-
'11!'1' 33
8aharia et al.( Improin) 6ap2educe performance inhetero)eneous enironments. 0S3I &//%.
Mapedu&e "olution ?3+BA
7#I &%eduling
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
56/95
7#I &%eduling
7solation and statistical multiplexing :*o!le"el architecture
0econd# each pool allocates its slots among multiple >obs 9ased on a max!min fairness policy
'11!'1' 34
8aharia et al.( 3ela7 schedulin)( a simple techni9ue forachiein) localit7 and fairness in clusterschedulin). EuroS7s &/#/. :lso T2 EECS-&//'-
Mapedu&e "olution ?4+BA
Dela &%eduling
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
57/95
Dela &%eduling
5ata locality issues @ead!of!line scheduling 7G# C7
o* probability for small >obs to achie"e data locality
3Q< of >obs ] C.E9GGL ha"e ^ '3 maps Gnly 3< achie"e node locality "s- 3I< rack locality
'11!'1' 36
8aharia et al.( 3ela7 schedulin)( a simple techni9ue forachiein) localit7 and fairness in clusterschedulin). EuroS7s &/#/. :lso T2 EECS-&//'-
Mapedu&e "olution ?B+BA
Dela &%eduling
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
58/95
Dela &%eduling
0olution 0kip the head!of!line >ob until you find a >ob that can launch a local task
Wait :' seconds before launching tasks off!rack
:1 = :' = 13 seconds =[ Q< data locality
'11!'1' 3Q
8aharia et al.( 3ela7 schedulin)( a simple techni9ue forachiein) localit7 and fairness in clusterschedulin). EuroS7s &/#/. :lso T2 EECS-&//'-
8#G# and Mapedu&e ?1+9A
LGCC unner
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
59/95
LGCC 8lacement X allocation .entral for all clusters
aintains cluster metadata
!unner .onfiguration X deployment cluster monitoring
Bro*/0hrink mechanism
! 8erformance isolation
5ata isolation
ailure isolation
Oersion isolation
'11!'1' 3I
oala and Mapedu&e ?2+9A
esiHing Me&%anism
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
60/95
:*o types of nodes Core nodes fully!functional nodes# *ith :ask:racker and 5ataDode (local disk access)
,ransient nodes compute nodes# *ith :ask:racker
8arameters 7 = Dumber of running tasks per number of a"ailable slots
8redefined 7min and 7ma6 thresholds
8redefined constant step gro5tep/ s%rin=tep
, = time elapsed bet*een t*o successi"e resource offers
ree po c es *ro5-%rin= Poli& (*P) gro*!shrink but maintain bet*een 7min and 7ma6 *reed-*ro5 Poli& (**P) gro*# shrink *hen *orkload done
*reed-*ro5-5it%-Data Poli& (**DP) BB8 *ith core nodes (local disk access)
'11!'1' 4
oala and Mapedu&e ?3+9A
$or=loads
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
61/95
IQ< of >obs ] acebook process 4-I 9 and take less than a minute J1K
Boogle reported in '2 computations *ith :9 of data on 1s of machines J'K
- # - # - # - # ! #pp- 2\34# '1'-
J'K A- 5ean and 0- Bhema*at# apreduce 0implified 5ata 8rocessing on arge .lusters# .omm- of the C.# Ool- 31# no- 1# pp- 16\11# 'Q-
'11!'1' 41
oala and Mapedu&e ?4+9A
$ord&ount (,pe 0 ull sstem)
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
62/95
.8V 5isk
1 B9 input data
Q map tasks executed in 6 *a"es Wordcount is .8V!bound in the map phase
'11!'1' 4'
oala and Mapedu&e ?B+9A
ort (,pe 0 ull sstem)
.8V 5isk
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
63/95
.8V 5isk
8erformed by the frame*ork during the shuffling phase
0hort map phase *ith 2
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
64/95
Gptimum
a) Wordcount b) 0ort(:ype 1) (:ype ')
0peedup relati"e to an cluster *ith 1 core nodes
Close to linear speedup on &ore nodes
'11!'1' 42
oala and Mapedu&e ?J+9A
"6e&ution ,ime s. ,ransient Nodes
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
65/95
7nput data of 2 B9
Wordcount output data = ' L9 0ort output data = 2 B9
Wordcount scales better than 0ort on transient nodes'11!'1' 43
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
66/95
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
67/95
#genda
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
68/95
1- 7ntroduction
-
- 8rogramming odels for .ompute!7ntensi"e Workloads
4. Programming Models or ig Data
1- apeduce
2. *rap% Pro&essing
- Gther 9i 5ata 8ro rammin odels
3- 0ummary
'1'!'1 4Q
*rap% Pro&essing "6ampleingle-our&e %ortest Pat% (P)
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
69/95
5i>kstra?s algorithm
Vpdate neighbors G(_E_ + _O_`log_O_) *ith ibo@eap
7nitial datasetC ^# (9# 3)# (5# )[
. ^inf# (# 3)[5 ^inf# (9# 1)# (.# )# (E# 2)# (# 2)[
---'1'!'1 4I
0ource .laudio artella# 8resentation on Biraph at :V 5elft# Cpr '1'-
*rap% Pro&essing "6ampleP in Mapedu&e
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
70/95
apper output distances
% What is the performance problem&
^ # [# # [# # n [# ---
9ut also graph structure
^C# ^# (9# 3)# (5# )[ ---
educer input distances9 ^inf# 3[# 5 ^inf# [ ---
9ut also graph structure9 ^inf# (E# 1)[ ---
'1'!'1 6
0ource .laudio artella# 8resentation on Biraph at :V 5elft# Cpr '1'-
D>obs# *here D is thegraph diameter
,%e PregelProgramming Model or *rap% Pro&essing
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
71/95
g g p g
9atch!oriented processing
uns in!memory
Oertex!centric C87
ault!tolerant
!
GL# the actual model follo*s in the next slides
'1'!'1 61
0ource .laudio artella# 8resentation on Biraph at :V 5elft# Cpr '1'-
,%e PregelProgramming Model or *rap% Pro&essing
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
72/95
g g p g
9ased on Oaliant?s
8rocessorsocal .omputation
9ulk 0ynchronous 8arallel (908) D processing units *ith fast local memory
0hared communication medium
0eries of upersteps
Blobal 0ynchroni,ation 9arrier
Ends *hen all "ote:o@alt
.ommunication
8regel executes initiali,ation# oneor se"eral supersteps# shutdo*n
'1'!'1 6'
9arrier0ynchroni,ation
0ource .laudio artella# 8resentation on Biraph at :V 5elft# Cpr '1'-
Pregel,%e uperstep
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
73/95
p p
Each Oertex (execution in parallel) ecei"e messages from other "ertices
8erform o*n computation (user!defined function)
odify o*n state or state of outgoing messages
utate topology of the graph
0end messages to other "ertices
:ermination condition Cll "ertices inacti"e
Cll messages ha"e been transmitted
'1'!'1 6
0ource 8regel article-
Pregel ,%e
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
74/95
7nputmessage
7mplementsprocessingalgorithm
Gutput
'1'!'1 62
message
0ource 8regel article-
,%e Pregel #r&%ite&tureMaster-$or=er
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
75/95
aster assigns "ertices to Workers Braph partitioning
aster
r r u r
aster coordinates .heckpoints
Workers execute "ertices compute()
Workers exchange messages directly
Worker1
Workern
1 '3
'1'!'1 63
2
6k
0ource 8regel article-
Pregel Perorman&eP on 1 illion-
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
76/95
'1'!'1 64
0ource 8regel article-
Pregel Perorman&eP on andom *rap%s
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
77/95
'1'!'1 66
0ource 8regel article-
#pa&%e *irap%#n 8pen-our&e Implementation o Pregel
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
78/95
:asktracker :asktracker:asktracker :asktracker
oose implementation of 8regel
DDDD X A:X A:YookeeperYookeeper
0lot0lot 0lot0lot 0lot0lot 0lot0lot0lot0lot 0lot0lot 0lot0lot 0lot0lot
8ersistentcomputation state
aster
0trong community (acebook# :*itter# inked7n)
uns 1< on existing @adoop clusters
0ingle ap!only >ob'1'!'1 6Q
0ource .laudio artella# 8resentation on Biraph at :V 5elft# Cpr '1'-%ttp++in&u'ator.apa&%e.org+girap%+%ttp++in&u'ator.apa&%e.org+girap%+
Page ran= 'en&%mar=s
:iberium :an
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
79/95
:iberium :an Clmost 2 nodes# shared among numerous groups in SahooP
@adoop -'-'2 (secure @adoop)
- # #
org-apache-giraph-benchmark-8ageank9enchmark Benerates data# number of edges# number of "ertices# Z of
supersteps
1 master/YooLeeper
' supersteps
Do checkpoints
1 random edge per "ertex
6I
0ource C"ery .hing presentation at @ortonWorks-http//***-slideshare-net/a"eryching/'11112horton*orks/
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
80/95
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
81/95
'
3
1
13
,
otalse&onds
Q1
1 ' 2 3 4
o erti&es (in 100s o millions)
0ource C"ery .hing presentation at @ortonWorks-http//***-slideshare-net/a"eryching/'11112horton*orks/
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
82/95
3'
illions)
1
'
2
3
1
13
erti&es(100Nso
,
otalse&onds
Q'
3 1 13 ' '3 3
M
o 5or=ers
0ource C"ery .hing presentation at @ortonWorks-http//***-slideshare-net/a"eryching/'11112horton*orks/
#genda
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
83/95
1- 7ntroduction
-
- 8rogramming odels for .ompute!7ntensi"e Workloads4. Programming Models or ig Data
1- apeduce
'- Braph 8rocessing
3. 8t%er i Data Pro rammin Models
3- 0ummary
'1'!'1 Q
tratosp%ere
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
84/95
eteor Nuery language# 0upremo operator frame*ork
r r r r r Extended set of 'nd order functions ("s apeduce) 5eclarati"e definition of data parallelism
Dephele execution engine 0chedules multiple dataflo*s simultaneously
#
@50 storage engine
'1'!'1 Q2
tratosp%ereProgramming Contra&ts (P#C,s) ?1+2A
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
85/95
'nd!order function
'1'!'1 Q3
0ource 8C.: o"er"ie*#https//stratosphere-eu/*iki/doku-php/*ikipactpm
Clso in apeduce
tratosp%ereProgramming Contra&ts (P#C,s) ?2+2A
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
86/95
% @o* can 8C.:soptimi,e data
processing&
'1'!'1 Q4
0ource 8C.: o"er"ie*#https//stratosphere-eu/*iki/doku-php/*ikipactpm
tratosp%ereNep%ele
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
87/95
'1'!'1 Q6
tratosp%ere s Mapedu&e
8C.: extends apeduce
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
88/95
9oth propose 'nd!order functions (3 8C.:s "s ap X educe)
9oth reNuire from user 1st!order functions (*hat?s inside the ap)
o can ene rom g er! e"e anguages
8C.: ecosystem has 7aa0 support
Ley!"alue
data model
'1'!'1 QQ
0ource abian @ueske# arge 0cale 5ata Cnalysis 9eyond apeduce# @adoop Bet:ogether# eb '1'-
tratosp%ere s Mapedu&ePair5ise %ortest Pat%s ?1+3A
l d W h ll Cl ith
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
89/95
loyd!Warshall Clgorithm For k from 1 to n
For j from 1 to n
Di,j min( Di,j , Di,k + Dk,j )
0ource 0tratosphere example#https//stratosphere-eu/*iki/lib/exe/detail-php/*ikiall'allFspFtaskdescription-png&id=*iki
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
90/95
7nputundirected graph
'- 0ort bysmaller "ertex 75
!
Gutputtriangles
2- :riads to triangles1- ead input graph
'1'!'1 I
0ource abian @ueske# arge 0cale 5ata Cnalysis 9eyond apeduce# @adoop Bet:ogether# eb '1' and 0tratosphere example-
tratosp%ere s Mapedu&e,riangle "numeration ?2+3A
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
91/95
apeduce
'1'!'1 I1
0ource abian @ueske# arge 0cale 5ata Cnalysis 9eyond apeduce# @adoop Bet:ogether# eb '1' and 0tratosphere example-
tratosp%ere s Mapedu&e,riangle "numeration ?3+3A
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
92/95
0tratosphere
'1'!'1 I'
0ource abian @ueske# arge 0cale 5ata Cnalysis 9eyond apeduce# @adoop Bet:ogether# eb '1' and 0tratosphere example-
#genda
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
93/95
1- 7ntroduction
-
- 8rogramming odels for .ompute!7ntensi"e Workloads2- 8rogramming odels for 9ig 5ata
B. ummar
'1'!'1 I
Con&lusion ,a=eCon&lusion ,a=e--ome Messageome Message
Programming model L &omputer sstem a'stra&tion
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
94/95
Programming Models or Compute-Intensie $or=loads
any trade!offs# fe* dominant programming models
odels bags of tasks# *orkflo*s# master/*orker# 908# $
Programming Models or ig Data
ig data programming models %ae e&osstems
any trade!offs# many programming models
odels apeduce# 8regel# 8C.:# 5ryad# $
Gctober 1# '1' I2
Execution engines @adoop# Loala+# Biraph# 8C.:/Dephele# 5ryad# $
ealit &%e&= &loud programming is maturing
http//***-flickr-com/photos/dimitrisotiropoulos/2'264421Q/
eading Material (eall #&tie 7ield)
$or=loads
Clexandru 7osup# 5ick @- A- Epema Brid .omputing Workloads- 7EEE 7nternet .omputing 13(') 1I!'4 ('11)
,%e 7ourt% Paradigm
-
8/12/2019 2012 IN4392 Lecture-5 CloudProgrammingModels
95/95
g
:he ourth 8aradigm# http//research-microsoft-com/en!us/collaboration/fourthparadigm/
Programming Models or Compute-Intensie $or=loads
Clexandru 7osup# athieu Aan# Gmer G,an 0onme,# 5ick @- A- Epema :he .haracteristics and 8erformance of Broups of Aobs in Brids-Euro!8ar '6 Q'!I
Gstermann et al-# Gn the .haracteristics of Brid Workflo*s# .oreB75 7ntegrated esearch in Brid .omputing (.B7W)# 'Q-http//***-pds-e*i-tudelft-nl/iosup/*ftraces6charsFcamera-pdf
.orina 0tratan# Clexandru 7osup# 5ick @- A- Epema C performance study of grid *orkflo* engines- B75 'Q '3!'
Programming Models or ig Data
Aeffrey 5ean# 0an>ay Bhema*at apeduce 0implified 5ata 8rocessing on arge .lusters- G057 '2 16!13
Aeffrey 5ean# 0an>ay Bhema*at apeduce a flexible data processing tool- .ommun- C. 3(1) 6'!66 ('1) atei Yaharia# Cndy Lon*inski# Cnthony 5- Aoseph# andy Lat,# and 7on 0toica- 'Q- 7mpro"ing apeduce performance in
heterogeneous en"ironments- 7n 8roceedings of the Qth V0ED7 conference on Gperating systems design and implementation (G057Q)-V0ED7 Cssociation# 9erkeley# .C# V0C# 'I!2'-
# # # # #for achie"ing locality and fairness in cluster scheduling- Euro0ys '1 '43!'6Q
:yson .ondie# Deil .on*ay# 8eter Cl"aro# Aoseph - @ellerstein# Lhaled Elmeleegy# ussell 0ears apeduce Gnline- D057 '1 1!'Q
Br,egor, ale*ic,# atthe* @- Custern# Cart A- .- 9ik# Aames .- 5ehnert# 7lan @orn# Daty eiser# Br,egor, .,a>ko*ski 8regel a systemfor large!scale graph processing- 07BG5 .onference '1 13!124
5ominic 9attr# 0tephan E*en# abian @ueske# Gde> Lao# Oolker arkl# 5aniel Warneke Dephele/8C.:s a programming model andexecution frame*ork for *eb!scale analytical processing- 0o.. '1 11I!1
'1'!'1 I3