ctbd x preparation - github pages · •hive, hbase, yarn •futures, promises, actors •spark...

39
Exam Preparation Guido Salvaneschi 1

Upload: others

Post on 22-May-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

ExamPreparationGuidoSalvaneschi

1

Page 2: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

LectureMaterial

Lectures• Introtodist.systems• MapReduce• HDFS• Hive,HBase,Yarn• Futures,Promises,Actors• Spark• Sparkstreaming

2

Papers• MapReduce• GFS• Spark

Exercises• MapReduce• Futures,Actors• Spark

Page 3: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Warning!

• Thesearejustexamplesofthekindofquestionsthatcanappearintheexam.

• Theyarenotsupposedtobecomplete(ofcourse).

• Theyarenotrepresentativeofthecoverageofthecoursetopicsintheexam.

• Theydonotcoverquestionsaboutcoding(but“simple”exercisesprovidegoodexamplesforthat).

3

Page 4: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Explain3reasonsthatmotivatebuildingasysteminadistributedway

4

Page 5: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

WhyDistributedSystems

• Functionaldistribution• Computershavedifferentfunctionalcapabilities(e.g.,Fileserver,printer)yetmayneedtoshareresources

• Client/server• Datagathering/dataprocessing

• Incrementalgrowth• Easiertoevolvethesystem• Modularexpandability

• Inherentdistributioninapplicationdomain• Banks,reservationservices,distributedgames,mobileapps• physicallyoracrossadministrativedomains• cashregisterandinventorysystemsforsupermarketchains• computersupportedcollaborativework

Page 6: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

WhyDistributedSystems

• Economics• collectionsofmicroprocessorsofferabetterprice/performanceratiothanlargemainframes.

• Lowprice/performanceratio:costeffectivewaytoincreasecomputingpower.

• Betterperformance• Loadbalancing• Replicationofprocessingpower• Adistributedsystemmayhavemoretotalcomputingpowerthanamainframe.Ex.10,000CPUchips,eachrunningat50MIPS.Notpossibletobuild500,000MIPSsingleprocessorsinceitwouldrequire0.002nsec instructioncycle.Enhancedperformancethroughloaddistributing.

• IncreasedReliability• Exploitindependentfailuresproperty• Ifonemachinecrashes,thesystemasawholecanstillsurvive.

• Anotherdrivingforce:theexistenceoflargenumberofpersonalcomputers,theneedforpeopletocollaborateandshareinformation.

Page 7: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Explain3goals(andchallenges)ofdistributedsystems

Page 8: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Goalsandchallengesofdistributedsystems

• Transparency• Howtoachievethesingle-systemimage

• Performance• Thesystemprovideshigh(computing,storage,..)performance

• Scalability• Theabilitytoservemoreusers,provideacceptableresponsetimeswithincreasedamountofdata

• Openness• Anopendistributedsystemcanbeextendedandimprovedincrementally• Requirespublicationofcomponentinterfacesandstandardsprotocolsforaccessinginterfaces

• Reliability/faulttolerance• Maintainavailabilityevenwhenindividualcomponentsfail

• Heterogeneity• Network,hardware,operatingsystem,programminglanguages,differentdevelopers

• Security• Confidentiality,integrityandavailability

Page 9: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Whichtechniquescanbeusedtomakeasystemscalable?Brieflyexplainthem.

9

Page 10: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Scalingtechniques

Distribution• Splittingaresource(suchasdata)intosmallerparts,andspreadingthepartsacrossthesystem(cf DNS)

10

Page 11: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Scalingtechniques

• Replication• Replicateresources(services,data)acrossthesystem,canaccesstheminmultipleplaces

• Cachingtoavoidrecomputation• Increasedavailabilityreducestheprobabilitythatabiggersystembreaks

• Hidingcommunicationlatencies• Avoidwaitingforresponsestoremoteservicerequests

• Useasynchronouscommunication

11

Page 12: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

ShowthesignatureoftheMapfunctionandtheReducefunctioninMapReduce.WhatistheMapphaseandwhataretheReducephaseresponsiblefor?

Page 13: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Functionalprogramming“foundations”

• mapinMapReduce↔mapinFP• map::(a→b)→[a]→[b]• Example:Doubleallnumbersinalist.• >map((*)2)[1,2,3]>[2,4,6]

• Inapurelyfunctionalsetting,anelementofalistbeingcomputedbymapcannotseetheeffectsofthecomputationsonotherelements.

• Iftheorderofapplicationofafunctionftoelementsinalistiscommutative,thenwecanreorderorparallelizeexecution.

13

Note:Thereisnoprecise1-1correspondence.Pleasetakethisjustasananalogy.

Page 14: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Functionalprogramming“foundations”

• Moveoverthelist,applyftoeachelementandanaccumulator.freturnsthenextaccumulatorvalue,whichiscombinedwiththenextelement.

• reduceinMapReduce↔foldinFP• foldl ::(b→a→b)→b→[a]→b• Example:Sumofallnumbersinalist.• >foldl (+)0[1,2,3]foldl (+)0[1,2,3]>6

14

Note:Thereisnoprecise1-1correspondence.Pleasetakethisjustasananalogy.

Page 15: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

MapReduceBasicProgrammingModel

• Transformasetofinputkey-valuepairstoasetofoutputvalues:• Map:(k1,v1)→list(k2,v2)• MapReducelibrarygroupsallintermediatepairswithsamekeytogether.

• Reduce:(k2,list(v2))→list(v2)

15

Page 16: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Whatistheproblemwith“stragglers”(slowworkers)andwhatcanbedonetosolvethisproblem?

Page 17: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Stragglers&BackupTasks

• Problem:“Stragglers”(i.e.,slowworkers)significantlylengthenthecompletiontime.

• Solution:Closetocompletion,spawnbackupcopiesoftheremainingin-progresstasks.

• Whicheveronefinishesfirst,“wins”.

• Additionalcost:afewpercentmoreresourceusage.• Example:Asortprogramwithoutbackup=44%longer.

17

Page 18: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

SketchtheGFSarchitecturepresentingthecomponentsthatconstitutesitandthemaininteractions.

Page 19: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

GFS- Overview

19

Page 20: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Explainwhatafutureis

20

Page 21: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Explainwhatafutureis

21

• Placeholderobjectforavaluethatmaynotyetexist• ThevalueoftheFutureissuppliedconcurrentlyandcansubsequentlybeused

Page 22: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

WhichunderlyingdatastructureisusedbyApacheSpark?Showaminimal exampleandindicatewheresuchdatastructureisused.

22

Page 23: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

RDD (ResilientDistributedDatasets)

• Restrictedformofdistributedsharedmemory• immutable,partitionedcollectionofrecords• canonlybebuiltthroughcoarse-graineddeterministictransformations

(map,filter,join...)

• Efficientfault-toleranceusinglineage• Logcoarse-grainedoperationsinsteadoffine-graineddataupdates• AnRDDhasenoughinformationabouthowit’sderivedfromother

dataset• Recompute lostpartitionsonfailure

23

Page 24: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

SparkandRDDs

• ImplementsResilientDistributedDatasets(RDDs)

• OperationsonRDDs• Transformations:definesnewdatasetbasedonpreviousones• Actions:startsajobtoexecuteoncluster

• Well-designedinterfacetorepresentRDDs• Makesitveryeasyto

implementtransformations• MostSparktransformation

implementation<20LoC

24

Page 25: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

MoreonRDDs

Workwithdistributedcollectionsasyouwouldwithlocalones

• Resilientdistributeddatasets(RDDs)• Immutablecollectionsofobjectsspreadacrossacluster• Builtthroughparalleltransformations(map,filter,etc)• Automaticallyrebuiltonfailure• Controllablepersistence(e.g.,cachinginRAM)

• Differentstoragelevelsavailable,fallbacktodiskpossible

• Operations• Transformations (e.g.map,filter,groupBy,join)

• LazyoperationstobuildRDDsfromotherRDDs• Actions (e.g.count,collect,save)

• Returnaresultorwriteittostorage

Page 26: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

WorkflowwithRDDs

• CreateanRDDfromadatasource: <list>• ApplytransformationstoanRDD:mapfilter• ApplyactionstoanRDD:collectcount

distFile = sc.textFile("...", 4) • RDDdistributedin4partitions• Elementsarelinesofinput• Lazyevaluationmeansnoexecutionhappensnow

26

Page 27: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

GiveapossibleexplanationwhythecomputationofthePageRankissignificantlydifferentbetweenHadoop andSpark

171

80

23

14

020406080100120140160180200

30 60

Iteratio

ntim

e(s)

Numberofmachines

HadoopSpark

Page 28: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Spark

• Fast,expressiveclustercomputingsystemcompatiblewithApacheHadoop

• WorkswithanyHadoop-supportedstoragesystem(HDFS,S3,Avro,…)

• Improvesefficiency through:• In-memorycomputingprimitives• Generalcomputationgraphs

• Improvesusability through:• RichAPIsinJava,Scala,Python• Interactiveshell

Up to 100× faster

Often 2-10× less code

Page 29: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

PageRank

• Givepagesranks(scores)basedonlinkstothem• Linksfrommanypagesè highrank• Linkfromahigh-rankpageè highrank

• Goodexampleofamorecomplexalgorithm• Multiplestagesofmap&reduce

• BenefitsfromSpark’sin-memorycaching• Multipleiterationsoverthesamedata

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Page 30: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Whatisaresourcemanagementsystem,e.g.,ApacheYARN?

Page 31: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

ResourceManagement

• Typicallyimplementedbyasystemdeployedacrossnodesofacluster

• Layerbelow“frameworks”likeHadoop• Onanynode,thesystemkeepstrackofavailabilities• Applicationsontopuseinformationandestimationsofownrequirementstochoosewheretodeploysomething

• RMsystems(RMSs)differinabstractions/interfaceprovidedandactualschedulingdecisions

Page 32: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

GiventhescenarioX,whatisthetechnology/approachthatyouwouldrecommendforsolvingproblemY?• MapReduce• HDFS• Adatabase• HBase• ApacheSpark• Sparkstreaming• ...

32

Page 33: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

MapReducevs.TraditionalRDBMS

MapReduce TraditionalRDBMSDatasize Petabytes GigabytesAccess Batch Interactiveandbatch

Updates Writeonce,readmanytimes

Readandwritemanytimes

Structure Dynamicschema StaticschemaIntegrity Low High(normalizeddata)

Scaling Linear Non-linear(generalSQL)

33

Page 34: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

ASummary

34

Page 35: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

35

ProgrammingMod

el

DataOrganization

Declarative

StructuredFlatRawTypes

Proced

ural

Page 36: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Event-drivenapplications

• Canweuseexistingtechnologiesforbatchprocessing?• Theyarenotdesignedtominimizelatency• Weneedawholenewmodel!

Page 37: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Esper inanutshell

• EPL:richlanguagetoexpressrules• GroundedontheDSMSapproach

• Windowing• Relationalselect,join,aggregate,…• Relation-to-streamoperatorstoproduceoutput• Sub-queries

• Queriescanbecombinedtoformagraph• IntroducessomefeaturesofCEPlanguages

• Patterndetection

• Designedforperformance• Highthroughput• Lowlatency

Page 38: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

Goals

Batch

Interactive Streaming

One stack to

rule them all?

§ Easy to combine batch, streaming, and interactive computations§ Easy to develop sophisticated algorithms§ Compatible with existing open source ecosystem (Hadoop/HDFS)

Page 39: CTBD X preparation - GitHub Pages · •Hive, HBase, Yarn •Futures, Promises, Actors •Spark •Spark streaming 2 Papers •MapReduce •GFS •Spark Exercises •MapReduce •Futures,

39