
Exam Preparation
Guido Salvaneschi

Lecture Material

Lectures
• Intro to distributed systems
• MapReduce
• HDFS
• Hive, HBase, Yarn
• Futures, Promises, Actors
• Spark
• Spark streaming

Papers
• MapReduce
• GFS
• Spark

Exercises
• MapReduce
• Futures, Actors
• Spark

Warning!

• These are just examples of the kind of questions that can appear in the exam.
• They are not supposed to be complete (of course).
• They are not representative of the coverage of the course topics in the exam.
• They do not cover questions about coding (but "simple" exercises provide good examples for that).

Explain 3 reasons that motivate building a system in a distributed way

Why Distributed Systems

• Functional distribution
  • Computers have different functional capabilities (e.g., file server, printer), yet may need to share resources
• Client/server
  • Data gathering / data processing
• Incremental growth
  • Easier to evolve the system
  • Modular expandability
• Inherent distribution in the application domain
  • Banks, reservation services, distributed games, mobile apps
  • Physically or across administrative domains
  • Cash register and inventory systems for supermarket chains
  • Computer-supported collaborative work

Why Distributed Systems

• Economics
  • Collections of microprocessors offer a better price/performance ratio than large mainframes
  • Low price/performance ratio: a cost-effective way to increase computing power
• Better performance
  • Load balancing
  • Replication of processing power
  • A distributed system may have more total computing power than a mainframe. Example: 10,000 CPU chips, each running at 50 MIPS. It is not possible to build a 500,000 MIPS single processor, since that would require a 0.002 ns instruction cycle. Enhanced performance through load distribution.
• Increased reliability
  • Exploit the independent-failures property
  • If one machine crashes, the system as a whole can still survive
• Another driving force: the existence of large numbers of personal computers, and the need for people to collaborate and share information
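The single-processor comparison in the example above is plain arithmetic, easy to verify:

```python
# 10,000 CPU chips, each at 50 MIPS (million instructions per second)
total_mips = 10_000 * 50                   # 500,000 MIPS in total
instructions_per_sec = total_mips * 1e6    # 5e11 instructions per second
# A single processor with the same throughput would need this cycle time:
cycle_time_ns = 1e9 / instructions_per_sec
print(cycle_time_ns)  # 0.002 ns, far beyond what one processor can do
```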

Explain 3 goals (and challenges) of distributed systems

Goals and challenges of distributed systems

• Transparency
  • How to achieve the single-system image
• Performance
  • The system provides high (computing, storage, ...) performance
• Scalability
  • The ability to serve more users and provide acceptable response times with increased amounts of data
• Openness
  • An open distributed system can be extended and improved incrementally
  • Requires publication of component interfaces and standard protocols for accessing interfaces
• Reliability / fault tolerance
  • Maintain availability even when individual components fail
• Heterogeneity
  • Network, hardware, operating system, programming languages, different developers
• Security
  • Confidentiality, integrity, and availability

Which techniques can be used to make a system scalable? Briefly explain them.

Scaling techniques

• Distribution
  • Splitting a resource (such as data) into smaller parts, and spreading the parts across the system (cf. DNS)

Scaling techniques

• Replication
  • Replicate resources (services, data) across the system, so they can be accessed in multiple places
  • Caching to avoid recomputation
  • Increased availability reduces the probability that a bigger system breaks
• Hiding communication latencies
  • Avoid waiting for responses to remote service requests
  • Use asynchronous communication
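The distribution technique can be sketched in a few lines of plain Python: hash-partition records across a fixed set of nodes, so each record lives on exactly one node. All names here are illustrative, not from any real system.

```python
import hashlib

NUM_NODES = 4

def node_for(key: str) -> int:
    """Deterministically map a key to one of NUM_NODES partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# Split a dataset: spread the parts across the system
records = ["alice", "bob", "carol", "dave"]
partitions = {n: [] for n in range(NUM_NODES)}
for r in records:
    partitions[node_for(r)].append(r)
print(partitions)
```

Because the mapping is deterministic, any client can locate a record without a central lookup, which is what makes the technique scale.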

Show the signature of the Map function and the Reduce function in MapReduce. What are the Map phase and the Reduce phase responsible for?

Functional programming "foundations"

• map in MapReduce ↔ map in FP
  • map :: (a → b) → [a] → [b]
  • Example: double all numbers in a list.
    > map ((*) 2) [1,2,3]
    [2,4,6]
• In a purely functional setting, an element of a list being computed by map cannot see the effects of the computations on other elements.
• If the order of application of a function f to the elements of a list is commutative, then we can reorder or parallelize execution.

Note: there is no precise 1-1 correspondence. Please take this just as an analogy.
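The Haskell example above has a direct counterpart in Python, where the same independence of per-element computations holds:

```python
# map applies the function to every element independently,
# which is what makes it safe to reorder or parallelize
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```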

Functional programming "foundations"

• Move over the list, applying f to each element and an accumulator; f returns the next accumulator value, which is combined with the next element.
• reduce in MapReduce ↔ fold in FP
  • foldl :: (b → a → b) → b → [a] → b
  • Example: sum of all numbers in a list.
    > foldl (+) 0 [1,2,3]
    6

Note: there is no precise 1-1 correspondence. Please take this just as an analogy.
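The foldl example above corresponds to Python's functools.reduce:

```python
from functools import reduce

# foldl (+) 0 [1,2,3] in Haskell: fold the list into one value,
# threading an accumulator (here starting at 0) through each element
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6
```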

MapReduce Basic Programming Model

• Transform a set of input key-value pairs to a set of output values:
  • Map: (k1, v1) → list(k2, v2)
  • The MapReduce library groups all intermediate pairs with the same key together.
  • Reduce: (k2, list(v2)) → list(v2)
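The two signatures can be illustrated with the classic word-count example, here simulated in plain Python (no Hadoop involved; the grouping loop stands in for the MapReduce library's shuffle):

```python
from collections import defaultdict

# Map: (k1, v1) -> list(k2, v2); here k1 is a document name, v1 its text
def map_fn(doc_name, text):
    return [(word, 1) for word in text.split()]

# Reduce: (k2, list(v2)) -> list(v2); here it sums the counts per word
def reduce_fn(word, counts):
    return [sum(counts)]

# The library's job: run maps, group intermediate pairs by key, run reduces
def run(documents):
    groups = defaultdict(list)
    for name, text in documents.items():
        for k2, v2 in map_fn(name, text):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}

print(run({"d1": "a b a", "d2": "b"}))  # {'a': [2], 'b': [2]}
```

The Map phase emits intermediate key-value pairs from each input record; the Reduce phase combines all values that share a key into the final output.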

What is the problem with "stragglers" (slow workers), and what can be done to solve it?

Stragglers & Backup Tasks

• Problem: "stragglers" (i.e., slow workers) significantly lengthen the completion time.
• Solution: close to completion, spawn backup copies of the remaining in-progress tasks.
  • Whichever one finishes first "wins".
• Additional cost: a few percent more resource usage.
  • Example: a sort program without backup tasks takes 44% longer.
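The backup-task idea can be sketched with ordinary futures (a toy simulation, not the Hadoop scheduler): launch a duplicate of an in-progress task and take whichever copy finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def task(delay, payload):
    time.sleep(delay)      # simulate work; the straggler sleeps longer
    return sorted(payload)

data = [3, 1, 2]
with ThreadPoolExecutor() as pool:
    straggler = pool.submit(task, 1.0, data)   # slow original attempt
    backup = pool.submit(task, 0.05, data)     # backup copy of the same task
    done, _ = wait([straggler, backup], return_when=FIRST_COMPLETED)
    result = done.pop().result()               # whichever finishes first wins
print(result)  # [1, 2, 3]
```

Since the task is deterministic, running it twice is safe: both copies produce the same result, and only the completion time differs.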

Sketch the GFS architecture, presenting the components that constitute it and the main interactions.

GFS - Overview

Explain what a future is

• Placeholder object for a value that may not yet exist
• The value of the Future is supplied concurrently and can subsequently be used
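A minimal illustration with Python's concurrent.futures (Scala's Future behaves analogously):

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    fut = pool.submit(pow, 2, 10)  # returns immediately: a placeholder object
    # ... the caller can do other work here while pow runs concurrently ...
    print(fut.result())  # blocks until the value exists, then prints 1024
```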

Which underlying data structure is used by Apache Spark? Show a minimal example and indicate where such a data structure is used.

RDD (Resilient Distributed Datasets)

• Restricted form of distributed shared memory
  • Immutable, partitioned collection of records
  • Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
• Efficient fault tolerance using lineage
  • Log coarse-grained operations instead of fine-grained data updates
  • An RDD has enough information about how it is derived from other datasets
  • Recompute lost partitions on failure
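A toy model of lineage-based recovery (pure Python, illustrative names only): instead of checkpointing the derived data, record the coarse-grained transformation and its parent, so a lost partition can be recomputed on demand.

```python
# The lineage: a parent dataset plus the coarse-grained transformation
# that derives the child. That pair is all we need to rebuild lost data.
parent = [1, 2, 3, 4]
transform = lambda p: [x * 10 for x in p]  # applies to a whole partition

derived = transform(parent)   # materialized partition
derived = None                # simulate losing the partition on node failure

# Recovery: replay the logged transformation on the surviving parent data
recovered = transform(parent)
print(recovered)  # [10, 20, 30, 40]
```

This only works because transformations are deterministic and coarse-grained: replaying them is cheap to log and always reproduces the same partition.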

Spark and RDDs

• Implements Resilient Distributed Datasets (RDDs)
• Operations on RDDs
  • Transformations: define a new dataset based on previous ones
  • Actions: start a job to execute on the cluster
• Well-designed interface to represent RDDs
  • Makes it very easy to implement transformations
  • Most Spark transformation implementations are under 20 lines of code

More on RDDs

Work with distributed collections as you would with local ones

• Resilient distributed datasets (RDDs)
  • Immutable collections of objects spread across a cluster
  • Built through parallel transformations (map, filter, etc.)
  • Automatically rebuilt on failure
  • Controllable persistence (e.g., caching in RAM)
    • Different storage levels available; fallback to disk possible
• Operations
  • Transformations (e.g., map, filter, groupBy, join)
    • Lazy operations that build RDDs from other RDDs
  • Actions (e.g., count, collect, save)
    • Return a result or write it to storage

Workflow with RDDs

• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count

distFile = sc.textFile("...", 4)
• RDD distributed in 4 partitions
• Elements are lines of input
• Lazy evaluation: no execution happens now
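The laziness noted above can be mimicked in plain Python (a sketch with an invented ToyRDD class, not Spark's API): transformations only record what to do, and an action triggers the actual pass over the data.

```python
class ToyRDD:
    """Minimal stand-in for an RDD: records transformations, runs them lazily."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    # Transformations: return a new ToyRDD; no computation happens yet
    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.data, self.ops + (("filter", f),))

    # Actions: execute the recorded pipeline now
    def collect(self):
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x > 5)
print(rdd.collect())  # [9, 16], computed only when the action runs
```

Deferring execution this way lets a real engine see the whole pipeline before running it, enabling optimizations such as fusing the map and filter into a single pass.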

Give a possible explanation why the computation of PageRank differs significantly between Hadoop and Spark

[Chart: iteration time (s) vs. number of machines; with 30 and 60 machines, Hadoop takes 171 s and 80 s per iteration, Spark 23 s and 14 s]

Spark

• Fast, expressive cluster computing system compatible with Apache Hadoop
  • Works with any Hadoop-supported storage system (HDFS, S3, Avro, ...)
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell

Up to 100× faster
Often 2-10× less code

PageRank

• Give pages ranks (scores) based on links to them
  • Links from many pages → high rank
  • Link from a high-rank page → high rank
• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark's in-memory caching
  • Multiple iterations over the same data

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
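The iterative structure is easy to see in a small pure-Python version (toy three-page graph; damping factor 0.85 as in the original PageRank formulation):

```python
# Toy link graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}

for _ in range(30):
    # Each iteration re-reads the same link data; this reuse is exactly
    # what Spark's in-memory caching speeds up compared to Hadoop,
    # which re-reads the data from HDFS on every iteration.
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        for target in outlinks:
            contribs[target] += ranks[page] / len(outlinks)
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

print({p: round(r, 3) for p, r in ranks.items()})
```

Note that page C, with two incoming links, ends up ranked above B, which has one.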

What is a resource management system, e.g., Apache YARN?

Resource Management

• Typically implemented by a system deployed across the nodes of a cluster
• Layer below "frameworks" like Hadoop
  • On any node, the system keeps track of availabilities
  • Applications on top use this information, together with estimations of their own requirements, to choose where to deploy something
• RM systems (RMSs) differ in the abstractions/interfaces provided and in the actual scheduling decisions

Given the scenario X, what is the technology/approach that you would recommend for solving problem Y?
• MapReduce
• HDFS
• A database
• HBase
• Apache Spark
• Spark streaming
• ...

MapReduce vs. Traditional RDBMS

             MapReduce                     Traditional RDBMS
Data size    Petabytes                     Gigabytes
Access       Batch                         Interactive and batch
Updates      Write once, read many times   Read and write many times
Structure    Dynamic schema                Static schema
Integrity    Low                           High (normalized data)
Scaling      Linear                        Non-linear (general SQL)

A Summary

[Diagram: systems classified along two axes: programming model (declarative vs. procedural) and data organization (structured, flat, raw types)]

Event-driven applications

• Can we use existing technologies for batch processing?
  • They are not designed to minimize latency
  • We need a whole new model!

Esper in a nutshell

• EPL: a rich language to express rules
• Grounded in the DSMS approach
  • Windowing
  • Relational select, join, aggregate, ...
  • Relation-to-stream operators to produce output
  • Sub-queries
• Queries can be combined to form a graph
• Introduces some features of CEP languages
  • Pattern detection
• Designed for performance
  • High throughput
  • Low latency

Goals

Batch, Interactive, Streaming: one stack to rule them all?

• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open source ecosystem (Hadoop/HDFS)
