
Exam Preparation
Guido Salvaneschi

Lecture Material

Lectures
• Intro to distributed systems
• MapReduce
• HDFS
• Hive, HBase, Yarn
• Futures, Promises, Actors
• Spark
• Spark streaming

Papers
• MapReduce
• GFS
• Spark

Exercises
• MapReduce
• Futures, Actors
• Spark

Warning!

• These are just examples of the kind of questions that can appear in the exam.
• They are not supposed to be complete (of course).
• They are not representative of the coverage of the course topics in the exam.
• They do not cover questions about coding (but "simple" exercises provide good examples for that).

Explain 3 reasons that motivate building a system in a distributed way

Why Distributed Systems

• Functional distribution
  • Computers have different functional capabilities (e.g., file server, printer), yet may need to share resources
• Client/server
  • Data gathering / data processing
• Incremental growth
  • Easier to evolve the system
  • Modular expandability
• Inherent distribution in the application domain
  • Banks, reservation services, distributed games, mobile apps
  • Physically or across administrative domains
  • Cash register and inventory systems for supermarket chains
  • Computer-supported collaborative work

Why Distributed Systems

• Economics
  • Collections of microprocessors offer a better price/performance ratio than large mainframes
  • Low price/performance ratio: a cost-effective way to increase computing power
• Better performance
  • Load balancing
  • Replication of processing power
  • A distributed system may have more total computing power than a mainframe. Example: 10,000 CPU chips, each running at 50 MIPS. It is not possible to build a 500,000 MIPS single processor, since that would require a 0.002 ns instruction cycle. Enhanced performance through load distribution.
• Increased reliability
  • Exploit the independent-failures property
  • If one machine crashes, the system as a whole can still survive
• Another driving force: the existence of large numbers of personal computers, and the need for people to collaborate and share information
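The single-processor comparison in the example above is plain arithmetic, easy to verify:

```python
# 10,000 CPU chips, each at 50 MIPS (million instructions per second)
total_mips = 10_000 * 50                   # 500,000 MIPS in total
instructions_per_sec = total_mips * 1e6    # 5e11 instructions per second
# A single processor with the same throughput would need this cycle time:
cycle_time_ns = 1e9 / instructions_per_sec
print(cycle_time_ns)  # 0.002 ns, far beyond what one processor can do
```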

Explain 3 goals (and challenges) of distributed systems

Goals and challenges of distributed systems

• Transparency
  • How to achieve the single-system image
• Performance
  • The system provides high (computing, storage, ...) performance
• Scalability
  • The ability to serve more users and provide acceptable response times with increased amounts of data
• Openness
  • An open distributed system can be extended and improved incrementally
  • Requires publication of component interfaces and standard protocols for accessing interfaces
• Reliability / fault tolerance
  • Maintain availability even when individual components fail
• Heterogeneity
  • Network, hardware, operating system, programming languages, different developers
• Security
  • Confidentiality, integrity, and availability

Which techniques can be used to make a system scalable? Briefly explain them.

Scaling techniques

• Distribution
  • Splitting a resource (such as data) into smaller parts, and spreading the parts across the system (cf. DNS)

Scaling techniques

• Replication
  • Replicate resources (services, data) across the system, so they can be accessed in multiple places
  • Caching to avoid recomputation
  • Increased availability reduces the probability that a bigger system breaks
• Hiding communication latencies
  • Avoid waiting for responses to remote service requests
  • Use asynchronous communication
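The distribution technique can be sketched in a few lines of plain Python: hash-partition records across a fixed set of nodes, so each record lives on exactly one node. All names here are illustrative, not from any real system.

```python
import hashlib

NUM_NODES = 4

def node_for(key: str) -> int:
    """Deterministically map a key to one of NUM_NODES partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# Split a dataset: spread the parts across the system
records = ["alice", "bob", "carol", "dave"]
partitions = {n: [] for n in range(NUM_NODES)}
for r in records:
    partitions[node_for(r)].append(r)
print(partitions)
```

Because the mapping is deterministic, any client can locate a record without a central lookup, which is what makes the technique scale.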

Show the signature of the Map function and the Reduce function in MapReduce. What are the Map phase and the Reduce phase responsible for?

Functional programming "foundations"

• map in MapReduce ↔ map in FP
  • map :: (a → b) → [a] → [b]
  • Example: double all numbers in a list.
    > map ((*) 2) [1,2,3]
    [2,4,6]
• In a purely functional setting, an element of a list being computed by map cannot see the effects of the computations on other elements.
• If the order of application of a function f to the elements of a list is commutative, then we can reorder or parallelize execution.

Note: there is no precise 1-1 correspondence. Please take this just as an analogy.
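The Haskell example above has a direct counterpart in Python, where the same independence of per-element computations holds:

```python
# map applies the function to every element independently,
# which is what makes it safe to reorder or parallelize
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```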

Functional programming "foundations"

• Move over the list, applying f to each element and an accumulator; f returns the next accumulator value, which is combined with the next element.
• reduce in MapReduce ↔ fold in FP
  • foldl :: (b → a → b) → b → [a] → b
  • Example: sum of all numbers in a list.
    > foldl (+) 0 [1,2,3]
    6

Note: there is no precise 1-1 correspondence. Please take this just as an analogy.
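The foldl example above corresponds to Python's functools.reduce:

```python
from functools import reduce

# foldl (+) 0 [1,2,3] in Haskell: fold the list into one value,
# threading an accumulator (here starting at 0) through each element
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6
```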

MapReduce Basic Programming Model

• Transform a set of input key-value pairs to a set of output values:
  • Map: (k1, v1) → list(k2, v2)
  • The MapReduce library groups all intermediate pairs with the same key together.
  • Reduce: (k2, list(v2)) → list(v2)
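The two signatures can be illustrated with the classic word-count example, here simulated in plain Python (no Hadoop involved; the grouping loop stands in for the MapReduce library's shuffle):

```python
from collections import defaultdict

# Map: (k1, v1) -> list(k2, v2); here k1 is a document name, v1 its text
def map_fn(doc_name, text):
    return [(word, 1) for word in text.split()]

# Reduce: (k2, list(v2)) -> list(v2); here it sums the counts per word
def reduce_fn(word, counts):
    return [sum(counts)]

# The library's job: run maps, group intermediate pairs by key, run reduces
def run(documents):
    groups = defaultdict(list)
    for name, text in documents.items():
        for k2, v2 in map_fn(name, text):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}

print(run({"d1": "a b a", "d2": "b"}))  # {'a': [2], 'b': [2]}
```

The Map phase emits intermediate key-value pairs from each input record; the Reduce phase combines all values that share a key into the final output.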

What is the problem with "stragglers" (slow workers), and what can be done to solve it?

Stragglers & Backup Tasks

• Problem: "stragglers" (i.e., slow workers) significantly lengthen the completion time.
• Solution: close to completion, spawn backup copies of the remaining in-progress tasks.
  • Whichever one finishes first "wins".
• Additional cost: a few percent more resource usage.
  • Example: a sort program without backup tasks takes 44% longer.
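The backup-task idea can be sketched with ordinary futures (a toy simulation, not the Hadoop scheduler): launch a duplicate of an in-progress task and take whichever copy finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def task(delay, payload):
    time.sleep(delay)      # simulate work; the straggler sleeps longer
    return sorted(payload)

data = [3, 1, 2]
with ThreadPoolExecutor() as pool:
    straggler = pool.submit(task, 1.0, data)   # slow original attempt
    backup = pool.submit(task, 0.05, data)     # backup copy of the same task
    done, _ = wait([straggler, backup], return_when=FIRST_COMPLETED)
    result = done.pop().result()               # whichever finishes first wins
print(result)  # [1, 2, 3]
```

Since the task is deterministic, running it twice is safe: both copies produce the same result, and only the completion time differs.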

Sketch the GFS architecture, presenting the components that constitute it and the main interactions.

GFS - Overview

Explain what a future is

• Placeholder object for a value that may not yet exist
• The value of the Future is supplied concurrently and can subsequently be used
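A minimal illustration with Python's concurrent.futures (Scala's Future behaves analogously):

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    fut = pool.submit(pow, 2, 10)  # returns immediately: a placeholder object
    # ... the caller can do other work here while pow runs concurrently ...
    print(fut.result())  # blocks until the value exists, then prints 1024
```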

Which underlying data structure is used by Apache Spark? Show a minimal example and indicate where such a data structure is used.

RDD (Resilient Distributed Datasets)

• Restricted form of distributed shared memory
  • Immutable, partitioned collection of records
  • Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
• Efficient fault tolerance using lineage
  • Log coarse-grained operations instead of fine-grained data updates
  • An RDD has enough information about how it is derived from other datasets
  • Recompute lost partitions on failure
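A toy model of lineage-based recovery (pure Python, illustrative names only): instead of checkpointing the derived data, record the coarse-grained transformation and its parent, so a lost partition can be recomputed on demand.

```python
# The lineage: a parent dataset plus the coarse-grained transformation
# that derives the child. That pair is all we need to rebuild lost data.
parent = [1, 2, 3, 4]
transform = lambda p: [x * 10 for x in p]  # applies to a whole partition

derived = transform(parent)   # materialized partition
derived = None                # simulate losing the partition on node failure

# Recovery: replay the logged transformation on the surviving parent data
recovered = transform(parent)
print(recovered)  # [10, 20, 30, 40]
```

This only works because transformations are deterministic and coarse-grained: replaying them is cheap to log and always reproduces the same partition.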

Spark and RDDs

• Implements Resilient Distributed Datasets (RDDs)
• Operations on RDDs
  • Transformations: define a new dataset based on previous ones
  • Actions: start a job to execute on the cluster
• Well-designed interface to represent RDDs
  • Makes it very easy to implement transformations
  • Most Spark transformation implementations are under 20 lines of code

More on RDDs

Work with distributed collections as you would with local ones

• Resilient distributed datasets (RDDs)
  • Immutable collections of objects spread across a cluster
  • Built through parallel transformations (map, filter, etc.)
  • Automatically rebuilt on failure
  • Controllable persistence (e.g., caching in RAM)
    • Different storage levels available; fallback to disk possible
• Operations
  • Transformations (e.g., map, filter, groupBy, join)
    • Lazy operations that build RDDs from other RDDs
  • Actions (e.g., count, collect, save)
    • Return a result or write it to storage

Workflow with RDDs

• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count

distFile = sc.textFile("...", 4)
• RDD distributed in 4 partitions
• Elements are lines of input
• Lazy evaluation: no execution happens now
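The laziness noted above can be mimicked in plain Python (a sketch with an invented ToyRDD class, not Spark's API): transformations only record what to do, and an action triggers the actual pass over the data.

```python
class ToyRDD:
    """Minimal stand-in for an RDD: records transformations, runs them lazily."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    # Transformations: return a new ToyRDD; no computation happens yet
    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.data, self.ops + (("filter", f),))

    # Actions: execute the recorded pipeline now
    def collect(self):
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x > 5)
print(rdd.collect())  # [9, 16], computed only when the action runs
```

Deferring execution this way lets a real engine see the whole pipeline before running it, enabling optimizations such as fusing the map and filter into a single pass.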

Give a possible explanation why the computation of PageRank differs significantly between Hadoop and Spark

[Chart: iteration time (s) vs. number of machines; with 30 and 60 machines, Hadoop takes 171 s and 80 s per iteration, Spark 23 s and 14 s]

Spark

• Fast, expressive cluster computing system compatible with Apache Hadoop
  • Works with any Hadoop-supported storage system (HDFS, S3, Avro, ...)
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell

Up to 100× faster
Often 2-10× less code

PageRank

• Give pages ranks (scores) based on links to them
  • Links from many pages → high rank
  • Link from a high-rank page → high rank
• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark's in-memory caching
  • Multiple iterations over the same data

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
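The iterative structure is easy to see in a small pure-Python version (toy three-page graph; damping factor 0.85 as in the original PageRank formulation):

```python
# Toy link graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}

for _ in range(30):
    # Each iteration re-reads the same link data; this reuse is exactly
    # what Spark's in-memory caching speeds up compared to Hadoop,
    # which re-reads the data from HDFS on every iteration.
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        for target in outlinks:
            contribs[target] += ranks[page] / len(outlinks)
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

print({p: round(r, 3) for p, r in ranks.items()})
```

Note that page C, with two incoming links, ends up ranked above B, which has one.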

What is a resource management system, e.g., Apache YARN?

Resource Management

• Typically implemented by a system deployed across the nodes of a cluster
• Layer below "frameworks" like Hadoop
  • On any node, the system keeps track of availabilities
  • Applications on top use this information, together with estimations of their own requirements, to choose where to deploy something
• RM systems (RMSs) differ in the abstractions/interfaces provided and in the actual scheduling decisions

Given the scenario X, what is the technology/approach that you would recommend for solving problem Y?
• MapReduce
• HDFS
• A database
• HBase
• Apache Spark
• Spark streaming
• ...

MapReduce vs. Traditional RDBMS

             MapReduce                     Traditional RDBMS
Data size    Petabytes                     Gigabytes
Access       Batch                         Interactive and batch
Updates      Write once, read many times   Read and write many times
Structure    Dynamic schema                Static schema
Integrity    Low                           High (normalized data)
Scaling      Linear                        Non-linear (general SQL)

A Summary

[Diagram: systems classified along two axes: programming model (declarative vs. procedural) and data organization (structured, flat, raw types)]

Event-driven applications

• Can we use existing technologies for batch processing?
  • They are not designed to minimize latency
  • We need a whole new model!

Esper in a nutshell

• EPL: a rich language to express rules
• Grounded in the DSMS approach
  • Windowing
  • Relational select, join, aggregate, ...
  • Relation-to-stream operators to produce output
  • Sub-queries
• Queries can be combined to form a graph
• Introduces some features of CEP languages
  • Pattern detection
• Designed for performance
  • High throughput
  • Low latency

Goals

Batch, Interactive, Streaming: one stack to rule them all?

• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open source ecosystem (Hadoop/HDFS)
