
Big Data
Fall 2019

Databases CS348
School of Computer Science, University of Waterloo
(material in part by Tamer Özsu)

Big Data Characteristics

Volume
- Almost every dataset we use is high volume, as discussed

Variety
- Text data
- Streaming data (mobile included)
- Graph and RDF data

Velocity
- Streaming data

Veracity
- Data quality & cleaning
- Managing uncertain data
- Search under uncertainty


Global Information Storage Capacity

[Figure omitted from the transcript.]

Platforms for Big Data

Architecture changes
- Dataflow-like processing rather than persistent storage and transactions
- Scalability via partitioning
- Fault tolerance and lineage

MapReduce Basics

For data analysis of very large data sets
- Highly dynamic, irregular, schemaless, etc.
- SQL too heavy

"Embarrassingly parallel problems"
- New, simple parallel programming model

Data structured as (key, value) pairs
- E.g., (doc-id, content), (word, count), etc.

Functional programming style with two functions to be given:
- Map(k1, v1) → list(k2, v2)
- Reduce(k2, list(v2)) → list(v3)

Implemented on a distributed file system (e.g., Google File System) on very large clusters
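The two user-supplied functions fully determine a job. As a minimal single-machine sketch of the model in Scala (an assumed simplification: the names mapReduce, mapFn, and reduceFn are ours, and a real framework distributes each phase over a cluster and a DFS):

// Local stand-in for the MapReduce model on this slide.
def mapReduce[K1, V1, K2, V2, V3](
    input: Seq[(K1, V1)],
    mapFn: (K1, V1) => Seq[(K2, V2)],
    reduceFn: (K2, Seq[V2]) => Seq[V3]): Seq[V3] =
  input
    .flatMap { case (k1, v1) => mapFn(k1, v1) } // Map(k1, v1) -> list(k2, v2)
    .groupBy(_._1)                              // group intermediate values by k2
    .toSeq
    .flatMap { case (k2, kvs) =>                // Reduce(k2, list(v2)) -> list(v3)
      reduceFn(k2, kvs.map(_._2))
    }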

Map Function

User-defined function
- Processes input key/value pairs
- Produces a set of intermediate key/value pairs

Map function I/O
- Input: read a chunk from the distributed file system (DFS)
- Output: write to an intermediate file on local disk

MapReduce library
- Executes the map function
- Groups together all intermediate values with the same key (i.e., generates a set of lists)
- Passes these lists to the reduce functions

Effect of the map function
- Processes and partitions the input data
- Builds a distributed map (transparent to the user)
- Similar to the "group by" operation in SQL

Reduce Function

User-defined function
- Accepts one intermediate key and a set of values for that key (i.e., a list)
- Merges these values together to form a (possibly) smaller set
- Typically, zero or one output value is generated per invocation

Reduce function I/O
- Input: read from intermediate files using remote reads on the local files of the corresponding mapper nodes
- Output: each reducer writes its output as a file back to the DFS

Effect of the reduce function
- Similar to the aggregation operation in SQL

MapReduce Example

function map(String document):
    // document: document contents
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator pCounts):
    // word: a word
    // pCounts: a list of aggregated partial counts
    sum = 0
    for each pc in pCounts:
        sum += pc
    emit(word, sum)
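The pseudocode above translates directly into the local mapReduce sketch from the MapReduce Basics slide; the following Scala snippet is an illustrative stand-in (docs is hypothetical in-memory input, not data on a DFS):

// Word count via the mapReduce sketch above (hypothetical data).
val docs = Seq("d1" -> "to be or not to be", "d2" -> "to do")
val counts = mapReduce[String, String, String, Int, (String, Int)](
  docs,
  (_, content) => content.split("\\s+").toSeq.map(w => (w, 1)), // emit (w, 1)
  (word, partials) => Seq(word -> partials.sum))                // sum partial counts
// counts contains ("to", 3), ("be", 2), ("or", 1), ("not", 1), ("do", 1)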

MapReduce Framework

The MapReduce system:

1. Prepare the Map() input: designate Map processors, assign the input key K1 to them, and provide them with all the input data for K1.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key, generating output organized by key K2.
3. "Shuffle" the Map output to the Reduce processors: designate Reduce processors, assign the K2 key to them, and provide all the Map-generated data associated with that key.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key produced by the Map step.
5. Produce the final output: collect all the Reduce output, and sort it by K2 to produce the final outcome.
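Read against the local sketches above, the five steps line up as follows (illustrative only; in a real deployment each phase runs on many machines):

// 1. Prepare the Map() input: docs plays the role of the partitioned input.
// 2. Run Map() once per input pair.
// 3. Shuffle: the groupBy inside mapReduce routes every K2 to one reducer.
// 4. Run Reduce() once per K2 key.
// 5. Final output: the returned sequence collects all reducer output.
val out = mapReduce[String, String, String, Int, (String, Int)](
  docs,
  (_, text) => text.split("\\s+").toSeq.map(w => (w, 1)),
  (w, ones) => Seq(w -> ones.sum))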

MapReduce Processing

[Figure: the input dataset is split across several Map tasks; each emits intermediate (k1, v) and (k2, v) pairs. A "group by k" step collects the values per key, e.g. (k1, (v, v, v)) and (k2, (v, v, v, v)), and feeds each list to a Reduce task, which produces the output dataset.]

MapReduce Architecture

[Figure: a master node runs the scheduler. Each worker running a map process hosts an input module, map module, combine module, and partition module; each worker running a reduce process hosts a group module, reduce module, and output module.]

Hadoop

Most popular MapReduce implementation – developed by Yahoo!

Hadoop protocol stack:

[Figure: 3rd-party analysis tools – R (stats), Mahout (machine learning), ... – sit on top of Hive & HiveQL and HBase, which run over MapReduce and Yarn, over the Hadoop Distributed File System (HDFS), over the data itself.]

Hadoop (cont'd)

Two components
- Processing engine
- HDFS: Hadoop Distributed File System – others possible
- Can be deployed on the same machine or on different machines

Processes
- Job tracker: hosted on the master node; implements the scheduler
- Task tracker: hosted on the worker nodes; accepts tasks from the job tracker and executes them

HDFS
- Name node: stores how data are partitioned, monitors the status of data nodes, and holds the data dictionary
- Data node: stores and manages the data chunks assigned to it

Aggregation

[Figure: MapReduce aggregation over a relation R. Map phase: each mapper extracts the aggregation attribute (Aid) from its tuples, e.g. mapper 1 emits (1, R1) and (2, R2), and mapper 2 emits (1, R3) and (2, R4). The pairs are partitioned by Aid (round robin). Reduce phase: each reducer groups by Aid and applies the aggregation function f to the tuples with the same Aid, producing (1, f(R1, R3)) and (2, f(R2, R4)).]
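A minimal local sketch of this aggregation pattern (Record, f, and the sample tuples are illustrative assumptions):

// Map phase: extract the aggregation attribute Aid as the key.
// Shuffle: partition/group the pairs by Aid.
// Reduce phase: apply the aggregate f to all tuples sharing an Aid.
case class Record(aid: Int, value: String)
def f(vs: Seq[String]): String = vs.mkString("f(", ", ", ")") // placeholder aggregate

val r = Seq(Record(1, "R1"), Record(2, "R2"), Record(1, "R3"), Record(2, "R4"))
val result = r
  .map(t => (t.aid, t.value))
  .groupBy(_._1)
  .map { case (aid, kvs) => (aid, f(kvs.map(_._2))) }
// result: Map(1 -> "f(R1, R3)", 2 -> "f(R2, R4)")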

θ-Join

Baseline implementation of R(A,B) ⋈ S(B,C):

1. Partition R and assign each partition to mappers
2. Each mapper takes 〈a, b〉 tuples and converts them to a list of key/value pairs of the form (b, 〈a, R〉)
3. Each reducer pulls the pairs with the same key
4. Each reducer joins tuples of R with tuples of S
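A minimal local sketch of this baseline (repartition) join, assuming illustrative data; Left/Right tags stand in for the relation tag in (b, 〈a, R〉):

// Tag each tuple with its relation and re-key it on the join attribute B;
// each "reducer" then pairs up the R- and S-tuples that share a key.
val rTuples = Seq(("a1", 1), ("a2", 2))            // R(A, B)
val sTuples = Seq((1, "c1"), (2, "c2"), (2, "c3")) // S(B, C)

val tagged: Seq[(Int, Either[String, String])] =
  rTuples.map { case (a, b) => (b, Left(a)) } ++   // (b, <a, R>)
  sTuples.map { case (b, c) => (b, Right(c)) }     // (b, <c, S>)

val joined = tagged.groupBy(_._1).toSeq.flatMap { case (b, kvs) =>
  val as = kvs.collect { case (_, Left(a)) => a }
  val cs = kvs.collect { case (_, Right(c)) => c }
  for (a <- as; c <- cs) yield (a, b, c)           // R ⋈ S on B
}
// joined: ("a1", 1, "c1"), ("a2", 2, "c2"), ("a2", 2, "c3")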

Spark System

MapReduce does not perform well in iterative computations
- Workflow model is acyclic
- Have to write to HDFS after each iteration and read from HDFS at the beginning of the next iteration

Spark objectives
- Better support for iterative programs
- Provide a complete ecosystem
- Similar abstraction (to MapReduce) for programming
- Maintain MapReduce fault tolerance and scalability

Fundamental concepts
- RDD: Resilient Distributed Datasets
- Caching of the working set
- Maintaining lineage for fault tolerance


Example – Log Mining

Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns.

[Figure: the driver ships tasks to three workers, each holding one block of the file and a local cache; results flow back to the driver.]

lines = spark.textFile("hdfs://...")           // create RDD
errors = lines.filter(_.startsWith("ERROR"))   // transform RDD
messages = errors.map(_.split('\t')(2))        // another transform
cachedMsgs = messages.cache()                  // cache results
cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count     // another action; accesses the cache
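On the slide, spark denotes the driver's context object. A hedged setup sketch (in the Spark shell a context is predefined as sc; the names below are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Standalone programs create their own context; `spark` here mirrors
// the handle used on the slide.
val conf  = new SparkConf().setAppName("LogMining").setMaster("local[*]")
val spark = new SparkContext(conf)
val lines = spark.textFile("hdfs://...")   // create the first RDD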


RDD and Processing

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))

[Figure: lines holds the raw records read from HDFS (Error, Warn, and Info messages); errors holds only the Error records; messages holds just the extracted message fields (msg1, msg1, msg3, msg4, msg1). These are not yet generated – the transformations are lazy.]

messages.filter(_.contains("foo")).count

[Figure: the driver submits the action messages.filter(_.contains("foo")).count. Before the action, the command is not yet executed; once it runs, the lines, errors, and messages RDDs are materialized.]

Example Transform Functions

- map(func): Return a new RDD formed by passing each element of the source through the function func
- filter(func): Return a new RDD formed by selecting those elements of the source on which func returns true
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (func should return a Seq, not a single item)
- mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U>
- sample(repl, fraction, seed): Sample a fraction of the data, with or without replacement (set repl accordingly), using the given random-number-generator seed
- union(otherDataset), intersection(otherDataset): Return a new RDD that contains the union/intersection of the elements in the source RDD and the argument
- groupByKey(): Operates on an RDD of (K, V) pairs; returns an RDD of (K, Iterable<V>) pairs
- reduceByKey(func, ...): Operates on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func
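A short Spark shell sketch exercising a few of these transforms (sc and the path placeholder are assumed, as in the earlier examples):

val words  = sc.textFile("hdfs://...").flatMap(_.split("\\s+")) // 0..n outputs per line
val pairs  = words.map(w => (w, 1))                             // (K, V) pairs
val counts = pairs.reduceByKey(_ + _)        // values aggregated per key
val sampled = counts.sample(false, 0.1, 42)  // 10% sample, no replacement, seed 42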

Example Actions

- collect(): Return all the elements of the dataset as an array at the driver program
- count(): Return the number of elements in the dataset
- take(n): Return an array with the first n elements of the dataset
- first(): Return the first element of the dataset (similar to take(1))
- takeSample(repl, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random-number-generator seed
- takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator
- saveAsTextFile(path): Write the elements of the RDD as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system
- countByKey(): Only available on RDDs of type (K, V); returns a hashmap of (K, Int) pairs with the count of each key
- foreach(func): Run the function func on each element of the dataset
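Continuing the sketch above, a few actions that force evaluation (counts is the RDD from the previous snippet):

val n    = counts.count()                  // number of distinct words
val some = counts.take(5)                  // first 5 (K, V) pairs at the driver
val top  = counts.takeOrdered(10)(Ordering.by[(String, Int), Int](-_._2)) // 10 most frequent
counts.saveAsTextFile("hdfs://...")        // write results back to HDFS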

RDD Fault Tolerance

- RDDs maintain lineage information that can be used to reconstruct lost partitions
- Lineage is constructed as an object and stored persistently for recovery

Example

messages = spark.textFile("hdfs://...")        // HDFS file
           .filter(_.startsWith("ERROR"))      // filtered RDD
           .map(_.split('\t')(2))               // mapped RDD

[Lineage: HDFS File -> filter(func = _.contains(...)) -> Filtered RDD -> map(func = _.split(...)) -> Mapped RDD]

Iteration (transitive closure)

// tcr: the closure computed so far (pairs (x, y));
// deltar: pairs derived in the last round; deltal: deltar with the
// components swapped, so the join below matches (x, y) with (y, z).
var tcr = sc.parallelize(tuples, 5)
var deltar = tcr.localCheckpoint()
var deltal = tcr.map(x => (x._2, x._1)).localCheckpoint()

do {
  deltar = deltal.join(tcr)                    // (y, (x, z)) for x -> y, y -> z
                 .map(x => (x._2._1, x._2._2)) // new candidate pairs (x, z)
                 .subtract(tcr)                // keep only genuinely new pairs
                 .coalesce(5, true)
                 .distinct()
                 .localCheckpoint()
  deltal = deltar.map(x => (x._2, x._1)).localCheckpoint()
  tcr = tcr.union(deltar)
} while (deltal.count() > 0)

tcr.foreach { t => println(t) }
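A hypothetical input for the loop above: the edge list of a small chain graph (the name tuples matches the code; the data is illustrative):

// Edges 1->2, 2->3, 3->4; the loop additionally derives
// (1,3), (2,4), and (1,4) before deltal becomes empty.
val tuples = Seq((1, 2), (2, 3), (3, 4))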

Lineage and Stages

Lineage example:

[Screenshot: Spark shell application UI (2.4.4), "Details for Job 9". Status: SUCCEEDED; Completed Stages: 6; Skipped Stages: 45. The DAG visualization shows the lineage of the transitive-closure job as repeated parallelize, map, join, subtract, distinct, and union stages; stages 234-278 are marked "(skipped)" because their results are already available, and only stages 279-284 (union, map, subtract, subtract, distinct, and the final count) actually run, e.g. stage 283 (distinct) ran for 8 s.]

⇒ replaces durability: the system can replay failed stages

Stage example:

[Screenshot: Spark shell application UI (2.4.4), "Details for Stage 282 (Attempt 0)". Total Time Across All Tasks: 14 s; Locality Level Summary: Process local: 2560; Input Size / Records: 4.3 MB / 125250; Shuffle Write: 3.8 MB / 125250. The DAG visualization expands the stage into its chain of parallelize, distinct, union, and subtract RDDs, several of them cached. Summary metrics over the 2560 completed tasks: median duration 5 ms (max 17 ms), median peak execution memory 3.5 KB.]

Page 37: - Fall 2019david/cs348/lect-BIGDATA.pdf · Groups together all intermediate values with the same key (i.e., ... Yarn Data (University of Waterloo) Big Data 12/25. Hadoop (cont’ed)

Lineage and StagesLineage example:

2.4.4 (/) Spark shell application UI

Details for Job 9

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

+details

Status: SUCCEEDEDCompleted Stages: 6Skipped Stages: 45

Event Timeline DAG Visualization

Stage 234 (skipped)

parallelize

map

Stage 235 (skipped)

parallelize

Stage 236 (skipped)

join

map

subtract

Stage 237 (skipped)

parallelize

subtract

Stage 238 (skipped)

subtract

distinct

Stage 239 (skipped)

map

checkpoint

Stage 240 (skipped)

parallelize

distinct

union

Stage 241 (skipped)

join

map

subtract

Stage 242 (skipped)

parallelize

distinct

union

subtract

Stage 243 (skipped)

subtract

distinct

Stage 244 (skipped)

parallelize

distinct

union

distinct

union

subtract

Stage 245 (skipped)

map

checkpoint

Stage 246 (skipped)

parallelize

distinct

union

distinct

union

Stage 247 (skipped)

join

map

subtract

Stage 248 (skipped)

subtract

distinct

Stage 249 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

subtract

Stage 250 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

Stage 251 (skipped)

map

checkpoint

Stage 252 (skipped)

join

map

subtract

Stage 253 (skipped)

subtract

distinct

Stage 254 (skipped)

map

checkpoint

Stage 255 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

Stage 256 (skipped)

join

map

subtract

Stage 257 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 258 (skipped)

subtract

distinct

Stage 259 (skipped)

map

checkpoint

Stage 260 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

Stage 261 (skipped)

join

map

subtract

Stage 262 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 263 (skipped)

subtract

distinct

Stage 264 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

Stage 265 (skipped)

map

checkpoint

Stage 266 (skipped)

join

map

subtract

Stage 267 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 268 (skipped)

subtract

distinct

Stage 269 (skipped)

map

checkpoint

Stage 270 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

Stage 271 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 272 (skipped)

join

map

subtract

Stage 273 (skipped)

subtract

distinct

Stage 274 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

Stage 275 (skipped)

map

checkpoint

Stage 276 (skipped)

join

map

subtract

Stage 277 (skipped)

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 278 (skipped)

subtract

distinct

Stage 279

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

Stage 280

map

checkpoint

Stage 281

join

map

subtract

Stage 282

parallelize

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

distinct

union

subtract

Stage 283

subtract

distinct

Stage 284

distinct

map

Completed Stages (6)

Stage Id  ▾ (/jobs/job/?id=9&completedStage.sort=Stage+Id&completedStage.desc=false&completedStage.pageSize=100#completed) Description (/jobs/job/?id=9&completedStage.sort=Description&completedStage.pageSize=100#completed) Submitted (/jobs/job/?id=9&completedStage.sort=Submitted&completedStage.pageSize=100#completed) Duration (/jobs/job/?id=9&completedStage.sort=Duration&completedStage.pageSize=100#completed) Tasks: Succeeded/Total Input (/jobs/job/?id=9&completedStage.sort=Input&completedStage.pageSize=100#completed) Output (/jobs/job/?id=9&completedStage.sort=Output&completedStage.pageSize=100#completed) Shuffle Read (/jobs/job/?id=9&completedStage.sort=Shuffle+Read&completedStage.pageSize=100#completed) Shuffle Write (/jobs/job/?id=9&completedStage.sort=Shuffle+Write&completedStage.pageSize=100#completed)

284 count at <console>:30 (/stages/stage/?id=284&attempt=0) 2019/11/01 17:11:17 0.6 s

283 distinct at <console>:30 (/stages/stage/?id=283&attempt=0) 2019/11/01 17:11:09 8 s 112.7 MB

282 subtract at <console>:30 (/stages/stage/?id=282&attempt=0) 2019/11/01 17:11:00 7 s 4.3 MB 3.8 MB

281 subtract at <console>:30 (/stages/stage/?id=281&attempt=0) 2019/11/01 17:11:02 6 s 17.8 MB 108.9 MB

280 map at <console>:30 (/stages/stage/?id=280&attempt=0) 2019/11/01 17:11:00 0.7 s 1073.4 KB 3.5 MB

279 union at <console>:30 (/stages/stage/?id=279&attempt=0) 2019/11/01 17:11:00 2 s 4.3 MB 14.3 MB

Skipped Stages (45)

Stage Id  ▾ (/jobs/job/?id=9&pendingStage.sort=Stage+Id&pendingStage.desc=false&pendingStage.pageSize=100#pending) Description (/jobs/job/?id=9&pendingStage.sort=Description&pendingStage.pageSize=100#pending) Submitted (/jobs/job/?id=9&pendingStage.sort=Submitted&pendingStage.pageSize=100#pending) Duration (/jobs/job/?id=9&pendingStage.sort=Duration&pendingStage.pageSize=100#pending) Tasks: Succeeded/Total Input (/jobs/job/?id=9&pendingStage.sort=Input&pendingStage.pageSize=100#pending) Output (/jobs/job/?id=9&pendingStage.sort=Output&pendingStage.pageSize=100#pending) Shuffle Read (/jobs/job/?id=9&pendingStage.sort=Shuffle+Read&pendingStage.pageSize=100#pending) Shuffle Write (/jobs/job/?id=9&pendingStage.sort=Shuffle+Write&pendingStage.pageSize=100#pending)

278 distinct at <console>:30 (/stages/stage/?id=278&attempt=0) Unknown Unknown

277 subtract at <console>:30 (/stages/stage/?id=277&attempt=0) Unknown Unknown

276 subtract at <console>:30 (/stages/stage/?id=276&attempt=0) Unknown Unknown

[Spark UI screenshot: /stages/ page listing stages 234–275 (parallelize, map, union, distinct, subtract at <console>), with per-stage succeeded/total task counts ranging from 2560/2560 down to 0/5.]

⇒ replaces durability: system can replay failed stages
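A minimal spark-shell sketch of this idea (Scala; Spark 2.4.x, the shell's predefined sc handle, and the data are all assumptions), building a lineage like the one in the stage example below and inspecting it:

// Each RDD records the transformations needed to recompute its partitions,
// so a failed stage is simply re-run -- no durable intermediate storage.
val base  = sc.parallelize(1 to 500)                 // ParallelCollectionRDD
val round = base.map(_ * 2).union(base).distinct()   // map/union/distinct lineage
val diff  = round.subtract(base)                     // subtract, as in the DAG below

// toDebugString prints the recorded lineage graph; if a partition of diff is
// lost, Spark replays only the stages needed to rebuild it.
println(diff.toDebugString)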

Stage example (Spark 2.4.4)

[Spark UI screenshot: "Details for Stage 282 (Attempt 0)" from the Spark shell application UI — DAG of parallelize followed by repeated distinct/union rounds ending in subtract, with intermediate MapPartitionsRDDs cached; 2560 tasks, all PROCESS_LOCAL on the driver; total time across all tasks 14 s; input 4.3 MB / 125250 records; shuffle write 3.8 MB / 125250 records; per-task durations 0–17 ms.]

(University of Waterloo) Big Data 23 / 25


New SPARK

DataFrame/Dataset
⇒ tabular abstraction on top of RDDs
⇒ rudimentary column types
⇒ better properties when written to disk

SQL (on SPARK)
⇒ on-the-fly query optimizer (Catalyst)
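A hedged spark-shell sketch of both points (the Sale schema, values, and view name are hypothetical; Spark 2.x with the shell's predefined spark session is assumed):

// DataFrames give a tabular, column-typed view layered on RDDs; SQL over a
// registered view is planned by the Catalyst optimizer before execution.
case class Sale(region: String, amount: Double)
val df = Seq(Sale("east", 10.0), Sale("west", 4.5), Sale("east", 7.25)).toDF()
df.createOrReplaceTempView("sales")

val totals = spark.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
totals.explain(true)   // prints the logical, optimized, and physical plans
totals.show()          // executes the optimized plan

explain(true) is what makes the on-the-fly optimization visible: the same query text can yield different physical plans depending on the data.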

(University of Waterloo) Big Data 24 / 25



Take Home

Lots of open issues:

1 Tools/Frameworks that simplify writing of parallel programs
2 Various ways of implementing scheduling/synchronization/failure recovery
3 Taking advantage of parallelism and partitioning (many levels)
4 . . .

(University of Waterloo) Big Data 25 / 25