writing your own rdd for fun and profit
TRANSCRIPT
![Page 1: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/1.jpg)
Writing your own RDD for fun and profit
by Paweł Szulc @rabbitonweb
![Page 2: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/2.jpg)
Writing my own RDD? What for?
● To write your own RDD, you need to understand to some extent internal mechanics of Apache Spark
● Writing your own RDD will prove you understand them well● When connecting to external storage, it is reasonable to
create your own RDD for it
![Page 3: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/3.jpg)
Outline
1. The Recap
![Page 4: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/4.jpg)
Outline
1. The Recap2. The Internals
![Page 5: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/5.jpg)
Outline
1. The Recap2. The Internals3. The Fun & Profit
![Page 6: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/6.jpg)
Part I - The Recap
![Page 7: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/7.jpg)
RDD - the definition
![Page 8: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/8.jpg)
RDD - the definition
RDD stands for resilient distributed dataset
![Page 9: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/9.jpg)
RDD - the definition
RDD stands for resilient distributed dataset
Dataset - initial data comes from some distributed storage
![Page 10: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/10.jpg)
RDD - the definition
RDD stands for resilient distributed dataset
Distributed - stored in nodes among the cluster
Dataset - initial data comes from some distributed storage
![Page 11: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/11.jpg)
RDD - the definition
RDD stands for resilient distributed dataset
Resilient - if data is lost, data can be recreated
Distributed - stored in nodes among the cluster
Dataset - initial data comes from some distributed storage
![Page 12: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/12.jpg)
RDD - example
![Page 13: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/13.jpg)
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
![Page 14: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/14.jpg)
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop DistributedFile System
![Page 15: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/15.jpg)
RDD - example
val logs = sc.textFile("hdfs://logs.txt")
From Hadoop DistributedFile SystemThis is the RDD
![Page 16: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/16.jpg)
RDD - example
val numbers = sc.parallelize(List(1, 2, 3, 4))
Programmatically from a collection of elementsThis is the RDD
![Page 17: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/17.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
![Page 18: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/18.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
![Page 19: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/19.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
Creates a new RDD
![Page 20: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/20.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
![Page 21: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/21.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
And yet another RDD
![Page 22: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/22.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
And yet another RDDPerformance Alert?!?!
![Page 23: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/23.jpg)
RDD - Operations
1. Transformationsa. Mapb. Filterc. FlatMapd. Samplee. Unionf. Intersectg. Distincth. GroupByKeyi. ….
2. Actionsa. Reduceb. Collectc. Countd. Firste. Take(n)f. TakeSampleg. SaveAsTextFileh. ….
![Page 24: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/24.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
![Page 25: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/25.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
val numberOfErrors = errors.count
![Page 26: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/26.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
val numberOfErrors = errors.count
This will trigger the computation
![Page 27: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/27.jpg)
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains(“error”))
val numberOfErrors = errors.count
This will the calculated value (Int)
This will trigger the computation
![Page 28: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/28.jpg)
Partitions?
![Page 29: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/29.jpg)
Partitions?
A partition represents subset of data within your distributed collection.
![Page 30: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/30.jpg)
Partitions?
A partition represents subset of data within your distributed collection.
Number of partitions tightly coupled with level of parallelism.
![Page 31: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/31.jpg)
Partitions evaluationval counted = sc.textFile(..).count
![Page 32: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/32.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 33: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/33.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 34: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/34.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 35: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/35.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 36: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/36.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 37: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/37.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 38: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/38.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 39: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/39.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 40: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/40.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 41: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/41.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 42: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/42.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 43: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/43.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 44: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/44.jpg)
Partitions evaluationval counted = sc.textFile(..).count
node 1
node 2
node 3
![Page 45: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/45.jpg)
Pipeline
![Page 46: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/46.jpg)
Pipelinemap
![Page 47: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/47.jpg)
Pipelinemap count
![Page 48: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/48.jpg)
Pipelinemap count
task
![Page 49: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/49.jpg)
Pipelinemap count
task
![Page 50: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/50.jpg)
Pipelinemap count
task
![Page 51: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/51.jpg)
But what if...val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 52: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/52.jpg)
But what if...val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 53: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/53.jpg)
But what if...filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 54: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/54.jpg)
And now what?filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 55: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/55.jpg)
And now what?filter mapValues
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 56: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/56.jpg)
And now what?filter
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 57: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/57.jpg)
Shufflingfilter groupBy
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 58: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/58.jpg)
Shufflingfilter mapValuesgroupBy
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
![Page 59: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/59.jpg)
Shufflingfilter reduceByKeygroupBy
val startings = allShakespeare
.filter(_.trim != "")
.groupBy(_.charAt(0))
.mapValues(_.size)
.reduceByKey {
case (acc, length) =>
acc + length
}
mapValues
![Page 60: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/60.jpg)
Shufflingfilter reduceByKeygroupBy mapValues
![Page 61: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/61.jpg)
Shufflingfilter reduceByKey
task
groupBy mapValues
![Page 62: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/62.jpg)
Shufflingfilter reduceByKey
task
groupBy mapValues
![Page 63: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/63.jpg)
Shufflingfilter reduceByKey
task
groupBy mapValues
![Page 64: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/64.jpg)
Shufflingfilter reduceByKey
task
Wait for calculations on all partitions before moving on
groupBy mapValues
![Page 65: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/65.jpg)
Shufflingfilter reduceByKey
task
groupBy mapValues
![Page 66: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/66.jpg)
Shufflingfilter reduceByKey
task
groupBy
Data flying around through cluster
mapValues
![Page 67: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/67.jpg)
Shufflingfilter reduceByKey
task
groupBy mapValues
![Page 68: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/68.jpg)
Shufflingfilter reduceByKey
task taskgroupBy mapValues
![Page 69: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/69.jpg)
Shufflingfilter reduceByKeygroupBy mapValues
![Page 70: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/70.jpg)
stage1
Stagefilter reduceByKeygroupBy mapValues
![Page 71: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/71.jpg)
sda
stage2stage1
Stagefilter reduceByKeygroupBy mapValues
![Page 72: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/72.jpg)
Part II - The Internals
![Page 73: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/73.jpg)
What is a RDD?
![Page 74: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/74.jpg)
What is a RDD?
Resilient Distributed Dataset
![Page 75: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/75.jpg)
What is a RDD?
Resilient Distributed Dataset
![Page 76: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/76.jpg)
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is a RDD?
![Page 77: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/77.jpg)
node 1
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
What is a RDD?
![Page 78: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/78.jpg)
node 1
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
node 2 node 3
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
...10 10/05/2015 10:14:01 UserInitialized Ania Nowak10 10/05/2015 10:14:55 FirstNameChanged Anna12 10/05/2015 10:17:03 UserLoggedIn12 10/05/2015 10:21:31 UserLoggedOut …198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
What is a RDD?
![Page 79: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/79.jpg)
What is a RDD?
![Page 80: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/80.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
![Page 81: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/81.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent
![Page 82: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/82.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned
![Page 83: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/83.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how to evaluate its internal data
![Page 84: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/84.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how to evaluate its internal data
![Page 85: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/85.jpg)
What is a partition?
A partition represents subset of data within your distributed collection.
![Page 86: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/86.jpg)
What is a partition?
A partition represents subset of data within your distributed collection.
override def getPartitions: Array[Partition] = ???
![Page 87: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/87.jpg)
What is a partition?
A partition represents subset of data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on type of the RDD
![Page 88: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/88.jpg)
example: HadoopRDD
val journal = sc.textFile(“hdfs://journal/*”)
![Page 89: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/89.jpg)
example: HadoopRDD
val journal = sc.textFile(“hdfs://journal/*”)
How HadoopRDD is partitioned?
![Page 90: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/90.jpg)
example: HadoopRDD
val journal = sc.textFile(“hdfs://journal/*”)
How HadoopRDD is partitioned?
In HadoopRDD partition is exactly the same as file chunks in HDFS
![Page 91: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/91.jpg)
example: HadoopRDD
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 92: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/92.jpg)
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 93: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/93.jpg)
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 94: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/94.jpg)
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 95: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/95.jpg)
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 96: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/96.jpg)
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 97: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/97.jpg)
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {...override def getPartitions: Array[Partition] = { val jobConf = getJobConf()
SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array
}
![Page 98: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/98.jpg)
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {...override def getPartitions: Array[Partition] = { val jobConf = getJobConf()
SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array
}
![Page 99: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/99.jpg)
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {...override def getPartitions: Array[Partition] = { val jobConf = getJobConf()
SparkHadoopUtil.get.addCredentials(jobConf) val inputFormat = getInputFormat(jobConf) if (inputFormat.isInstanceOf[Configurable]) { inputFormat.asInstanceOf[Configurable].setConf(jobConf) } val inputSplits = inputFormat.getSplits(jobConf, minPartitions) val array = new Array[Partition](inputSplits.size) for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) } array
}
![Page 100: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/100.jpg)
example: MapPartitionsRDD
val journal = sc.textFile(“hdfs://journal/*”)
val fromMarch = journal.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
![Page 101: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/101.jpg)
example: MapPartitionsRDD
val journal = sc.textFile(“hdfs://journal/*”)
val fromMarch = journal.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
How MapPartitionsRDD is partitioned?
![Page 102: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/102.jpg)
example: MapPartitionsRDD
val journal = sc.textFile(“hdfs://journal/*”)
val fromMarch = journal.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
How MapPartitionsRDD is partitioned?
MapPartitionsRDD inherits partition information from its parent RDD
![Page 103: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/103.jpg)
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
...
override def getPartitions: Array[Partition] = firstParent[T].partitions
![Page 104: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/104.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how to evaluate its internal data
![Page 105: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/105.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how to evaluate its internal data
![Page 106: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/106.jpg)
RDD parent
sc.textFile(“hdfs://journal/*”)
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
![Page 107: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/107.jpg)
RDD parent
sc.textFile(“hdfs://journal/*”)
.groupBy(extractDate _)
.map { case (date, events) => (date, events.size) }
.filter {
case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
.take(300)
.foreach(println)
![Page 108: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/108.jpg)
RDD parent
sc.textFile()
.groupBy()
.map { }
.filter {
}
.take()
.foreach()
![Page 109: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/109.jpg)
Directed acyclic graphsc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 110: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/110.jpg)
Directed acyclic graph
HadoopRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 111: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/111.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 112: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/112.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 113: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/113.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 114: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/114.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 115: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/115.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Two types of parent dependencies:
1. narrow dependency2. wider dependency
![Page 116: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/116.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Two types of parent dependencies:
1. narrow dependency2. wider dependency
![Page 117: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/117.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
Two types of parent dependencies:
1. narrow dependency2. wider dependency
![Page 118: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/118.jpg)
Directed acyclic graph
HadoopRDD
ShuffeledRDD MapPartRDD MapPartRDD
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 119: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/119.jpg)
Directed acyclic graphsc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 120: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/120.jpg)
Directed acyclic graphsc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 121: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/121.jpg)
Stage 1Stage 2
Directed acyclic graphsc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
![Page 122: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/122.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data
![Page 123: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/123.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data
![Page 124: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/124.jpg)
Stage 1Stage 2
Running Job aka materializing DAGsc.textFile() .groupBy() .map { } .filter { }
![Page 125: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/125.jpg)
Stage 1Stage 2
Running Job aka materializing DAGsc.textFile() .groupBy() .map { } .filter { } .collect()
![Page 126: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/126.jpg)
Stage 1Stage 2
Running Job aka materializing DAGsc.textFile() .groupBy() .map { } .filter { } .collect()
action
![Page 127: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/127.jpg)
Stage 1Stage 2
Running Job aka materializing DAGsc.textFile() .groupBy() .map { } .filter { } .collect()
action
Actions are implemented using sc.runJob method
![Page 128: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/128.jpg)
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
): Array[U]
![Page 129: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/129.jpg)
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
): Array[U]
![Page 130: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/130.jpg)
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
![Page 131: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/131.jpg)
Running Job aka materializing DAG
/**
* Return an array that contains all of the elements in this RDD.
*/
def collect(): Array[T] = {
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
![Page 132: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/132.jpg)
Running Job aka materializing DAG
/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
![Page 133: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/133.jpg)
Multiple jobs for single action
/*** Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.*/def take(num: Int): Array[T] = { while (buf.size < num && partsScanned < totalParts) { (….) val left = num - buf.size val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true) (….) res.foreach(buf ++= _.take(num - buf.size)) partsScanned += numPartsToTry (….) } buf.toArray }
![Page 134: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/134.jpg)
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
![Page 135: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/135.jpg)
Running Job aka materializing DAG
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*/
def runJob[T, U](
rdd: RDD[T],
func: Iterator[T] => U,
): Array[U]
![Page 136: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/136.jpg)
Running Job aka materializing DAG
/** * :: DeveloperApi :: * Implemented by subclasses to compute a given partition. */@DeveloperApidef compute(split: Partition, context: TaskContext): Iterator[T]
![Page 137: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/137.jpg)
What is a RDD?
RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data
![Page 138: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/138.jpg)
What is a RDD?
RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data4. data locality5. paritioner
![Page 139: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/139.jpg)
What is a RDD?
RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data4. data locality5. paritioner
![Page 140: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/140.jpg)
Data Locality: HDFS example
node 1
10 10/05/2015 10:14:01 UserInit3 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo4 10/05/2015 10:21:31 UserLo5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit20 10/05/2015 10:14:55 FirstNa42 10/05/2015 10:17:03 UserLo67 10/05/2015 10:21:31 UserLo12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit10 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo12 10/05/2015 10:21:31 UserLo198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit4 10/05/2015 10:14:55 FirstNa12 10/05/2015 10:17:03 UserLo142 10/05/2015 10:21:31 UserLo158 13/05/2015 21:10:11 UserIni
![Page 141: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/141.jpg)
What is a RDD?
RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data4. data locality5. paritioner
![Page 142: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/142.jpg)
What is a RDD?
RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to his parent2. how its internal data is partitioned3. how evaluate its internal data4. data locality5. paritioner
![Page 143: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/143.jpg)
Spark performance - shuffle optimization
![Page 144: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/144.jpg)
Spark performance - shuffle optimization
join
![Page 145: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/145.jpg)
Spark performance - shuffle optimization
join
![Page 146: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/146.jpg)
Spark performance - shuffle optimization
map groupBy
![Page 147: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/147.jpg)
Spark performance - shuffle optimization
map groupBy
![Page 148: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/148.jpg)
Spark performance - shuffle optimization
map groupBy join
![Page 149: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/149.jpg)
Spark performance - shuffle optimization
map groupBy join
![Page 150: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/150.jpg)
Spark performance - shuffle optimization
map groupBy join
Optimization: shuffle avoided if data is already partitioned
![Page 151: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/151.jpg)
Spark performance - shuffle optimization
map groupBy
![Page 152: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/152.jpg)
Spark performance - shuffle optimization
map groupBy map
![Page 153: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/153.jpg)
Spark performance - shuffle optimization
map groupBy map
![Page 154: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/154.jpg)
Spark performance - shuffle optimization
map groupBy map join
![Page 155: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/155.jpg)
Spark performance - shuffle optimization
map groupBy map join
![Page 156: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/156.jpg)
Spark performance - shuffle optimization
map groupBy mapValues
![Page 157: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/157.jpg)
Spark performance - shuffle optimization
map groupBy mapValues
![Page 158: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/158.jpg)
Spark performance - shuffle optimization
map groupBy mapValues join
![Page 159: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/159.jpg)
Spark performance - shuffle optimization
map groupBy mapValues join
![Page 160: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/160.jpg)
Part III - The Fun & Profit
![Page 162: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/162.jpg)
RandomRDD
![Page 163: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/163.jpg)
RandomRDD
sc.random()
.take(3)
.foreach(println)
![Page 164: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/164.jpg)
RandomRDD
sc.random()
.take(3)
.foreach(println)
210
-321
21312
![Page 165: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/165.jpg)
RandomRDD
sc.random()
.take(3)
.foreach(println)
![Page 166: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/166.jpg)
RandomRDD
sc.random()
.take(3)
.foreach(println)
sc.random(maxSize = 10, numPartitions = 4)
.take(10)
.foreach(println)
![Page 167: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/167.jpg)
CensorshipRDD
![Page 168: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/168.jpg)
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
![Page 169: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/169.jpg)
CensorshipRDD
val statement =
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor()
.collect().toList.mkString(" ")
println(statement)
![Page 170: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/170.jpg)
CensorshipRDD
![Page 171: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/171.jpg)
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
![Page 172: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/172.jpg)
CensorshipRDD
sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
.censor().collectLegal().foreach(println)
We
all
know that
![Page 173: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/173.jpg)
Fin
![Page 174: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/174.jpg)
Fin
Paweł Szulc
![Page 176: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/176.jpg)
Fin
Paweł Szulc
Twitter: @rabbitonweb
![Page 177: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/177.jpg)
Fin
Paweł Szulc
Twitter: @rabbitonweb
http://rabbitonweb.com
![Page 178: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/178.jpg)
Fin
Paweł Szulc
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
![Page 179: Writing your own RDD for fun and profit](https://reader031.vdocuments.us/reader031/viewer/2022030221/5884a4bb1a28ab76798b4815/html5/thumbnails/179.jpg)
Fin
Paweł Szulc
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
http://bit.do/scalapolis