Data-Intensive Distributed Computing
Part 1: MapReduce Algorithm Design (2/4)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 431/631 451/651 (Winter 2019)
Adam Roegiest
Kira Systems
January 10, 2019
These slides are available at http://roegiest.com/bigdata-2019w/
RTFM you must
Announcements
A0 finalized for all sections
CS 431/631 Only: If you want a challenge, you may elect to do all of the CS 451 assignments instead of the six CS 431 assignments.
This is a one-way road. No switching back.
Agenda for Today
Why big data?
Hadoop walkthrough
Why big data?
[Figure: the “big data stack”, with layers labeled Data Science Tools, Analytics Infrastructure, and Execution Infrastructure; a bracket labeled “This Course” marks part of the stack]
Source: Wikipedia (Everest)
Why big data?
Science
Business
Society
Emergence of the 4th Paradigm
Data-intensive e-Science
Maximilien Brice, © CERN
Science
Maximilien Brice, © CERN
Source: Wikipedia (DNA)
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
[Figure: Subject genome (?) → Sequencer → Reads]
Human genome: 3 Gbp
A few billion short reads (~100 GB compressed data)
Business
Data-driven decisions
Data-driven products
Source: Wikipedia (Shinjuku, Tokyo)
An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc.
Business Intelligence
In the 1990s, Wal-Mart found that customers tended to buy diapers and beer together. So they put them next to each other and increased sales of both.*
This is not a new idea!
So what’s changed?
More compute and storage
Ability to gather behavioral data
* BTW, this is completely apocryphal. (But it makes a nice story.)
a useful service
analyze user behavior to extract insights
transform insights into action
$ (hopefully)
Google. Facebook. Twitter. Amazon. Uber.
data science
data products
Virtuous Product Cycle
Source: https://images.lookhuman.com/render/standard/8002245806006052/pillow14in-whi-z1-t-netflixing.png
Source: https://www.reddit.com/r/teslamotors/comments/6gsc6v/i_think_the_neural_net_mining_is_just_starting/ (June 2017)
Source: Wikipedia (Rosetta Stone)
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
No data like more data!
Source: Guardian
Humans as social sensors
Computational social science
Society
Predicting X with Twitter (Paul and Dredze, ICWSM 2011)
Political Mobilization on Facebook (Bond et al., Nature 2012)
2010 US Midterm Elections: 60m users shown “I Voted” Messages
Summary: increased turnout by 60k directly and 280k indirectly
Source: https://www.theverge.com/2017/11/1/16593346/house-russia-facebook-ads
Source: https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters
Source: http://www.cnn.com/2016/12/07/asia/new-zealand-passport-robot-asian-trnd/index.html
The Perils of Big Data
The end of privacy
Who owns your data and can the government access it?
The echo chamber
Are you seeing only what you want to see?
The racist algorithm
Algorithms aren’t racist, people are?
We desperately need “data ethics” to go with big data!
Source: Popular Internet Meme
Source: Google
Tackling Big Data
Logical View
input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map (4 mappers) emits: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine (one per mapper) emits: (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition (one per mapper)
group values by key: a → [1, 5], b → [2, 7], c → [2, 9, 8]
reduce (3 reducers) emits: (r1, s1) (r2, s2) (r3, s3)
* Important detail: reducers process keys in sorted order
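The dataflow in the logical view can be traced in a single process. The sketch below is only an illustration under stated assumptions: it uses the map outputs as read from the figure, assumes both combine and reduce are sums, and assumes hash partitioning over three reducers. `LogicalViewSketch`, `Pair`, and `run` are hypothetical names, not Hadoop classes, and the code assumes Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process sketch of map -> combine -> partition -> group by key -> reduce. */
final class LogicalViewSketch {

    /** Hypothetical (key, value) pair emitted by a mapper or combiner. */
    static final class Pair {
        final String key;
        final int value;
        Pair(String key, int value) { this.key = key; this.value = value; }
    }

    /** Map outputs of the four mappers, as read from the figure. */
    static List<List<Pair>> figureMapOutputs() {
        return List.of(
                List.of(new Pair("a", 1), new Pair("b", 2)),
                List.of(new Pair("c", 3), new Pair("c", 6)),
                List.of(new Pair("a", 5), new Pair("c", 2)),
                List.of(new Pair("b", 7), new Pair("c", 8)));
    }

    /** Combine: pre-sum the pairs of ONE mapper, so (c, 3) (c, 6) becomes (c, 9). */
    static List<Pair> combine(List<Pair> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Pair p : mapOutput) {
            sums.merge(p.key, p.value, Integer::sum);
        }
        List<Pair> combined = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            combined.add(new Pair(e.getKey(), e.getValue()));
        }
        return combined;
    }

    /** Partition: pick the reducer for a key (hash partitioning is an assumption). */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs combine, partition, group-by-key, and a summing reduce; returns key -> total. */
    static Map<String, Integer> run(List<List<Pair>> mapOutputs, int numReducers) {
        // One sorted map per reducer: key -> values grouped for that key.
        List<TreeMap<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            reducerInputs.add(new TreeMap<>());
        }
        for (List<Pair> mapOutput : mapOutputs) {
            for (Pair p : combine(mapOutput)) {
                reducerInputs.get(partition(p.key, numReducers))
                        .computeIfAbsent(p.key, k -> new ArrayList<>())
                        .add(p.value);
            }
        }
        // Each reducer visits its keys in sorted order and sums the grouped values.
        Map<String, Integer> totals = new TreeMap<>();
        for (TreeMap<String, List<Integer>> input : reducerInputs) {
            for (Map.Entry<String, List<Integer>> group : input.entrySet()) {
                int sum = 0;
                for (int v : group.getValue()) {
                    sum += v;
                }
                totals.put(group.getKey(), sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(run(figureMapOutputs(), 3)); // prints {a=6, b=9, c=19}
    }
}
```

Because the combiner is a sum, the totals are the same whether or not it runs; it only reduces how many pairs cross the partition step.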
Physical View
[Figure: MapReduce execution overview]
User Program: (1) submit to Master
Master: (2) schedule map and (2) schedule reduce on workers
Map phase: workers (3) read the input files (split 0 to split 4) and (4) write intermediate files to local disk
Reduce phase: workers (5) remote-read the intermediate files and (6) write the output files (output file 0, output file 1)
Adapted from (Dean and Ghemawat, OSDI 2004)
Source: Google
The datacenter is the computer!
The datacenter is the computer!
It’s all about the right level of abstraction
Moving beyond the von Neumann architecture
What’s the “instruction set” of the datacenter computer?
Hide system-level details from the developers
No more race conditions, lock contention, etc.
No need to explicitly worry about reliability, fault tolerance, etc.
Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework (“runtime”) handles actual execution
The datacenter is the computer!
“Big ideas”
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Move processing to the data
Clusters have limited bandwidth; code is a lot smaller
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is good
Assume that components will break
Engineer software around hardware failures
Seek vs. Scans
Consider a 1 TB database with 100-byte records
We want to update 1 percent of the records
Scenario 1: Mutate each record
Each update takes ~30 ms (seek, read, write)
10^8 updates = ~35 days
Scenario 2: Rewrite all records
Assume 100 MB/s throughput
Time = 5.6 hours(!)
Lesson? Random access is expensive!
Source: Ted Dunning, on Hadoop mailing list
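The two estimates can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, assuming a decimal terabyte (10^12 bytes) and assuming that rewriting means reading the whole database once and writing it once; that second assumption is what reproduces the 5.6-hour figure. The class and method names are hypothetical.

```java
/** Back-of-the-envelope check of the seek vs. scan numbers. */
final class SeekVsScan {
    static final long DB_BYTES = 1_000_000_000_000L;        // 1 TB, decimal (assumption)
    static final long RECORD_BYTES = 100;
    static final double SECONDS_PER_UPDATE = 0.030;         // seek + read + write
    static final double THROUGHPUT_BYTES_PER_SEC = 100e6;   // 100 MB/s

    /** Number of records touched when 1 percent of the records are updated. */
    static long updates() {
        long records = DB_BYTES / RECORD_BYTES;  // 10^10 records
        return records / 100;                    // 1 percent of them: 10^8
    }

    /** Scenario 1: one random access per updated record. */
    static double randomAccessDays() {
        return updates() * SECONDS_PER_UPDATE / 86_400.0;
    }

    /** Scenario 2: stream the whole database in and back out (read + write assumed). */
    static double rewriteHours() {
        return 2.0 * DB_BYTES / THROUGHPUT_BYTES_PER_SEC / 3_600.0;
    }

    public static void main(String[] args) {
        System.out.printf("updates: %d%n", updates());
        System.out.printf("scenario 1: %.1f days%n", randomAccessDays());
        System.out.printf("scenario 2: %.1f hours%n", rewriteHours());
    }
}
```

With these inputs, scenario 1 comes to roughly 34.7 days and scenario 2 to roughly 5.6 hours, a gap of about 150x in favor of the sequential rewrite.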
Source: Wikipedia (Mahout)
So you want to drive the elephant!
So you want to drive the elephant!
(Aside, what about Spark?)
org.apache.hadoop.mapreduce
org.apache.hadoop.mapred
Source: Wikipedia (Budapest)
A tale of two packages…
MapReduce API*
Mapper<Kin,Vin,Kout,Vout>
void setup(Mapper.Context context)
Called once at the start of the task
void map(Kin key, Vin value, Mapper.Context context)
Called once for each key/value pair in the input split
void cleanup(Mapper.Context context)
Called once at the end of the task
Reducer<Kin,Vin,Kout,Vout>/Combiner<Kin,Vin,Kout,Vout>
void setup(Reducer.Context context)
Called once at the start of the task
void reduce(Kin key, Iterable<Vin> values, Reducer.Context context)
Called once for each key
void cleanup(Reducer.Context context)
Called once at the end of the task
*Note that there are two versions of the API!
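To make the calling order concrete, here is a small stand-in for a map task, written without Hadoop. It is a sketch only: `MiniMapper`, `runTask`, and `LineCounter` are hypothetical names, the output is a plain list of strings instead of a `Context`, and the key passed to `map` is a byte offset computed under a simplifying assumption noted in the code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the setup / map / cleanup calling order for one task. */
final class TaskLifecycleSketch {

    /** Hypothetical stand-in for Mapper; not the Hadoop class. */
    abstract static class MiniMapper {
        void setup(List<String> out) {}
        abstract void map(long key, String value, List<String> out);
        void cleanup(List<String> out) {}
    }

    /** Mimics how a framework could drive one map task over its input split. */
    static List<String> runTask(MiniMapper mapper, List<String> split) {
        List<String> out = new ArrayList<>();
        mapper.setup(out);                       // once, before any record
        long offset = 0;
        for (String line : split) {
            mapper.map(offset, line, out);       // once per record
            offset += line.length() + 1;         // assumes 1-byte chars plus a newline
        }
        mapper.cleanup(out);                     // once, after the last record
        return out;
    }

    /** Example mapper: keeps state across map() calls and emits it in cleanup(). */
    static final class LineCounter extends MiniMapper {
        private int lines;
        @Override void setup(List<String> out) { lines = 0; }
        @Override void map(long key, String value, List<String> out) { lines++; }
        @Override void cleanup(List<String> out) { out.add("lines=" + lines); }
    }

    public static void main(String[] args) {
        // prints [lines=3]
        System.out.println(runTask(new LineCounter(), List.of("a b", "c", "d e f")));
    }
}
```

The point of the example is that state set in `setup` survives across every `map` call of the same task and is still available in `cleanup`.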
MapReduce API*
Partitioner<K, V>
int getPartition(K key, V value, int numPartitions)
Returns the partition number given total number of partitions
Job
Represents a packaged Hadoop job for submission to cluster
Need to specify input and output paths
Need to specify input and output formats
Need to specify mapper, reducer, combiner, partitioner classes
Need to specify intermediate/final key/value classes
Need to specify number of reducers (but not mappers, why?)
Don’t depend on defaults!
*Note that there are two versions of the API!
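The `getPartition` contract (return a number in `[0, numPartitions)` for every key) can be met with a hash of the key. The sketch below is a plain-Java illustration rather than a Hadoop class; `HashPartitionSketch` is a hypothetical name and the value argument is left out because it is not used. The sign-bit mask matters because `hashCode()` may be negative and Java's `%` keeps the sign of the dividend.

```java
/** Sketch of hash partitioning: map any key to a partition in [0, numPartitions). */
final class HashPartitionSketch {

    /** Clears the sign bit before taking the remainder so the result is never negative. */
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** What goes wrong without the mask: a negative hash gives a negative "partition". */
    static int naivePartition(Object key, int numPartitions) {
        return key.hashCode() % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + " -> partition " + getPartition(key, 3));
        }
        // Integer.valueOf(-7).hashCode() is -7, so the naive version returns -3 here.
        System.out.println("naive: " + naivePartition(-7, 4)
                + ", masked: " + getPartition(-7, 4));
    }
}
```

Every pair with the same key lands in the same partition, which is what lets a single reducer see all values for that key.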
Data Types in Hadoop: Keys and Values
Writable: Defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable: Defines a sort order. All keys must be of this type (but not values).
IntWritable, LongWritable, Text, …: Concrete classes for different data types. Note that these are container objects.
SequenceFile: Binary-encoded sequence of key/value pairs.
“Hello World” MapReduce: Word Count
def map(key: Long, value: String) = {
  for (word <- tokenize(value)) {
    emit(word, 1)
  }
}
def reduce(key: String, values: Iterable[Int]) = {
  var sum = 0
  for (value <- values) {
    sum += value
  }
  emit(key, sum)
}
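Word Count Mapper
// Needs java.io.IOException plus Hadoop's LongWritable, Text, and IntWritable
// (org.apache.hadoop.io) and Mapper (org.apache.hadoop.mapreduce).
private static final class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Reused across calls to avoid creating an object per emitted pair.
  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Tokenizer is a helper defined elsewhere (not shown on the slide).
    for (String word : Tokenizer.tokenize(value.toString())) {
      WORD.set(word);
      context.write(WORD, ONE);
    }
  }
}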
Word Count Reducer
// Needs java.io.IOException and java.util.Iterator plus Hadoop's Text and
// IntWritable (org.apache.hadoop.io) and Reducer (org.apache.hadoop.mapreduce).
private static final class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Reused output container for the per-key total.
  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}
Three Gotchas
Avoid object creation
Execution framework reuses value object in reducer
Passing parameters via class statics doesn’t work!
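The second gotcha is easy to reproduce without a cluster. The sketch below imitates a framework that hands the reducer the same container object on every iteration, refilled with the next value. `IntBox` is a hypothetical stand-in for a container such as `IntWritable`, and `reusingValues` is an assumed simplification of that reuse behavior, not Hadoop's actual iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch of why buffering reducer values by reference goes wrong under object reuse. */
final class ValueReuseSketch {

    /** Hypothetical mutable container, playing the role of IntWritable. */
    static final class IntBox {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    /** Iterable that returns the SAME IntBox each time, refilled with the next value. */
    static Iterable<IntBox> reusingValues(int... raw) {
        IntBox shared = new IntBox();
        return () -> new Iterator<IntBox>() {
            private int next = 0;
            @Override public boolean hasNext() { return next < raw.length; }
            @Override public IntBox next() {
                shared.set(raw[next++]);
                return shared;
            }
        };
    }

    /** Buggy: keeps references, so every buffered element shows the last value. */
    static List<Integer> bufferReferences(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) {
            kept.add(v);
        }
        List<Integer> seen = new ArrayList<>();
        for (IntBox v : kept) {
            seen.add(v.get());
        }
        return seen;
    }

    /** Correct: copies the primitive out before the container is overwritten. */
    static List<Integer> bufferCopies(Iterable<IntBox> values) {
        List<Integer> seen = new ArrayList<>();
        for (IntBox v : values) {
            seen.add(v.get());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(bufferReferences(reusingValues(1, 2, 3))); // prints [3, 3, 3]
        System.out.println(bufferCopies(reusingValues(1, 2, 3)));     // prints [1, 2, 3]
    }
}
```

Summing values as they stream past, as the word count reducer does, is unaffected; the problem only appears when values are stored for later.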
Getting Data to Mappers and Reducers
Configuration parameters
Pass in via Job configuration object
“Side data”
DistributedCache
Mappers/Reducers can read from HDFS in setup method
Complex Data Types in Hadoop
How do you implement complex data types?
The easiest way:
Encode it as Text, e.g., (a, b) = “a:b”
Use regular expressions to parse and extract data
Works, but janky
The hard way:
Define a custom implementation of Writable(Comparable)
Must implement: readFields, write, (compareTo)
Computationally efficient, but slow for rapid prototyping
Implement WritableComparator hook for performance
Somewhere in the middle:
Bespin (via lin.tl) offers various building blocks
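As an illustration of “the hard way”, the sketch below writes a pair of strings with the three methods the slide lists. To stay runnable without Hadoop on the classpath it does not implement `WritableComparable`; it only mirrors the method shapes (`write(DataOutput)`, `readFields(DataInput)`, `compareTo`) using `java.io`. `StringPairSketch` and `roundTrip` are hypothetical names, and the choice of `writeUTF` as the encoding is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical pair type with Writable-style serialization and a sort order. */
final class StringPairSketch implements Comparable<StringPairSketch> {
    private String left = "";
    private String right = "";

    /** No-arg constructor: an empty instance is created first, then filled by readFields. */
    StringPairSketch() {}

    StringPairSketch(String left, String right) {
        this.left = left;
        this.right = right;
    }

    /** Serialize both fields, in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    /** Deserialize in exactly the order used by write. */
    public void readFields(DataInput in) throws IOException {
        left = in.readUTF();
        right = in.readUTF();
    }

    /** Sort by left element, breaking ties on the right element. */
    @Override
    public int compareTo(StringPairSketch other) {
        int c = left.compareTo(other.left);
        return c != 0 ? c : right.compareTo(other.right);
    }

    @Override
    public String toString() {
        return "(" + left + ", " + right + ")";
    }

    /** Writes a pair to bytes and reads it back into a fresh instance. */
    static StringPairSketch roundTrip(StringPairSketch pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            StringPairSketch copy = new StringPairSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new StringPairSketch("a", "b"))); // prints (a, b)
    }
}
```

Compared with the “a:b” text encoding, nothing needs to be parsed with regular expressions and a value containing “:” causes no ambiguity, at the cost of more code per type.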
Anatomy of a Job
Hadoop MapReduce program = Hadoop job
Jobs are divided into map and reduce tasks
An instance of a running task is called a task attempt
Each task occupies a slot on the tasktracker
Multiple jobs can be composed into a workflow
Job submission:
Client (i.e., driver program) creates a job, configures it, and submits it to jobtracker
That’s it! The Hadoop cluster takes over…
[Figure repeated from the Physical View slide: MapReduce execution overview, adapted from (Dean and Ghemawat, OSDI 2004)]
Anatomy of a Job
Behind the scenes:
Input splits are computed (on client end)
Job data (jar, configuration XML) are sent to jobtracker
Jobtracker puts job data in shared location, enqueues tasks
Tasktrackers poll for tasks
Off to the races…
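A rough picture of the first step: the client turns each input file into byte ranges, and each range becomes one map task. The sketch below is a deliberately simplified model, assuming fixed-size splits; Hadoop's real computation depends on the input format and its configuration, which this does not attempt to reproduce. `InputSplitSketch`, `Split`, and `computeSplits` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of cutting one input file into splits (one map task per split). */
final class InputSplitSketch {

    /** Hypothetical byte range of a file. */
    static final class Split {
        final long offset;
        final long length;
        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
        @Override public String toString() {
            return "[offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Cuts a file of fileLength bytes into consecutive ranges of at most splitSize bytes. */
    static List<Split> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new Split(offset, Math.min(splitSize, fileLength - offset)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // A 300 MB file with 128 MB splits yields three splits: 128, 128, and 44 MB.
        for (Split s : computeSplits(300 * mb, 128 * mb)) {
            System.out.println(s);
        }
    }
}
```

Under this model the number of map tasks falls out of the input size and the split size, which is one way to see why a job specifies the number of reducers but not the number of mappers.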
[Figure: the InputFormat divides each Input File into InputSplits; a RecordReader reads each InputSplit and feeds a Mapper, which produces Intermediates]
Source: redrawn from a slide by Cloudera, cc-licensed
[Figure: a Client and a series of InputSplits; for each InputSplit, a RecordReader supplies Records to a Mapper]
Where’s the data actually coming from?
[Figure: each Mapper's output passes through a Partitioner to produce Intermediates, which are regrouped into the Intermediates read by each Reducer (combiners omitted here)]
Source: redrawn from a slide by Cloudera, cc-licensed
[Figure: each Reducer writes its Output File through a RecordWriter supplied by the OutputFormat]
Source: redrawn from a slide by Cloudera, cc-licensed
Input and Output
InputFormat
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat
…
OutputFormat
TextOutputFormat
SequenceFileOutputFormat
…
Spark also uses these abstractions for reading and writing data!
Hadoop Workflow
[Figure: You, Submit node (datasci), Hadoop Cluster]
Getting data in?
Writing code?
Getting data out?
Where’s the actual data stored?
Debugging Hadoop
First, take a deep breath
Start small, start locally
Build incrementally
Source: Wikipedia (The Scream)
Code Execution Environments
Different ways to run code:
Local (standalone) mode
Pseudo-distributed mode
Fully-distributed mode
Learn what’s good for what
Hadoop Debugging Strategies
Good ol’ System.out.println
Learn to use the webapp to access logs
Logging preferred over System.out.println
Be careful how much you log!
Fail on success
Throw RuntimeExceptions and capture state
Use Hadoop as the “glue”
Implement core functionality outside mappers and reducers
Independently test (e.g., unit testing)
Compose (tested) components in mappers and reducers
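The “glue” advice can be illustrated with the word count tokenizer: if tokenization lives in a plain static method, it can be checked on a laptop with no cluster involved, and the mapper only has to call it. The sketch below is an assumption-laden example, not the course's tokenizer; `TokenizerSketch`, its tokenization rule (lowercase, split on non-letters), and the `check` helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Core logic kept outside any mapper so it can be tested on its own. */
final class TokenizerSketch {

    /** Hypothetical rule: lowercase the line and split on runs of non-letters. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Minimal stand-in for a unit-test assertion. */
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        check(tokenize("Hello, world!").equals(List.of("hello", "world")), "punctuation");
        check(tokenize("").isEmpty(), "empty line");
        check(tokenize("  --  ").isEmpty(), "line without letters");
        System.out.println("all tokenizer checks passed");
    }
}
```

In a real project the checks would live in a unit-testing framework; the mapper then reduces to a loop over `tokenize(value.toString())`, which leaves little to debug on the cluster.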
Source: Wikipedia (Japanese rock garden)
Questions?