Large-scale file systems and Map-Reduce
(Slides based on www.mmds.com)
Single-node architecture
[Figure: a single node with CPU, memory, and disk]
Google example:
• 20+ billion web pages x 20 KB = 400+ terabytes
• One computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• It takes even longer to do something useful with the data
• A new standard architecture is emerging:
– Cluster of commodity Linux nodes
– Gigabit Ethernet interconnect
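A quick back-of-the-envelope check of these figures (taking the slide's own 35 MB/s read rate as the assumption):

```python
# Rough sanity check of the numbers above.
pages = 20e9                          # 20+ billion web pages
bytes_total = pages * 20e3            # 20 KB per page -> 4e14 B = 400 TB
days = bytes_total / (35e6 * 86400)   # one disk reading at 35 MB/s
print(bytes_total / 1e12, "TB;", round(days), "days")  # 400.0 TB; 132 days
```

132 days is roughly the "~4 months to read the web" on the slide.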
Distributed File Systems
• Files are very large and are read or appended to, rarely updated in place.
• Files are divided into chunks, typically 64 MB per chunk.
• Chunks are replicated at several compute nodes.
• A master node (possibly replicated) keeps track of the locations of all chunks.
Commodity clusters: compute nodes
• Organized into racks.
• Intra-rack connection is typically gigabit speed.
• Inter-rack connection is faster, but only by a small factor.
• Recall that chunks are replicated.
Some implementations:
• GFS (Google File System, proprietary). In August 2006 Google had ~450,000 machines.
• HDFS (Hadoop Distributed File System, open source).
• CloudStore (formerly the Kosmos File System, open source).
Problems with large-scale computing on commodity hardware
• Challenges:
– How do you distribute computation?
– How can we make it easy to write distributed programs?
– Machines fail:
• One server may stay up 3 years (~1,000 days)
• If you have 1,000 servers, expect to lose one per day
• Google was estimated to have ~1M machines in 2011
– that means 1,000 machines failing every day!
• Issue: copying data over a network takes time
• Idea:
– Bring computation close to the data
– Store files multiple times for reliability
• Map-reduce addresses these problems
– Google's computational/data-manipulation model
– An elegant way to work with big data
– Storage infrastructure: a distributed file system (Google: GFS, Hadoop: HDFS)
– Programming model: Map-Reduce
• Problem: if nodes fail, how do we store data persistently?
• Answer: a distributed file system
– Provides a global file namespace
– Google: GFS; Hadoop: HDFS
• Typical usage pattern:
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common
[Figure: racks of compute nodes; a file is divided into chunks, and the chunks are replicated across nodes]
[Figure: 3-way replication of file chunks, with copies on different racks]
Map-Reduce
• You write two functions, Map and Reduce.
– They each have a special form, to be explained.
• The system (e.g., Hadoop) creates a large number of tasks for each function.
– Work is divided among the tasks in a precise way.
Map-Reduce Algorithms
• Map tasks convert inputs to key-value pairs.
– "Keys" are not necessarily unique.
• Outputs of Map tasks are sorted by key, and each key is assigned to one Reduce task.
• Reduce tasks combine the values associated with each key.
Simple map-reduce example: Word Count
• We have a large file of words, one word per line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
• Different scenarios:
– Case 1: entire file fits in main memory
– Case 2: file too large for main memory, but all <word, count> pairs fit in main memory
– Case 3: file on disk, too many distinct words to fit in memory
Word Count
• Map task: for each word, e.g. CAT, output (CAT,1)
• Total map output: (w1,1), (w1,1), ..., (w1,1), (w2,1), (w2,1), ..., (w2,1), ...
Each (w,1) is hashed to bucket h(w) in [0, r-1] in a local intermediate file, where r is the number of reducers.
• Master: group by key: (w1,[1,1,...,1]), (w2,[1,1,...,1]), ... and push each group (w,[1,1,...,1]) to reducer h(w)
• Reduce task (reducer h(w)):
– Read: (w,[1,1,...,1])
– Aggregate: each (w,[1,1,...,1]) into (w,sum)
– Output: (w,sum) into a common output file
• Since addition is commutative and associative, the map task could instead have sent pre-aggregated pairs: (w1,sum1), (w2,sum2), ...
• The reduce task would then receive (wi,sumi,1), (wi,sumi,2), ..., (wj,sumj,1), (wj,sumj,2), ... (one partial sum per map task) and output (wi,sumi), (wj,sumj), ..., where each sumi adds up the partial sums sumi,k
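A minimal sketch of the two functions in Python, with a tiny local harness standing in for the framework's shuffle (the harness is purely illustrative):

```python
from collections import defaultdict

def map_task(lines):
    # Emit (word, 1) for each word; input is one word per line
    for line in lines:
        word = line.strip()
        if word:
            yield (word, 1)

def reduce_task(word, counts):
    # Combine all counts for one word into a single (word, sum) pair
    yield (word, sum(counts))

def run(lines, r=4):
    # Local stand-in for the framework: hash each (w,1) to bucket h(w),
    # then run one reducer per bucket
    buckets = [defaultdict(list) for _ in range(r)]
    for w, c in map_task(lines):
        buckets[hash(w) % r][w].append(c)
    for bucket in buckets:
        for w, cs in bucket.items():
            yield from reduce_task(w, cs)

print(dict(run(["cat", "dog", "cat"])))  # e.g. {'cat': 2, 'dog': 1}
```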
Partition Function
• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• The system uses a default partition function, e.g., hash(key) mod R
• Sometimes it is useful to override it (see the sketch below)
– E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
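A custom partitioner of this kind might look as follows; this is a generic sketch, not any particular framework's API, and R = 8 is an assumed reducer count:

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks (assumed for illustration)

def default_partition(key):
    return hash(key) % R

def url_host_partition(url):
    # Route all URLs from the same host to the same reducer,
    # hence into the same output file
    return hash(urlparse(url).hostname) % R

print(url_host_partition("http://example.com/a") ==
      url_host_partition("http://example.com/b"))  # True
```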
Coordination
• Master data structures:
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer
– The master pushes this info to the reducers
• The master pings workers periodically to detect failures
Data flow
• Input and final output are stored on a distributed file system
– The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
• Intermediate results are stored on the local FS of the map and reduce workers
• Output is often the input to another map-reduce task
Failures
• Map worker failure
– Map tasks completed or in progress at the worker are reset to idle (their results sit locally at the failed worker)
– Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– The whole map-reduce job is aborted and the client is notified
How many Map and Reduce tasks?
• M map tasks, R reduce tasks
• Rule of thumb:
– Make M and R much larger than the number of nodes in the cluster
– One DFS chunk per map task is common
– This improves dynamic load balancing and speeds up recovery from worker failure
• Usually R is smaller than M, because the output is spread across R files
Relational operators with map-reduce

Selection: σ_C(R)
Map task: if C(t) is true, output the pair (t,t)
Reduce task: with input (t,t), output t

Selection is not really suitable for map-reduce; everything could have been done in the map task.
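A sketch in the same style as the word-count harness, with the predicate C as a parameter:

```python
def selection_map(tuples, C):
    # Emit (t, t) for each tuple t satisfying the predicate C
    for t in tuples:
        if C(t):
            yield (t, t)

def selection_reduce(t, values):
    # Input (t, [t, ...]) -> output t: a pure pass-through, which is why
    # selection does not really need the reduce step at all
    yield t

R = [(1, 'a'), (2, 'b'), (3, 'a')]
pairs = selection_map(R, C=lambda t: t[1] == 'a')
print([t for t, _ in pairs])  # [(1, 'a'), (3, 'a')]
```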
Relational operators with map-reduce

Projection: π_L(R)
Map task: let t' be the projection of t onto the attribute list L; output the pair (t',t')
Reduce task: with input (t',[t',t',...,t']), output t'

Here duplicate elimination is done by the reduce task.
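A sketch; grouping on t' is exactly what eliminates the duplicates:

```python
def projection_map(tuples, L):
    # L lists the attribute positions to keep, e.g. (0,) for the first column
    for t in tuples:
        t_proj = tuple(t[i] for i in L)
        yield (t_proj, t_proj)

def projection_reduce(t_proj, values):
    # (t', [t', t', ...]) -> t': many duplicates collapse into one output
    yield t_proj

R = [(1, 'a'), (1, 'b'), (2, 'a')]
groups = {}
for k, v in projection_map(R, L=(0,)):
    groups.setdefault(k, []).append(v)
out = []
for k, vs in groups.items():
    out.extend(projection_reduce(k, vs))
print(sorted(out))  # [(1,), (2,)] -- duplicates eliminated
```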
Relational operators with map-reduce

• Union R ∪ S
Map task: for each tuple t of the chunk of R or S, output (t,t)
Reduce task: input is (t,[t]) or (t,[t,t]); output t
• Intersection R ∩ S
Map task: for each tuple t of the chunk, output (t,t)
Reduce task: if input is (t,[t,t]), output t; if input is (t,[t]), output nothing
• Difference R − S (see the sketch below)
Map task: for each tuple t of R output (t,R); for each tuple t of S output (t,S)
Reduce task: if input is (t,[R]), output t; if input is (t,[R,S]) or (t,[S]), output nothing
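Difference is the interesting case, since the reducer must know which relation each tuple came from; a sketch using string tags for the relation names:

```python
from collections import defaultdict

def difference_map(R, S):
    # Tag each tuple with the relation it came from
    for t in R:
        yield (t, 'R')
    for t in S:
        yield (t, 'S')

def difference_reduce(t, tags):
    # Keep t only if it never appeared in S
    if 'S' not in tags:
        yield t

R, S = [(1,), (2,), (3,)], [(2,)]
groups = defaultdict(list)
for t, tag in difference_map(R, S):
    groups[t].append(tag)
result = []
for t, tags in groups.items():
    result.extend(difference_reduce(t, tags))
print(sorted(result))  # [(1,), (3,)]
```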
Joining by Map-Reduce
• Suppose we want to compute R(A,B) ⋈ S(B,C), using k Reduce tasks.
– I.e., find tuples with matching B-values.
• R and S are each stored in a chunked file.
• Use a hash function h from B-values to k buckets.
– Bucket = Reduce task.
• The Map tasks take chunks from R and S, and send:
– Tuple R(a,b) to Reduce task h(b). Key = b, value = R(a,b).
– Tuple S(b,c) to Reduce task h(b). Key = b, value = S(b,c).
[Figure: Reduce task i receives R(a,b) from the map tasks whenever h(b) = i, and S(b,c) whenever h(b) = i; it produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]
• Key point: if R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b).
• Thus, their join (a,b,c) will be produced there and shipped to the output file.
Mapping tuples in joins
• Mapper for R(1,2): R(1,2) → (2, (R,1))
• Mapper for R(4,2): R(4,2) → (2, (R,4))
• Mapper for S(2,3): S(2,3) → (2, (S,3))
• Mapper for S(5,6): S(5,6) → (5, (S,6))
After grouping by key:
• Reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)])
• Reducer for B = 5 receives (5, [(S,6)])
Output of the Reducers
• Reducer for B = 2: input (2, [(R,1), (R,4), (S,3)]) → outputs (1,2,3) and (4,2,3)
• Reducer for B = 5: input (5, [(S,6)]) → outputs nothing (there is no matching R tuple)
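The whole two-way join on the example tuples above, as a sketch with a local stand-in for the shuffle:

```python
from collections import defaultdict
from itertools import product

def join_map(R, S):
    for a, b in R:
        yield (b, ('R', a))   # key = b, value stands for R(a,b)
    for b, c in S:
        yield (b, ('S', c))   # key = b, value stands for S(b,c)

def join_reduce(b, values):
    r_side = [a for tag, a in values if tag == 'R']
    s_side = [c for tag, c in values if tag == 'S']
    for a, c in product(r_side, s_side):
        yield (a, b, c)

R = [(1, 2), (4, 2)]
S = [(2, 3), (5, 6)]
groups = defaultdict(list)
for b, v in join_map(R, S):
    groups[b].append(v)
out = []
for b, vs in groups.items():
    out.extend(join_reduce(b, vs))
print(sorted(out))  # [(1, 2, 3), (4, 2, 3)] -- the B=5 reducer emits nothing
```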
Relational operators with map-reduce

Grouping and aggregation: γ_{A, agg(B)}(R(A,B,C))
Map task: for each tuple (a,b,c), output the pair (a,b)
Reduce task: if input is (a,[b1, b2, ..., bn]), output (a, agg(b1, b2, ..., bn)),
for example (a, b1 + b2 + ... + bn) when agg = SUM
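A sketch with SUM as the aggregate:

```python
from collections import defaultdict

def group_map(tuples):
    for a, b, c in tuples:
        yield (a, b)            # key = grouping attribute A, value = B

def group_reduce(a, bs, agg=sum):
    yield (a, agg(bs))          # (a, agg(b1, ..., bn)); here agg = SUM

R = [(1, 10, 'x'), (1, 5, 'y'), (2, 7, 'z')]
groups = defaultdict(list)
for a, b in group_map(R):
    groups[a].append(b)
out = []
for a, bs in groups.items():
    out.extend(group_reduce(a, bs))
print(sorted(out))              # [(1, 15), (2, 7)]
```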
Matrix-vector multiplication using map-reduce
Compute x = Mv, where x_i = Σ_{j=1}^{n} m_ij · v_j
If the vector does not fit in main memory
Divide the matrix and the vector into stripes:
Each map task gets a chunk of stripe i of the matrix and the entire stripe i of the vector, and produces a pair (k, m_kj · v_j) for every element m_kj in its chunk.
Reduce task k gets all pairs with key k and produces the pair (k, x_k), where x_k is the sum of the received values.
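A sketch with a toy 4×4 matrix split into two vertical stripes (NumPy is used only for the data and the final check):

```python
from collections import defaultdict
import numpy as np

M = np.arange(16.0).reshape(4, 4)   # toy 4x4 matrix
v = np.array([1.0, 2.0, 3.0, 4.0])
stripes = [(0, 2), (2, 4)]          # column ranges: two vertical stripes

def map_task(lo, hi):
    # This task sees only columns lo..hi of M and entries lo..hi of v
    for k in range(M.shape[0]):
        for j in range(lo, hi):
            yield (k, M[k, j] * v[j])   # pair (k, m_kj * v_j)

partial = defaultdict(float)
for lo, hi in stripes:
    for k, val in map_task(lo, hi):
        partial[k] += val               # reduce task k sums its pairs

x = np.array([partial[k] for k in range(M.shape[0])])
print(np.allclose(x, M @ v))            # True
```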
[Example: the original slide shows the concrete pairs produced by the mappers and summed by the reducers for a small matrix-vector instance]
Examples:
• Hamming distance 1 between bit-strings
• Matrix multiplication in one MR-round
• Matrix multiplication in two MR-rounds
• Three-way joins in two rounds and in one round
Relational operators with map-reduce: Three-Way Join
• We shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D).
• One way: a cascade of two 2-way joins, each implemented by map-reduce.
• Fine, unless the 2-way joins produce large intermediate relations.
Another 3-Way Join
• Reduce processes use hash values of entire S(B,C) tuples as keys.
• Choose a hash function h that maps B- and C-values to k buckets.
• There are k² Reduce processes, one for each (B-bucket, C-bucket) pair.
Job of the Reducers
• Each reducer gets, for certain B-values b and C-values c:
1. All tuples from R with B = b,
2. All tuples from T with C = c, and
3. The tuple S(b,c), if it exists.
• Thus it can create every tuple of the form (a,b,c,d) in the join.
Mapping for 3-Way Join
• We map each tuple S(b,c) to key (h(b), h(c)) with value (S, b, c).
• We map each tuple R(a,b) to key (h(b), y) with value (R, a, b), for all y = 1, 2, ..., k.
• We map each tuple T(c,d) to key (x, h(c)) with value (T, c, d), for all x = 1, 2, ..., k.
Aside: even normal map-reduce allows a single input to map to several key-value pairs.
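A sketch of just this mapping step; k = 2 and the bucket function h are illustrative choices, and the reducers would then join within each (h(b), h(c)) cell as described above:

```python
k = 2
def h(v):
    return v % k  # toy hash into k buckets, for illustration only

def three_way_map(R, S, T):
    for b, c in S:
        yield ((h(b), h(c)), ('S', b, c))   # S goes to exactly one cell
    for a, b in R:
        for y in range(k):                  # R is replicated across a row
            yield ((h(b), y), ('R', a, b))
    for c, d in T:
        for x in range(k):                  # T is replicated down a column
            yield ((x, h(c)), ('T', c, d))

# Each R tuple goes to k reducers, each T tuple to k reducers,
# and each S tuple to exactly one:
for kv in three_way_map(R=[(1, 2)], S=[(2, 3)], T=[(3, 4)]):
    print(kv)
```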
Assigning Tuples to Reducers
[Figure: a k × k grid of reducers indexed by (h(b), h(c)), with h(b) = 0, 1, 2, 3 down the rows and h(c) = 0, 1, 2, 3 across the columns. A tuple S(b,c) with h(b)=1 and h(c)=2 goes to the single cell (1,2); a tuple R(a,b) with h(b)=2 is replicated across the entire row h(b)=2; a tuple T(c,d) with h(c)=3 is replicated down the entire column h(c)=3.]
Example (k = 2, with buckets 1 and 2 and h the identity on {1, 2}):

DB: R = {R(1,1), R(1,2), R(2,1), R(2,2)}, S = {S(1,1), S(1,2), S(2,1), S(2,2)}, T = {T(1,1), T(1,2), T(2,1), T(2,2)}, etc.

Contents of the four reducers (h(b), h(c)):
• Reducer (1,1): R(1,1), R(2,1); S(1,1); T(1,1), T(1,2)
• Reducer (1,2): R(1,1), R(2,1); S(1,2); T(2,1), T(2,2)
• Reducer (2,1): R(1,2), R(2,2); S(2,1); T(1,1), T(1,2)
• Reducer (2,2): R(1,2), R(2,2); S(2,2); T(2,1), T(2,2)

Sample mapper outputs:
• Mapper for R(1,1): (1,1,(R,1)) and (1,2,(R,1)) (one copy per reducer in row h(b)=1)
• Mapper for S(1,2): (1,2,(S,1,2)) (S tuples go to a single reducer)

Joins produced at reducer (1,1): R(1,1) S(1,1) T(1,1), R(2,1) S(1,1) T(1,1), R(2,1) S(1,1) T(1,2), R(1,1) S(1,1) T(1,2)