Brief Overview on
Bigdata, Hadoop, MapReduce
Jianer Chen, CSCE-629, Fall 2015
A Lot of Data
• Google processes 20 PB a day (2008)
• The Wayback Machine has 3 PB + 100 TB/month (03/2009) – 9.6 PB recently
• Facebook processes 500 TB/day (08/2012)
• eBay has > 10 PB of user data + 50 TB/day (01/2012)
• The CERN Data Centre has over 100 PB of physics data
KB (kilobyte) = 10^3 bytes; MB (megabyte) = 10^6 bytes; GB (gigabyte) = 10^9 bytes; TB (terabyte) = 10^12 bytes; PB (petabyte) = 10^15 bytes
A Lot of Data: Google Example
• 20+ billion web pages × 20 KB = 400+ TB
  - one computer reads 30-35 MB/sec from disk, so it would take more than 4 months just to read the web pages
  - 1,000 hard drives are needed just to store the web pages
• Not scalable: it takes even more to do something useful with the data!
• A standard architecture for such problems has emerged:
  - a cluster of commodity Linux nodes
  - a commodity network (Ethernet) to connect them
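As a sanity check, the read-time arithmetic above can be reproduced in a few lines of Python (the 32.5 MB/sec rate is just the midpoint of the 30-35 MB/sec range quoted above):

```python
# Back-of-the-envelope check of the numbers above.
pages = 20 * 10**9                 # 20+ billion web pages
page_size = 20 * 10**3             # 20 KB each (decimal units, as in the slides)
total_bytes = pages * page_size    # 4e14 bytes = 400 TB

read_rate = 32.5 * 10**6           # bytes/sec; midpoint of 30-35 MB/sec
seconds = total_bytes / read_rate
months = seconds / 86_400 / 30

print(f"{total_bytes / 10**12:.0f} TB")      # 400 TB
print(f"about {months:.1f} months to read")  # more than 4 months
```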
Cluster Architecture: Many Machines
[Diagram: racks of nodes, each node with its own CPU and memory; a switch in each rack provides 1 Gbps between nodes in the same rack, and a backbone switch provides 2-10 Gbps between racks.]
Each rack has 16-64 nodes. Google had 1 million machines in 2011.
Hadoop Cluster
[Diagram of a Hadoop cluster; NN: name node, DN: data node, TT: task tracker. From: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/]
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
Cluster Computing: A Classical Algorithmic Idea: Divide-and-Conquer
[Diagram: the work is partitioned into work 1, 2, 3, 4; a "worker" solves each piece, producing results 1, 2, 3, 4; the partial results are then combined into the final result.]
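This picture can be sketched in a few lines of Python (the worker task here, a sum of squares over a chunk, is just an illustrative stand-in for any independent piece of work):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n):
    # Split the work into n roughly equal chunks.
    k = (len(work) + n - 1) // n
    return [work[i:i + k] for i in range(0, len(work), k)]

def worker(chunk):
    # Each "worker" solves its piece independently.
    return sum(x * x for x in chunk)

def combine(results):
    # Merge the partial results into the final answer.
    return sum(results)

work = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(worker, partition(work, 4)))
print(combine(partial))  # 338350 = 1^2 + 2^2 + ... + 100^2
```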
Challenges in Cluster Computing
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
What is the common theme of all of these problems?
• Parallelization problems arise from:
  - communication between workers (e.g., to exchange state)
  - access to shared resources (e.g., data)
• We need a synchronization mechanism.
• We need the right level of abstraction
  - a new model more appropriate for the multicore/cluster environment
• Hide system-level details from the developers
  - no more race conditions, lock contention, etc.
• Separate the what from the how
  - the developer specifies the computation that needs to be performed
  - the execution framework handles the actual execution
Therefore,
This motivated MapReduce.
![Page 13: Brief Overview on Bigdata, Hadoop, MapReduce Jianer Chen CSCE-629, Fall 2015](https://reader036.vdocuments.us/reader036/viewer/2022062401/5a4d1b7c7f8b9ab0599b99d2/html5/thumbnails/13.jpg)
MapReduce: Big Ideas
• Failures are common in cluster systems
  - the MapReduce implementation copes with failures (automatic task restart)
• Data movements are expensive in supercomputers
  - MapReduce moves processing to the data (leveraging locality)
• Disk I/O is time-consuming
  - MapReduce organizes computation into long streaming operations
• Developing distributed software is difficult
  - MapReduce isolates developers from implementation details
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each (the map step)
• Shuffle and sort intermediate results
• Aggregate intermediate results (the reduce step)
• Generate final output
Key idea of MapReduce: provide a functional abstraction for these two operations. [Dean and Ghemawat, OSDI 2004]
MapReduce: General Framework
[Diagram: the input is divided into InputSplits, which feed a row of map tasks; a shuffle-and-sort stage routes the intermediate pairs to a row of reduce tasks; the output is written to the DFS. The map and reduce functions are user specified; everything else (splitting, shuffle and sort, output handling) is system provided.]
MapReduce
• Programmers specify two functions:
    map (k1, v1) → (k2, v2)*
    reduce (k2, v2*) → (k3, v3)*
  All values with the same key are sent to the same reducer.
• The execution framework handles everything else.
Example: Word Count

Map(String docID, String text):            // map(docID, text) → (word, 1)*
  for each word w in text:
    Emit(w, 1);

Reduce(String word, Iterator<int> values): // reduce(word, [1, …, 1]) → (word, sum)
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(word, sum);
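The whole pipeline can be simulated in a few lines of Python (a toy, single-process sketch of the framework, not how Hadoop actually executes it):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map(docID, text) → (word, 1)*
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, values):
    # reduce(word, [1, ..., 1]) → (word, sum)
    yield (word, sum(values))

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle and sort: group all intermediate values by key so each
    # reduce call sees every value emitted for its key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    out = {}
    for k2 in sorted(groups):
        for k3, v3 in reducer(k2, groups[k2]):
            out[k3] = v3
    return out

docs = [("d1", "a b a c"), ("d2", "b c c a a")]
print(run_mapreduce(docs, map_fn, reduce_fn))  # {'a': 4, 'b': 2, 'c': 3}
```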
MapReduce: Word Count
[Diagram: four map tasks each read a (docID, text) InputSplit and emit one (word, 1) pair per occurrence, e.g., (a, 1), (b, 1), (a, 1), (c, 1), …; shuffle-and-sort aggregates the values by key; three reduce tasks sum the counts for a, b, and c and write the output (a, 5), (b, 3), (c, 4) to the DFS.]
MapReduce: Framework
• Handles scheduling
  - assigns workers to map and reduce tasks
• Handles "data distribution"
  - moves processes to data
• Handles synchronization
  - gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  - detects worker failures and restarts
• Everything happens on top of a distributed file system
MapReduce: User Specification
• Programmers specify two functions:
    map (k1, v1) → (k2, v2)*
    reduce (k2, v2*) → (k3, v3)*
  - all values with the same key are sent to the same reducer
• Mappers and reducers can specify any computation
  - be careful with access to external resources!
• The execution framework handles everything else… not quite: often, programmers also specify:
    partition (k2, number of partitions) → partition for k2
  - often a simple hash of the key, e.g., hash(k2) mod n
  - divides up the key space for parallel reduce operations
    combine (k2, v2) → (k2’, v2’)
  - mini-reducers that run in memory after the map phase
  - used as an optimization to reduce network traffic
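A minimal partitioner sketch in Python (the md5-based hash and the three-reducer setup are illustrative choices; Hadoop's default partitioner uses the key's hashCode mod the number of reduce tasks):

```python
import hashlib

def partition(key, num_partitions):
    # A stable hash of the key, mod the number of reduce tasks.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

# Route intermediate (key, value) pairs to 3 reducers.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
num_reducers = 3
buckets = {r: [] for r in range(num_reducers)}
for k, v in pairs:
    buckets[partition(k, num_reducers)].append((k, v))

# Every pair with the same key lands in the same bucket, so each
# reducer sees all values for the keys assigned to it.
print(buckets)
```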
MapReduce: Word Count with Combiners

Map(String docID, String text):
  H = new AssociativeArray;
  for each word w in text:
    H[w] = H[w] + 1;
  for each word w in H:
    Emit(w, H[w]);

Reduce(String word, Iterator<int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(word, sum);

[Diagram: each map task now aggregates its counts locally (combine) before the partition step, emitting, e.g., (a, 2) instead of (a, 1), (a, 1); shuffle-and-sort groups the smaller set of pairs by key, and the reducers produce the same output (a, 5), (b, 3), (c, 4).]
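The effect of this local aggregation can be sketched in Python (the document contents and counts below are made up for illustration):

```python
from collections import Counter, defaultdict

def map_with_combiner(doc_id, text):
    # Aggregate counts in memory first (the in-mapper "combine" step),
    # so the mapper emits one (word, count) pair per distinct word
    # instead of one (word, 1) pair per occurrence -- less network traffic.
    for word, count in Counter(text.split()).items():
        yield (word, count)

def reduce_fn(word, values):
    yield (word, sum(values))

docs = [("d1", "a b a c b c"), ("d2", "c a a a c b"), ("d3", "a")]
groups = defaultdict(list)
emitted = 0
for doc_id, text in docs:
    for k, v in map_with_combiner(doc_id, text):
        groups[k].append(v)
        emitted += 1

result = {k: next(reduce_fn(k, vs))[1] for k, vs in groups.items()}
print(result)   # {'a': 6, 'b': 3, 'c': 4}
print(emitted)  # 7 intermediate pairs instead of 13
```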
Example: Shortest-Path
Data structure:
• the adjacency list (with edge weights) for the graph
• each vertex v has a node ID
• Av denotes the set of neighbors of v
• dv denotes the current distance from the source to v

Basic ideas:
• the original input is (s, [0, As])
• on an input (v, [dv, Av]), the Mapper emits pairs whose key (i.e., vertex) is in Av, each with a distance derived from dv
• on an input (v, [dv, Av]*), the Reducer emits a pair (v, [dv, Av]) with the minimum distance dv
Map(v, [dv, Av]):
  Emit(v, [dv, Av]);
  for each w in Av do
    Emit(w, [dv + wt(v, w), Aw]);

Reduce(v, [dv, Av]*):
  dmin = +∞;
  for each [dv, Av] in [dv, Av]* do
    if dmin > dv then dmin = dv;
  Emit(v, [dmin, Av]);
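These iterations can be sketched in Python, with the shuffle simulated by grouping and the driver loop checking for convergence (the example graph is invented; for simplicity the adjacency lists are kept in a global dict rather than passed through the key-value pairs as in the pseudocode above):

```python
from collections import defaultdict

INF = float("inf")

# A toy weighted graph as adjacency lists; graph[v][w] is wt(v, w).
graph = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "c": 5},
    "b": {"c": 1},
    "c": {},
}

def map_fn(v, dv):
    # Emit the vertex's own distance, plus one candidate distance
    # per neighbor of v.
    yield (v, dv)
    if dv < INF:
        for w, wt in graph[v].items():
            yield (w, dv + wt)

def reduce_fn(v, distances):
    # Keep the minimum distance seen for v.
    return (v, min(distances))

# Driver: iterate map-shuffle-reduce rounds until no distance improves
# (in Hadoop the driver would check a counter instead).
dist = {v: (0 if v == "s" else INF) for v in graph}
while True:
    groups = defaultdict(list)
    for v, dv in dist.items():
        for k, d in map_fn(v, dv):
            groups[k].append(d)
    new_dist = dict(reduce_fn(v, ds) for v, ds in groups.items())
    if new_dist == dist:
        break
    dist = new_dist

print(dist)  # {'s': 0, 'a': 1, 'b': 3, 'c': 4}
```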
• MapReduce iterations:
  - the first iteration discovers all neighbors of the source s
  - the second iteration discovers all "2nd-level" neighbors of s
  - each iteration expands the "search frontier" by one hop
• The approach is suitable for graphs with small diameter (e.g., "small-world" graphs)
• A "driver" program is needed to check for termination of the algorithm (in practice: Hadoop counters)
• The algorithm can be extended to produce the actual path.
Summary: MapReduce Graph Algorithms
• Store graphs as adjacency lists
• Graph algorithms with MapReduce:
  - each map task receives a vertex and its outlinks
  - the map task computes some function of the link structure, emitting a value with the target vertex as the key
  - the reduce task collects these keys (target vertices) and aggregates
• Iterate over multiple MapReduce cycles until some termination condition holds
  - the graph structure is passed from one iteration to the next
• The same idea can be used to solve other graph problems
CSCE-629 Course Summary
• Basic notations, concepts, and techniques
  - pseudo-code for algorithms, big-Oh notation, divide-and-conquer, dynamic programming, solving recurrence relations
• Data manipulation
  - data structures, algorithms, complexity: heaps, 2-3 trees, hashing, Union-Find, finding the median
• Graph algorithms and applications
  - DFS and BFS and simple applications, connected components, topological sorting, strongly connected components, longest path in a DAG
• Computational optimization
  - maximum bandwidth paths, Dijkstra's algorithm (shortest path), Kruskal's algorithm (MST), Bellman-Ford algorithm (shortest path), matching in bipartite graphs, sequence alignment
• NP-completeness theory
  - P and polynomial-time computation, definition of NP, membership in NP, polynomial-time reducibility, NP-hardness and NP-completeness, proving NP-hardness and NP-completeness, NP-complete problems: SAT, IS, VC, Partition