Map-Reduce: Win -or- Epic Win
CSC313: Advanced Programming Topics

Uploaded by melba, 25-Feb-2016

TRANSCRIPT

Page 1: Map-Reduce: Win -or- Epic Win

MAP-REDUCE: WIN -OR- EPIC WIN

CSC313: Advanced Programming Topics

Page 2: Map-Reduce: Win -or- Epic Win

Brief History of Google

BackRub: 1996
4 disk drives

24 GB total storage


Page 4: Map-Reduce: Win -or- Epic Win

Brief History of Google

Google: 1998
44 disk drives

366 GB total storage


Page 6: Map-Reduce: Win -or- Epic Win

Traditional Design Principles

- If big enough, supercomputer processes work
- Use desktop CPUs, just a lot more of them
- But it also provides huge bandwidth to memory
- Equivalent to many machines’ bandwidth at once
- But supercomputers are VERY, VERY expensive
- Maintenance also expensive once machine bought
- But do get something: high quality == low downtime
- Safe, expensive solution to very large problems

Page 7: Map-Reduce: Win -or- Epic Win

Why Trade Money for Safety?


Page 9: Map-Reduce: Win -or- Epic Win

How Was Search Performed?

http://www.yahoo.com/search?p=pager

DNS


Page 11: Map-Reduce: Win -or- Epic Win

How Was Search Performed?

http://www.yahoo.com/search?p=pager

DNS

http://209.191.122.70

Page 12: Map-Reduce: Win -or- Epic Win

How Was Search Performed?

DNS

http://209.191.122.70/search?p=pager
http://www.yahoo.com/search?p=pager


Page 14: Map-Reduce: Win -or- Epic Win

Google’s Big Insight

Performing search is “embarrassingly parallel”

- No need for supercomputer and all that expense
- Can instead do this using lots & lots of desktops
- Identical effective bandwidth & performance

Page 15: Map-Reduce: Win -or- Epic Win

Google’s Big Insight

Performing search is “embarrassingly parallel”

- No need for supercomputer and all that expense
- Can instead do this using lots & lots of desktops
- Identical effective bandwidth & performance
- But problem is desktop machines unreliable
- Budget for 2 replacements, since machines cheap
- Just expect failure; software provides quality



Page 18: Map-Reduce: Win -or- Epic Win

A Brief History of Google

Google: 2012
?0,000 total servers
??? PB total storage

Page 19: Map-Reduce: Win -or- Epic Win

How Is Search Performed Now?

http://209.85.148.100/search?q=android

Page 20: Map-Reduce: Win -or- Epic Win

How Is Search Performed Now?

http://209.85.148.100/search?q=android

Spell Checker
Ad Server
Document Servers (TB)
Index Servers (TB)


Page 24: Map-Reduce: Win -or- Epic Win

Google’s Processing Model

Buy cheap machines & prepare for worst

- Machines going to fail, but still cheaper approach
- Important steps keep whole system reliable
- Replicate data so that information losses limited
- Move data freely so can always rebalance loads

These decisions lead to many other benefits

- Scalability helped by focus on balancing
- Search speed improved; performance much better
- Utilize resources fully, since search demand varies

Page 25: Map-Reduce: Win -or- Epic Win

Heterogeneous processing

By buying cheapest computers, variances are high

- Programs must handle homogeneous & heterogeneous systems
- Centralized workqueue helps with different speeds

Page 26: Map-Reduce: Win -or- Epic Win

Heterogeneous processing

By buying cheapest computers, variances are high

- Programs must handle homogeneous & heterogeneous systems
- Centralized workqueue helps with different speeds

This process also leads to a few small downsides:

- Space
- Power consumption
- Cooling costs

Page 27: Map-Reduce: Win -or- Epic Win

Complexity at Google

Page 28: Map-Reduce: Win -or- Epic Win

Complexity at Google

Avoid this nightmare using abstractions

Page 29: Map-Reduce: Win -or- Epic Win

Google Abstractions

- Google File System: handles replication to provide scalability & durability
- BigTable: manages large relational data sets
- Chubby: distributed locking service (gonna skip past that joke)
- MapReduce: if job fits, easy parallelism possible without much work


Page 31: Map-Reduce: Win -or- Epic Win

Remember Google’s Problem

Page 32: Map-Reduce: Win -or- Epic Win

MapReduce Overview

Programming model makes details simple

- Automatic parallelization & load balancing
- Network and disk I/O optimization
- Robust performance even if machines fail


Page 35: Map-Reduce: Win -or- Epic Win

MapReduce Overview

Programming model provides good Façade

- Automatic parallelization & load balancing
- Network and disk I/O optimization
- Robust performance even if machines fail

Idea came from 2 Lisp (functional) primitives

- Map: process each entry in list using some function
- Reduce: recombines data using given function
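The two primitives behave the same way in most modern languages; a minimal illustration in Python, using the built-in map and functools.reduce:

```python
from functools import reduce

# Map: process each entry in the list using some function
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# Reduce: recombine the data using a given function
total = reduce(lambda acc, x: acc + x, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```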

Page 36: Map-Reduce: Win -or- Epic Win

Typical MapReduce problem

1. Read lots and lots of data (e.g., TBs)
2. Map: extract important data from each entry in input
3. Combine Maps and sort entries by key
4. Reduce: process each key’s entries to get result for that key
5. Output final result & watch money roll in


Page 39: Map-Reduce: Win -or- Epic Win

Typical MapReduce problem

1. Read lots and lots of data (e.g., TBs)
2. Map: extract important data from each entry in input
3. Combine Maps and sort entries by key
4. Reduce: process each key’s entries to get result for that key
5. Output final result & watch money roll in

Template method always the same; just the hook methods change
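A minimal sketch of that template method in Python (single-machine and illustrative only; map_fn and reduce_fn are the hook methods, not Google's actual API):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Template method: the outline is fixed; only the hooks vary
    intermediate = []
    for key, value in inputs:                     # 1. read the data
        intermediate.extend(map_fn(key, value))   # 2. map each entry
    groups = defaultdict(list)
    for key, value in intermediate:               # 3. combine & sort by key
        groups[key].append(value)
    return {key: reduce_fn(key, values)           # 4-5. reduce & output
            for key, values in sorted(groups.items())}

# Word frequency is just one choice of hooks
counts = map_reduce([("page1", "a b a")],
                    lambda k, text: [(w, 1) for w in text.split()],
                    lambda k, values: sum(values))
print(counts)  # {'a': 2, 'b': 1}
```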

Page 40: Map-Reduce: Win -or- Epic Win

Pictorial View of MapReduce


Page 42: Map-Reduce: Win -or- Epic Win

Ex: Count Word Frequencies

Processes files separately & count word freq. in each

Map: Key=URL, Value=text on page
Emits: Key’=word, Value’=count (one pair per word occurrence)


Page 44: Map-Reduce: Win -or- Epic Win

Ex: Count Word Frequencies

In shuffle step, Maps combined & entries sorted by key; Reduce combines key’s results to compute final output

Reduce input:

Key’=“to” Value’=“1”
Key’=“be” Value’=“1”
Key’=“or” Value’=“1”
Key’=“not” Value’=“1”
Key’=“to” Value’=“1”
Key’=“be” Value’=“1”

Reduce output:

Key’’=“to” Value’’=“2”
Key’’=“be” Value’’=“2”
Key’’=“or” Value’’=“1”
Key’’=“not” Value’’=“1”

Page 45: Map-Reduce: Win -or- Epic Win

Word Frequency Pseudo-code

Map(String input_key, String input_values) {
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");
  }
}

Reduce(String key, Iterator intermediate_values) {
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);
}
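The same logic as runnable Python, with the shuffle step done by hand; EmitIntermediate/Emit are modeled by returned values, a single-machine sketch rather than the distributed runtime:

```python
from collections import defaultdict

def word_count_map(input_key, input_values):
    # Emit (word, "1") for every word on the page
    return [(w, "1") for w in input_values.split()]

def word_count_reduce(key, intermediate_values):
    # Sum the per-occurrence counts for one word
    return sum(int(v) for v in intermediate_values)

# Shuffle: combine map outputs and group entries by key
groups = defaultdict(list)
for k, v in word_count_map("url1", "to be or not to be"):
    groups[k].append(v)

result = {k: word_count_reduce(k, vs) for k, vs in groups.items()}
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```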


Page 47: Map-Reduce: Win -or- Epic Win

Ex: Build Search Index

Processes files separately & record words found on each. To get search Map, combine key’s results in Reduce.

Map: Key=URL, Value=text on page
Emits: Key’=word, Value’=URL

Reduce: Key’=word, Value’=URL
Emits: Key=word, Value=URLs with word

Page 48: Map-Reduce: Win -or- Epic Win

Search Index Pseudo-code

Map(String input_key, String input_values) {
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);
  }
}

Reduce(String key, Iterator intermediate_values) {
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.addLast(v);
  }
  Emit(result);
}
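A runnable Python equivalent; note it deduplicates URLs with a set, where the slide's pseudo-code simply appends every occurrence:

```python
from collections import defaultdict

def index_map(input_key, input_values):
    # Emit (word, URL) for every word on the page
    return [(w, input_key) for w in input_values.split()]

def index_reduce(key, intermediate_values):
    # Collect every URL on which this word appeared
    return sorted(set(intermediate_values))

pages = [("url1", "to be or not"), ("url2", "not a problem")]

# Shuffle: group (word, URL) pairs by word
groups = defaultdict(list)
for url, text in pages:
    for k, v in index_map(url, text):
        groups[k].append(v)

index = {k: index_reduce(k, vs) for k, vs in groups.items()}
print(index["not"])  # ['url1', 'url2']
```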

Page 49: Map-Reduce: Win -or- Epic Win

Ex: Page Rank Computation

Google’s algorithm for ranking pages’ relevance


Page 51: Map-Reduce: Win -or- Epic Win

Ex: Page Rank Computation

Map: Key=<URL, rank>, Value=links on page
Emits, for each link on page: Key’=link on page, Value’=<URL, rank/N>

Reduce: gathers each URL’s incoming <src, rank/N> entries and sums them to get that page’s new rank
Emits: Key=<URL, rank>, Value=links on page

Repeat entire process (e.g., input Reduce results back into Map) until page ranks stabilize (sum of changes to the ranks drops below some threshold)
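One round of that loop can be sketched in Python (simplified: no damping factor, every page assumed to have outgoing links, and a fixed iteration count standing in for the stability test):

```python
from collections import defaultdict

def pagerank_map(url, rank, links):
    # Each page sends rank/N to each of the N pages it links to
    return [(dest, rank / len(links)) for dest in links]

def pagerank_reduce(url, contributions):
    # A page's new rank is the sum of its incoming rank/N shares
    return sum(contributions)

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {url: 1.0 for url in graph}

for _ in range(50):  # fixed cutoff instead of a convergence threshold
    contribs = defaultdict(list)
    for url, links in graph.items():
        for dest, share in pagerank_map(url, ranks[url], links):
            contribs[dest].append(share)
    ranks = {url: pagerank_reduce(url, contribs[url]) for url in graph}

print(ranks)
```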


Page 53: Map-Reduce: Win -or- Epic Win

Advanced MapReduce Ideas

How to implement? One master, many workers

- Split input data into tasks where each task size fixed
- Will also be partitioning reduce phase into tasks
- Dynamically assign tasks to workers during each step
- Tasks assigned as needed & placed in in-process list
- Once worker completes task, save result & retire task
- Assume a worker crashed if its task is not complete in time
- Move incomplete tasks back into pool for reassignment


Page 55: Map-Reduce: Win -or- Epic Win

Advanced MapReduce Ideas

How to implement? One invoker, many commands

- Split input data into tasks where each task size fixed
- Will also be partitioning reduce phase into tasks
- Dynamically assign tasks to workers during each step
- Tasks assigned as needed & placed in in-process list
- Once worker completes task, save result & retire task
- Assume a worker crashed if its task is not complete in time
- Move incomplete tasks back into pool for reassignment
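A single-machine sketch of that master's bookkeeping in Python (class and method names are illustrative, not Google's actual API):

```python
import time
from collections import deque

class Master:
    """Hands out fixed-size tasks; tasks not completed within the
    timeout are assumed lost to a crash and returned to the pool."""

    def __init__(self, tasks, timeout=5.0):
        self.pool = deque(tasks)   # tasks waiting for assignment
        self.in_process = {}       # task -> time it was handed out
        self.results = {}          # retired tasks and their results
        self.timeout = timeout

    def assign(self):
        # Dynamically hand the next task to whichever worker asks
        self.reclaim_stale()
        if not self.pool:
            return None
        task = self.pool.popleft()
        self.in_process[task] = time.monotonic()
        return task

    def complete(self, task, result):
        # Once a worker completes a task, save result & retire it
        if self.in_process.pop(task, None) is not None:
            self.results[task] = result

    def reclaim_stale(self):
        # Assume a worker crashed if its task is not complete in time
        now = time.monotonic()
        for task, started in list(self.in_process.items()):
            if now - started > self.timeout:
                del self.in_process[task]
                self.pool.append(task)

master = Master(["split-1", "split-2"])
t = master.assign()
master.complete(t, "map output 1")
print(master.results)  # {'split-1': 'map output 1'}
```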