MapReduce

Everything Data, CompSci 290.01, Spring 2014 (Duke University)


Page 1: MapReduce - Duke University

Map-Reduce

Everything Data, CompSci 290.01, Spring 2014

Page 2

Announcements (Thu. Feb 27)

• Homework #8 will be posted by noon tomorrow.
• Project deadlines:
  – 2/25: Project team formation
  – 3/4: Project proposal is due
  – 3/4: 2-minute presentation in class

Page 3


2,161,530,000,000 searches in 2013

Page 4


131,000,000 pages mentioning Einstein

Size of the entire corpus??

Page 5


http://www.worldwidewebsize.com/

Size of the entire corpus??

Page 6

Trend 1: Data centers

http://designtaxi.com/news/353991/Where-The-Internet-Lives-A-Glimpse-Inside-Google-s-Private-Data-Centers/

Page 7

Trend 2: Multicore

Copyright © National Academy of Sciences. All rights reserved.

From The Future of Computing Performance: Game Over or Next Level?, "Power Is Now Limiting Growth in Computing Performance," p. 91:

expected performance. One might think that it should therefore be possible to continue to scale performance by doubling the number of processor cores. And, in fact, since the middle 1990s, some researchers have argued that chip multiprocessors (CMPs) can exploit capabilities of CMOS technology more effectively than single-processor chips.21 However, during the 1990s, the performance of single processors continued to scale at the rate of more than 50 percent per year, and power dissipation was still not a limiting factor, so those efforts did not receive wide attention. As single-processor performance scaling slowed down and the air-cooling power-dissipation limit became a major design constraint, researchers and industry shifted toward CMPs or multicore microprocessors.22

21 Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang, 1996, The case for a single-chip multiprocessor, Proceedings of 7th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Mass., October 1-5, 1996, pp. 2-11.

22 Listed here are some of the references that document, describe, and analyze this shift: Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal, 2004, Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ILP and streams, Proceedings of the 31st Annual International Symposium on Computer Architecture, Munich, …

FIGURE 3.3: Microprocessor clock frequency (MHz) over time, 1985-2010. Source: http://www.nap.edu/catalog.php?record_id=12980


Moore's Law: # transistors on integrated circuits doubles every 2 years

Page 8

Need to think "parallel"

• Data resides on different machines
• Split computation onto different machines/cores

Page 9

But … parallel programming is hard!

Low-level code needs to deal with a lot of issues …
• Failures
  – Loss of computation
  – Loss of data
• Concurrency
• …

Page 10

Map-Reduce = Programming Model + Distributed System

Programming Model:
• Simple model
• Programmer only describes the logic

Distributed System:
• Works on commodity hardware
• Scales to thousands of machines
• Ships code to the data, rather than shipping data to the code
• Hides all the hard systems problems from the programmer: machine failures, data placement, …

Page 11

Map-Reduce Programming Model

[Diagram: a map function f applied to each row of the dataset in parallel, feeding a single reduce function g.]

Page 12

Map-Reduce Programming Model

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

[Diagram: f applied to each input row by the mappers; g applied to each group of values by the reducers.]

Page 13

Example 1: Word Count

• Input: A set of documents, each containing a list of words
  – Each document is a row
  – E.g., search queries, tweets, reviews, etc.
• Output: A list of pairs <w, c>, where c is the number of times w appears across all documents.

Page 14

Word Count: Map

<docid, {list of words}> → {list of <word, 1>}

• The mapper takes a document d and creates n key-value pairs, one for each word in the document.
• The output key is the word.
• The output value is 1 (one count per appearance of a word).

Page 15

Word Count: Reduce

<word, {list of counts}> → <word, sum(counts)>

• The reducer aggregates the counts (in this case, all 1's) associated with a specific word.

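The map and reduce steps above can be sketched in plain Python. This is an illustrative single-process simulation, not a distributed implementation: the `shuffle` function stands in for the grouping the framework performs across machines.

```python
from collections import defaultdict

def wc_map(docid, words):
    # Map: emit <word, 1> for every word occurrence in the document.
    for w in words:
        yield (w, 1)

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def wc_reduce(word, counts):
    # Reduce: sum the 1's associated with each word.
    return (word, sum(counts))

docs = [(1, ["the", "cat"]), (2, ["the", "dog", "the"])]
mapped = [pair for docid, words in docs for pair in wc_map(docid, words)]
counts = dict(wc_reduce(w, cs) for w, cs in shuffle(mapped))
# counts == {"the": 3, "cat": 1, "dog": 1}
```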

Page 16

Map-Reduce Implementation

Split → Map Phase (per-record computation) → Shuffle → Reduce Phase (global computation)

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

Page 17

Map-Reduce Implementation

Split → Map Phase (per-record computation) → Shuffle → Reduce Phase (global computation)

The Split phase partitions the data across different mappers (… think different machines).

Page 18

Map-Reduce Implementation

Split → Map Phase (per-record computation) → Shuffle → Reduce Phase (global computation)

Each mapper executes user-defined map code on the partitions in parallel.

Page 19

Map-Reduce Implementation

Split → Map Phase (per-record computation) → Shuffle → Reduce Phase (global computation)

Data is shuffled such that there is one reducer per output key (… again, think different machines).

Page 20

Map-Reduce Implementation

Split → Map Phase (per-record computation) → Shuffle → Reduce Phase (global computation)

Each reducer executes the user-defined reduce code in parallel.

Page 21

Map-Reduce Implementation

• After every map and reduce phase, data is written to disk.
  – If machines fail during the reduce phase, there is no need to rerun the mappers.

Writing to disk is slow. Minimize the number of map-reduce phases.

Page 22

Mappers, Reducers and Workers

• Physical machines are called workers.
• Multiple mappers and reducers can run on the same worker.
• More workers implies …
  … more parallelism (faster computation) …
  … but more (communication) overhead …

Page 23

Map-Reduce Implementation

• All reducers start only after all the mappers complete.
• Straggler: a mapper or reducer that takes a long time.

Page 24

Back to Word Count

• Map: <docid, {list of words}> → {list of <word, 1>}
• Reduce: <word, {list of counts}> → <word, sum(counts)>

• The number of records output by the map phase equals the number of words across all documents.

Page 25

Map-Combine-Reduce

• A combiner is a mini-reducer within each mapper.
  – Helps when the reduce function is commutative and associative.
• Aggregation within each mapper reduces the communication cost.

Page 26

Word Count … with combiner

• Map: <docid, {list of words}> → {list of <word, 1>}
• Combine: <word, {list of counts}> → <word, sum(counts)>
• Reduce: <word, {list of counts}> → <word, sum(counts)>


(The Map and Combine steps run inside each mapper; the Reduce step runs in the reducer.)

Page 27

Word Count … in Python

(using one of the many MapReduce libraries for Python)

Page 28

Example 2: K most frequent words

• Needs multiple Map-Reduce steps.

• Map: <docid, {list of words}> → {list of <word, 1>}
• Reduce: <word, {list of counts}> → <_, (word, sum(counts))>
• Reduce: <_, {list of (word, count) pairs}> → <_, {the k words with the largest counts}>

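The second Reduce step can be sketched in Python: a single reducer sees every (word, count) pair under the dummy key, so the selection of the k most frequent words is global (data and names are illustrative):

```python
import heapq

def topk_reduce(pairs, k):
    # Second reduce: all (word, count) pairs share one dummy key,
    # so one reducer can pick the k most frequent words globally.
    return heapq.nlargest(k, pairs, key=lambda wc: wc[1])

counts = [("the", 12), ("cat", 3), ("dog", 7), ("a", 9)]
top2 = topk_reduce(counts, 2)
# top2 == [("the", 12), ("a", 9)]
```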

Page 29

Example 3: Distributed Grep

• Input: a string (the pattern to match)
• Output: {list of lines that match the string}

• Map: <lineid, line> → <lineid, line>  // emit only if the line matches the string
• Reduce: // do nothing (identity)

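Since the reduce step does nothing, distributed grep is just a parallel filter in the map step. A Python sketch (lines and pattern are illustrative):

```python
import re

def grep_map(lineid, line, pattern):
    # Map: emit the line unchanged only if it matches the pattern.
    if re.search(pattern, line):
        yield (lineid, line)

lines = [(1, "error: disk full"), (2, "all good"), (3, "fatal error")]
matches = [kv for lid, ln in lines for kv in grep_map(lid, ln, "error")]
# matches == [(1, "error: disk full"), (3, "fatal error")]
```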

Page 30

Example 4: Matrix Multiplication

M × N = P

Excerpt from Chapter 2, "Map-Reduce and the New Software Stack" (Section 2.3.9):

2.3.9 Matrix Multiplication

If M is a matrix with element mij in row i and column j, and N is a matrix with element njk in row j and column k, then the product P = MN is the matrix P with element pik in row i and column k, where

pik = Σj mij njk

It is required that the number of columns of M equals the number of rows of N, so the sum over j makes sense.

We can think of a matrix as a relation with three attributes: the row number, the column number, and the value in that row and column. Thus, we could view matrix M as a relation M(I, J, V), with tuples (i, j, mij), and we could view matrix N as a relation N(J, K, W), with tuples (j, k, njk). As large matrices are often sparse (mostly 0's), and since we can omit the tuples for matrix elements that are 0, this relational representation is often a very good one for a large matrix. However, it is possible that i, j, and k are implicit in the position of a matrix element in the file that represents it, rather than written explicitly with the element itself. In that case, the Map function will have to be designed to construct the I, J, and K components of tuples from the position of the data.

The product MN is almost a natural join followed by grouping and aggregation. That is, the natural join of M(I, J, V) and N(J, K, W), having only attribute J in common, would produce tuples (i, j, k, v, w) from each tuple (i, j, v) in M and tuple (j, k, w) in N. This five-component tuple represents the pair of matrix elements (mij, njk). What we want instead is the product of these elements, that is, the four-component tuple (i, j, k, v × w), because that represents the product mij njk. Once we have this relation as the result of one map-reduce operation, we can perform grouping and aggregation, with I and K as the grouping attributes and the sum of V × W as the aggregation. That is, we can implement matrix multiplication as the cascade of two map-reduce operations, as follows. First:

The Map Function: For each matrix element mij, produce the key-value pair (j, (M, i, mij)). Likewise, for each matrix element njk, produce the key-value pair (j, (N, k, njk)). Note that M and N in the values are not the matrices themselves. Rather, they are names of the matrices or (as we mentioned for the similar Map function used for natural join) better, a bit indicating whether the element comes from M or N.

The Reduce Function: For each key j, examine its list of associated values. For each value that comes from M, say (M, i, mij), and each value that comes …

Page 31

Matrix Multiplication

• Assume the input format is <matrix id, row, col, entry>
• E.g.: <M, i, j, mij>

• Map:
  <_, (M, i, j, mij)> → <(i,k), (M, j, mij)>  … for all k
  <_, (N, j, k, njk)> → <(i,k), (N, j, njk)>  … for all i
• Reduce:
  <(i,k), {(M, j, mij), (N, j, njk), …}> → <(i,k), Σj mij njk>

(The values carry j so the reducer can match each mij with the corresponding njk before summing.)

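This one-phase scheme can be simulated in plain Python. The in-memory grouping below stands in for the shuffle, and the 2×2 matrices are illustrative; each emitted value carries j so the reducer can pair mij with njk:

```python
from collections import defaultdict

def mm_map(record, n_rows_M, n_cols_N):
    # record: (matrix id, row, col, entry)
    name, r, c, v = record
    if name == "M":                      # m_ij -> <(i,k), (M, j, m_ij)> for all k
        i, j = r, c
        for k in range(n_cols_N):
            yield ((i, k), ("M", j, v))
    else:                                # n_jk -> <(i,k), (N, j, n_jk)> for all i
        j, k = r, c
        for i in range(n_rows_M):
            yield ((i, k), ("N", j, v))

def mm_reduce(key, values):
    # Match M- and N-values on j, then sum the products: p_ik = sum_j m_ij * n_jk
    m = {j: v for name, j, v in values if name == "M"}
    n = {j: v for name, j, v in values if name == "N"}
    return key, sum(m[j] * n[j] for j in m if j in n)

# M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]], stored as <matrix id, row, col, entry>
records = ([("M", i, j, v) for i, row in enumerate([[1, 2], [3, 4]]) for j, v in enumerate(row)]
           + [("N", j, k, v) for j, row in enumerate([[5, 6], [7, 8]]) for k, v in enumerate(row)])

groups = defaultdict(list)               # stand-in for the shuffle
for rec in records:
    for key, val in mm_map(rec, n_rows_M=2, n_cols_N=2):
        groups[key].append(val)

P = dict(mm_reduce(k, vs) for k, vs in groups.items())
# P == {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}, i.e. P = MN
```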

Page 32

Example 5: Join two tables

• Input: file1 and file2, with schema <key, value>
• Output: keys appearing in both files.

• Map:
  <_, (file1, key, value)> → (key, file1)
  <_, (file2, key, value)> → (key, file2)
• Reduce:
  <key, {list of fileids}> → <_, key>  // if the list contains both file1 and file2

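A plain-Python sketch of this reduce-side intersection (data is illustrative; the dictionary grouping stands in for the shuffle):

```python
from collections import defaultdict

def join_map(fileid, key, value):
    # Map: only the key and the source file matter for the intersection.
    yield (key, fileid)

def join_reduce(key, fileids):
    # Reduce: emit the key only if it appeared in both files.
    if {"file1", "file2"} <= set(fileids):
        yield key

file1 = [("a", 1), ("b", 2)]
file2 = [("b", 9), ("c", 3)]

groups = defaultdict(list)               # stand-in for the shuffle
for fid, recs in (("file1", file1), ("file2", file2)):
    for k, v in recs:
        for key, val in join_map(fid, k, v):
            groups[key].append(val)

common = [k for key, fids in groups.items() for k in join_reduce(key, fids)]
# common == ["b"]
```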

Page 33

Map-side Join

• Suppose file2 is very small …
• Send the contents of file2 to all the mappers.

• Map:
  <_, (file1, key, value, {keys in file2})> → (_, key)  // if key is also in file2
• Reduce: // do nothing
  <_, {list of keys}> → <_, {list of keys}>
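A Python sketch of the map-side join (data is illustrative): because file2's keys were broadcast to every mapper, the join is a local lookup and no shuffle or reduce is needed.

```python
def mapside_join_map(file1_records, file2_keys):
    # file2 is small, so its key set is shipped to every mapper;
    # the join becomes a per-record membership test.
    small = set(file2_keys)
    for key, value in file1_records:
        if key in small:
            yield key

file1 = [("a", 1), ("b", 2), ("c", 3)]
file2_keys = ["b", "c", "z"]
matched = list(mapside_join_map(file1, file2_keys))
# matched == ["b", "c"]
```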

Page 34

Example 6: Join 3 tables

• Input: 3 tables
  – User (id: int, age: int)
  – Page (url: varchar, category: varchar)
  – Log (userid: int, url: varchar)
• Output: ages of users and the categories of the urls they clicked.

Page 35

Multiway Join

• Join(Page, Join(User, Log))

Step 1:
• Map:
  <_, (User, id, age)> → (id, (User, id, age))
  <_, (Log, userid, url)> → (userid, (Log, userid, url))
• Reduce:
  <id, {list of records}> → <_, {records from User} × {records from Log}>
  // if the list contains records from both User and Log

Step 2:
• Map:
  <_, (User⋈Log, id, age, userid, url)> → (url, (User⋈Log, id, age, userid, url))
  <_, (Page, url, category)> → (url, (Page, url, category))
• Reduce:
  <url, {list of records}> → <_, {records from User⋈Log} × {records from Page}>
  // if the list contains records from both User⋈Log and Page
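The two-step cascade can be sketched with one generic reduce-side equi-join helper; each call to `join_on` stands for one map-reduce job (tables and column positions are illustrative):

```python
from collections import defaultdict

def join_on(left_key, left, right_key, right):
    # Reduce-side equi-join: group both inputs by join key (the "shuffle"),
    # then take the cross product of matching records per key (the "reduce").
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[left_key(rec)][0].append(rec)
    for rec in right:
        groups[right_key(rec)][1].append(rec)
    return [l + r for ls, rs in groups.values() for l in ls for r in rs]

user = [(1, 25), (2, 40)]                          # (id, age)
log = [(1, "a.com"), (1, "b.com"), (2, "a.com")]   # (userid, url)
page = [("a.com", "news"), ("b.com", "sports")]    # (url, category)

# Step 1: Join(User, Log) on the user id.
user_log = join_on(lambda u: u[0], user, lambda l: l[0], log)
# Step 2: Join(Page, User x Log) on the url.
result = join_on(lambda ul: ul[3], user_log, lambda p: p[0], page)

ages_and_categories = {(r[1], r[5]) for r in result}
# {(25, "news"), (25, "sports"), (40, "news")}
```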

Page 36

Summary

• Map-Reduce is a programming model + a distributed system implementation that makes parallel programming easy.
  – The user does not need to worry about systems issues.
• A computation is a series of Map and Reduce jobs.
  – Parallelism is achieved within each Map and Reduce phase.