
Page 1:

MapReduce and Beyond

Steve Ko

Page 2:

Trivia Quiz: What's Common?

Data-intensive computing with MapReduce!

Page 3:

What is MapReduce?

• A system for processing large amounts of data
• Introduced by Google in 2004
• Inspired by map & reduce in Lisp
• Open-source implementation: Hadoop by Yahoo!
• Used by many, many companies
  – A9.com, AOL, Facebook, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc.

Page 4:

Background: Map & Reduce in Lisp

• Sum of squares of a list (in Lisp)
• (map square '(1 2 3 4))
  – Output: (1 4 9 16) [processes each record individually]

[Diagram: f applied independently to each element 1, 2, 3, 4, producing 1, 4, 9, 16]
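As a quick aside, the same idea in Python, whose built-in map likewise applies a function to each record independently:

    # Map: apply a function to each record on its own.
    def square(x):
        return x * x

    print(list(map(square, [1, 2, 3, 4])))  # prints [1, 4, 9, 16]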

Page 5:

Background: Map & Reduce in Lisp

• Sum of squares of a list (in Lisp)
• (reduce + '(1 4 9 16))
  – (+ 16 (+ 9 (+ 4 1)))
  – Output: 30 [processes the set of all records in a batch]

[Diagram: f folds the initial value 1 with 4, 9, and 16 in turn, accumulating 5, 14, and finally the returned value 30]
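And the corresponding fold in Python; functools.reduce combines the whole batch of records into a single value:

    from functools import reduce
    import operator

    # Reduce: combine the entire batch of records into one result.
    print(reduce(operator.add, [1, 4, 9, 16]))  # prints 30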

Page 6:

Background: Map & Reduce in Lisp

• Map – processes each record individually
• Reduce – processes (combines) the set of all records in a batch

Page 7:

What Google People Have Noticed

• Keyword search
  – Find a keyword in each web page individually, and if it is found, return the URL of the web page (Map)
  – Combine all results (URLs) and return them (Reduce)
• Count of the # of occurrences of each word
  – Count the # of occurrences in each web page individually, and return the list of <word, #> (Map)
  – For each word, sum up (combine) the count (Reduce)
• Notice the similarities?

Page 8:

What Google People Have Noticed

• Lots of storage + compute cycles nearby
• Opportunity
  – Files are distributed already! (GFS)
  – A machine can process its own web pages (map)

[Diagram: racks of machines, each with its own CPUs and disks]

Page 9:

Google MapReduce

• Took the concept from Lisp and applied it to large-scale data processing
• Takes two functions from a programmer (map and reduce) and performs three steps
• Map
  – Runs map for each file individually, in parallel
• Shuffle
  – Collects the output from all map executions
  – Transforms the map output into the reduce input
  – Divides the map output into chunks
• Reduce
  – Runs reduce (using a map output chunk as the input) in parallel

Page 10:

Programmer's Point of View

• Programmer writes two functions – map() and reduce()
• The programming interface is fixed
  – map(in_key, in_value) -> list of (out_key, intermediate_value)
  – reduce(out_key, list of intermediate_value) -> (out_key, out_value)
• Caution: not exactly the same as Lisp
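To make the fixed interface concrete, here is a small Python sketch (an illustration only, not the Google or Hadoop API) that writes word count against it:

    # map(in_key, in_value) -> list of (out_key, intermediate_value)
    def word_count_map(in_key, in_value):
        # in_key = a URL, in_value = the page contents
        return [(word, 1) for word in in_value.split()]

    # reduce(out_key, list of intermediate_value) -> (out_key, out_value)
    def word_count_reduce(out_key, values):
        # out_key = a word, values = the list of intermediate counts
        return (out_key, sum(values))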

Page 11:

Inverted Indexing Example

• Word -> list of web pages containing the word

Input: web pages    Output: word -> URLs

every ->
    http://m-w.com/...
    http://answers....
    ...
its ->
    http://itsa.org/...
    http://youtube...
    ...

Page 12:

Map

• Interface
  – Input: <in_key, in_value> pair => <url, content>
  – Output: list of intermediate <key, value> pairs => list of <word, url>

Map Input: <url, content>

key = http://url0.com
value = "every happy family is alike."

key = http://url1.com
value = "every unhappy family is unhappy in its own way."

Map Output: list of <word, url>

<every, http://url0.com>
<happy, http://url0.com>
<family, http://url0.com>
...

<every, http://url1.com>
<unhappy, http://url1.com>
<family, http://url1.com>
...
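A minimal Python sketch of this Map step for the inverted index (the URLs and the function name are illustrative only):

    import re

    # Map for inverted indexing: emit one (word, url) pair per word on the page.
    def inverted_index_map(url, content):
        for word in re.findall(r"[a-z]+", content.lower()):
            yield (word, url)

    pairs = list(inverted_index_map("http://url0.com", "every happy family is alike."))
    # [('every', 'http://url0.com'), ('happy', 'http://url0.com'), ...]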

Page 13:

Shuffle

• MapReduce system
  – Collects outputs from all map executions
  – Groups all intermediate values by the same key

Map Output: list of <word, url>

<every, http://url0.com>
<happy, http://url0.com>
<family, http://url0.com>
...

<every, http://url1.com>
<unhappy, http://url1.com>
<family, http://url1.com>
...

Reduce Input: <word, list of urls>

every -> http://url0.com, http://url1.com
happy -> http://url0.com
unhappy -> http://url1.com
family -> http://url0.com, http://url1.com
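Shuffle is performed by the framework rather than the programmer; roughly, it behaves like a group-by-key over all map outputs, as in this sketch:

    from collections import defaultdict

    # Shuffle (simplified): group every intermediate value by its key.
    def shuffle(map_outputs):
        grouped = defaultdict(list)
        for key, value in map_outputs:
            grouped[key].append(value)
        return grouped  # e.g. {'every': ['http://url0.com', 'http://url1.com'], ...}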

Page 14:

Reduce

• Interface
  – Input: <out_key, list of intermediate_value>
  – Output: <out_key, out_value>

Reduce Input: <word, list of urls>

every -> http://url0.com, http://url1.com
happy -> http://url0.com
unhappy -> http://url1.com
family -> http://url0.com, http://url1.com

Reduce Output: <word, string of urls>

<every, "http://url0.com, http://url1.com">
<happy, "http://url0.com">
<unhappy, "http://url1.com">
<family, "http://url0.com, http://url1.com">
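Continuing the same sketch (it reuses inverted_index_map and shuffle from the previous pages), the Reduce step joins each word's URL list into one record, and the three phases can be chained locally to see the whole pipeline:

    # Reduce for inverted indexing: turn (word, list of urls) into one output record.
    def inverted_index_reduce(word, urls):
        return (word, ", ".join(urls))

    pages = {
        "http://url0.com": "every happy family is alike.",
        "http://url1.com": "every unhappy family is unhappy in its own way.",
    }
    map_outputs = [pair for url, text in pages.items()
                   for pair in inverted_index_map(url, text)]
    for word, urls in shuffle(map_outputs).items():
        print(inverted_index_reduce(word, urls))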

Page 15:

Execution Overview

[Diagram: Map phase, then Shuffle phase, then Reduce phase]

Page 16:

Implementing MapReduce

• Externally, for the user
  – Write a map function and a reduce function
  – Submit a job; wait for the result
  – No need to know anything about the environment (Google: 4,000 servers + 48,000 disks, many failures)
• Internally, for the MapReduce system designer
  – Run map in parallel
  – Shuffle: combine map results to produce the reduce input
  – Run reduce in parallel
  – Deal with failures

Page 17:

Execution Overview

[Diagram: a Master coordinates map workers (M) and reduce workers (R); input files are sent to map tasks, and intermediate keys are partitioned into reduce tasks that produce the output]

Page 18:

Task Assignment

[Diagram: Master, map workers (M), reduce workers (R), input splits, and output]

• Worker pull
  1. Worker signals idle
  2. Master assigns task
  3. Task retrieves data
  4. Task executes
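A toy sketch of this pull-style assignment (purely illustrative: an in-process queue stands in for the real RPCs between the master and the workers):

    import queue

    # Toy task assignment: idle workers pull work from the master's queue.
    task_queue = queue.Queue()
    for split in ["split-0", "split-1", "split-2"]:
        task_queue.put(split)                     # the master holds unassigned tasks

    def worker_loop(worker_id):
        while True:
            try:
                task = task_queue.get_nowait()    # steps 1-2: idle worker pulls its next task
            except queue.Empty:
                return                            # no work left
            data = "<contents of " + task + ">"   # step 3: task retrieves its input data
            print(worker_id, "processed", data)   # step 4: task executes

    worker_loop("worker-A")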

Page 19:

Fault-tolerance: re-execution

• Re-execute on failure

[Diagram: Master, map workers, and reduce workers; when a worker fails, its task is re-executed on another worker]

Page 20:

Machines Share Roles

• So far, a logical view of the cluster
• In reality
  – Each cluster machine stores data
  – And runs MapReduce workers
• Lots of storage + compute cycles nearby

[Diagram: every machine runs both a map worker (M) and a reduce worker (R) alongside its local storage, coordinated by the Master]

Page 21:

MapReduce Summary

• Programming paradigm for data-intensive computing
• Simple to program (for programmers)
• Distributed & parallel execution model
• The framework automates many tedious tasks (machine selection, failure handling, etc.)

Page 22:

Hadoop Demo
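The demo code itself is not in the slides; as a stand-in, here is a hedged sketch of a word-count mapper and reducer written in the Hadoop Streaming style, where each script reads records from stdin and writes tab-separated key/value lines to stdout (the file names are illustrative):

    # mapper.py: emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py: stdin arrives sorted by key, so counts for a word are adjacent
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Locally, the pipeline can be approximated with: cat input.txt | python mapper.py | sort | python reducer.py, where the sort plays the role of the Shuffle phase.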

Page 23:

Beyond MapReduce

• As a programming model
  – Limited: only Map and Reduce
  – Improvements: Pig, Dryad, Hive, Sawzall, Map-Reduce-Merge, etc.
• As a runtime system
  – Better scheduling (e.g., the LATE scheduler)
  – Better fault handling (e.g., ISS)
  – Pipelining (e.g., HOP)
  – Etc.

Page 24:

Making Cloud Intermediate Data Fault-Tolerant

Steve Ko* (Princeton University), Imranul Hoque (UIUC),
Brian Cho (UIUC), and Indranil Gupta (UIUC)

* work done at UIUC

Pages 25-28:

Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  – Dataflow programming frameworks
  – The importance of intermediate data
  – ISS (Intermediate Storage System)
• Not to be confused with:
  – International Space Station
  – IBM Internet Security Systems

Page 29:

Dataflow Programming Frameworks

• Runtime systems that execute dataflow programs
  – MapReduce (Hadoop), Pig, Hive, etc.
  – Gaining popularity for massive-scale data processing
  – Distributed and parallel execution on clusters
• A dataflow program consists of
  – Multi-stage computation
  – Communication patterns between stages

Page 30:

Example 1: MapReduce

• Two-stage computation with all-to-all communication
  – Google introduced it; Yahoo! open-sourced it (Hadoop)
  – Two functions – Map and Reduce – supplied by a programmer
  – Massively parallel execution of Map and Reduce

[Diagram: Stage 1 (Map) feeding Stage 2 (Reduce) through an all-to-all Shuffle]

Page 31:

Example 2: Pig and Hive

• Multi-stage, with either all-to-all or 1-to-1 communication

[Diagram: Stage 1 (Map) -> Shuffle (all-to-all) -> Stage 2 (Reduce) -> 1-to-1 communication -> Stage 3 (Map) -> Shuffle (all-to-all) -> Stage 4 (Reduce)]

Page 32:

Usage

• Google (MapReduce)
  – Indexing: a chain of 24 MapReduce jobs
  – ~200K jobs processing 50 PB/month (in 2006)
• Yahoo! (Hadoop + Pig)
  – WebMap: a chain of 100 MapReduce jobs
• Facebook (Hadoop + Hive)
  – ~300 TB total, adding 2 TB/day (in 2008)
  – 3K jobs processing 55 TB/day
• Amazon
  – Elastic MapReduce service (pay-as-you-go)
• Academic clouds
  – Google-IBM Cluster at UW (Hadoop service)
  – CCT at UIUC (Hadoop & Pig service)

Page 33:

One Common Characteristic

• Intermediate data
  – Intermediate data? Data between stages
• Similarities to traditional intermediate data [Bak91, Vog99]
  – E.g., .o files
  – Critical to produce the final output
  – Short-lived, written-once and read-once, & used immediately
  – Computational barrier

Page 34:

One Common Characteristic

• Computational barrier

[Diagram: the intermediate data between Stage 1 (Map) and Stage 2 (Reduce) forms a computational barrier]

Page 35:

Why Important?

• Large-scale: possibly a very large amount of intermediate data
• Barrier: loss of intermediate data => the task can't proceed

[Diagram: Stage 1 (Map) and Stage 2 (Reduce), with the intermediate data in between]

Page 36:

Failure Stats

• 5 worker deaths per MapReduce job on average (Google, 2006)
• One disk failure in every run of a 6-hour MapReduce job with 4,000 machines (Google, 2008)
• 50 machine failures out of a 20K-machine cluster (Yahoo!, 2009)

Page 37:

Hadoop Failure Injection Experiment

• Emulab setting
  – 20 machines sorting 36 GB
  – 4 LANs and a core switch (all 100 Mbps)
• 1 failure after Map
  – Re-execution of Map-Shuffle-Reduce
• ~33% increase in completion time

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec); the injected failure forces a second Map-Shuffle-Reduce pass]

Page 38:

Re-Generation for Multi-Stage

• Cascaded re-execution: expensive

[Diagram: Stages 1-4 (Map, Reduce, Map, Reduce); lost intermediate data cascades re-execution back through the earlier stages]

Page 39:

Importance of Intermediate Data

• Why?
  – (Potentially) a lot of data
  – When lost, very costly
• Current systems handle it themselves
  – Re-generate when lost: can lead to expensive cascaded re-execution
• We believe that the storage can provide a better solution than the dataflow programming frameworks

Page 40:

Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  ✓ Dataflow programming frameworks
  ✓ The importance of intermediate data
  – ISS (Intermediate Storage System)
    • Why storage?
    • Challenges
    • Solution hypotheses
    • Hypotheses validation

Page 41:

Why Storage?

• Replication stops cascaded re-execution

[Diagram: Stages 1-4 (Map, Reduce, Map, Reduce); with replicated intermediate data, a failed stage restarts without re-running earlier stages]

Page 42:

So, Are We Done?

• No!
• Challenge: minimal interference
  – The network is heavily utilized during Shuffle.
  – Replication requires network transmission too, and needs to replicate a large amount of data.
  – Minimizing interference is critical for the overall job completion time.
• HDFS (Hadoop Distributed File System): causes significant interference

Page 43:

Default HDFS Interference

• Replication of Map and Reduce outputs (2 copies in total)

[Plot: completion time (sec) for three configurations (Normal, R, MR), broken into Map, Shuffle, Reduce, and Replication time; replicating outputs through default HDFS adds significant completion time]

Page 44:

Background Transport Protocols

• TCP Nice [Ven02] & TCP-LP [Kuz06]
  – Support background & foreground flows
• Pros
  – Background flows do not interfere with foreground flows (functionality)
• Cons
  – Designed for the wide-area Internet
  – Application-agnostic
  – Not a comprehensive solution: not designed for data center replication
• Can do better!

Page 45:

Our Position

• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  ✓ Dataflow programming frameworks
  ✓ The importance of intermediate data
  – ISS (Intermediate Storage System)
    ✓ Why storage?
    ✓ Challenges
    • Solution hypotheses
    • Hypotheses validation

Page 46:

Three Hypotheses

1. Asynchronous replication can help.
   – HDFS replication works synchronously.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers (next).
3. Data selection can help (later).

Page 47:

Bandwidth Heterogeneity

• Data center topology: hierarchical
  – Top-of-the-rack switches (under-utilized)
  – Shared core switch (fully utilized)

Page 48:

Data Selection

• Only replicate locally-consumed data

[Diagram: Stages 1-4 (Map, Reduce, Map, Reduce); only the intermediate data consumed on the machine that produced it is replicated]

Page 49:

Three Hypotheses

1. Asynchronous replication can help.
2. The replication process can exploit the inherent bandwidth heterogeneity of data centers.
3. Data selection can help.

• The question is not if, but how much.
• If effective, these become techniques.

Page 50:

Experimental Setting

• Emulab with 80 machines
  – 4 x 1 LAN with 20 machines each
  – 4 x 100 Mbps top-of-the-rack switches
  – 1 x 1 Gbps core switch
  – Various configurations give similar results.
• Input data: 2 GB/machine, randomly generated
• Workload: sort
• 5 runs
  – Std. dev. ~100 sec: small compared to the overall completion time
• 2 replicas of Map outputs in total

Page 51:

Asynchronous Replication

• Modification for asynchronous replication
  – With an increasing level of interference
• Four levels of interference
  – Hadoop: original, no replication, no interference
  – Read: disk read, no network transfer, no actual replication
  – Read-Send: disk read & network send, no actual replication
  – Rep.: full replication

Page 52:

Asynchronous Replication

• Network utilization makes the difference
• Both Map & Shuffle get affected
  – Some Maps need to read remotely

[Plot: completion time (sec) broken into Map, Shuffle, and Reduce for the four interference levels (Hadoop, Read, Read-Send, Rep.); completion time grows with the interference level]

Page 53:

Three Hypotheses (Validation)

✓ Asynchronous replication can help, but still can't eliminate the interference.
• The replication process can exploit the inherent bandwidth heterogeneity of data centers.
• Data selection can help.

Page 54:

Rack-Level Replication

• Rack-level replication is effective.
  – Only 20~30 rack failures per year, mostly planned (Google, 2008)

[Plot: completion time (sec) broken into Map, Shuffle, and Reduce for three replication types (Hadoop, HDFS, Rack-Rep.); rack-level replication stays much closer to unreplicated Hadoop than default HDFS replication]

Page 55:

Three Hypotheses (Validation)

✓ Asynchronous replication can help, but still can't eliminate the interference.
✓ Rack-level replication can reduce the interference significantly.
• Data selection can help.

Page 56:

Locally-Consumed Data Replication

• It significantly reduces the amount of replication.

[Plot: completion time (sec) broken into Map, Shuffle, and Reduce for Hadoop, HDFS, and Local-Rep.; replicating only locally-consumed data keeps completion time close to unreplicated Hadoop]

Page 57:

Three Hypotheses (Validation)

✓ Asynchronous replication can help, but still can't eliminate the interference.
✓ Rack-level replication can reduce the interference significantly.
✓ Data selection can reduce the interference significantly.

Page 58:

ISS Design Overview

• Implements asynchronous, rack-level, selective replication (all three hypotheses)
• Replaces the Shuffle phase
  – MapReduce no longer implements Shuffle itself.
  – Map tasks write intermediate data to ISS, and Reduce tasks read intermediate data from ISS.
• Extends HDFS (next)

Page 59:

ISS Design Overview

• Extends HDFS
  – iss_create()
  – iss_open()
  – iss_write()
  – iss_read()
  – iss_close()
• Map tasks
  – iss_create() => iss_write() => iss_close()
• Reduce tasks
  – iss_open() => iss_read() => iss_close()
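The slides only name these calls; the sketch below illustrates the intended call sequences, with the signatures and the iss handle being assumptions rather than the real ISS interface:

    # Hypothetical illustration only: the call names come from the slide,
    # but the signatures and the "iss" object are assumed.
    def map_task_write(iss, block_name, records):
        f = iss.iss_create(block_name)              # create an intermediate block in ISS
        for key, value in records:
            iss.iss_write(f, "%s\t%s\n" % (key, value))
        iss.iss_close(f)                            # ISS can now replicate the block asynchronously

    def reduce_task_read(iss, block_name):
        f = iss.iss_open(block_name)                # open the (possibly replicated) block
        data = iss.iss_read(f)                      # survives the loss of the producing node
        iss.iss_close(f)
        return data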

Page 60:

Performance under Failure

• 5 scenarios
  – Hadoop (no rep) with one permanent machine failure
  – Hadoop (reduce rep=2) with one permanent machine failure
  – ISS (map & reduce rep=2) with one permanent machine failure
  – Hadoop (no rep) with one transient failure
  – ISS (map & reduce rep=2) with one transient failure

Page 61:

Summary of Results

• Comparison to no-failure Hadoop
  – One-failure ISS: 18% increase in completion time
  – One-failure Hadoop: 59% increase
• One-failure Hadoop vs. one-failure ISS
  – 45% speedup

Page 62:

Performance under Failure

• Hadoop (rep=1) with one machine failure

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec)]

Page 63:

Performance under Failure

• Hadoop (rep=2) with one machine failure

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec)]

Page 64:

Performance under Failure

• ISS with one machine failure

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec)]

Page 65:

Performance under Failure

• Hadoop (rep=1) with one transient failure

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec)]

Page 66:

Performance under Failure

• ISS-Hadoop with one transient failure

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec)]

Page 67:

Replication Completion Time

• Replication completes before Reduce
  – '+' indicates the replication time for each block

[Plot: number of Map, Shuffle, and Reduce tasks over time (sec), with per-block replication duration (sec) overlaid; each block's replication finishes before the Reduce phase begins]

Page 68:

Summary

• Our position
  – Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Problem: cascaded re-execution
• Requirements
  – Intermediate data availability (scale and dynamism)
  – Interference minimization (efficiency)
• Asynchronous replication can help, but still can't eliminate the interference.
• Rack-level replication can reduce the interference significantly.
• Data selection can reduce the interference significantly.
• Hadoop & Hadoop + ISS show comparable completion times.

Page 69:

References

• [Vog99] W. Vogels. File System Usage in Windows NT 4.0. In SOSP, 1999.
• [Bak91] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a Distributed File System. SIGOPS OSR, 25(5), 1991.
• [Ven02] A. Venkataramani, R. Kokku, and M. Dahlin. TCP Nice: A Mechanism for Background Transfers. In OSDI, 2002.
• [Kuz06] A. Kuzmanovic and E. W. Knightly. TCP-LP: Low-Priority Service via End-Point Congestion Control. IEEE/ACM TON, 14(4):739–752, 2006.