Big Data, Hadoop and NoSQL: (Almost) Everything you have ever wanted to know about. Yoann Pitarch.

Page 1: Big Data, Hadoop and NoSQL - IRIT (Yoann.Pitarch/Docs/CoursBigData/BigDataLesson.pdf)

Big Data, Hadoop and NoSQL
(Almost) Everything you have ever wanted to know about

Yoann Pitarch

Page 2

Agenda

1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases

Page 3

Agenda

1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases

Page 4

Introduction

J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.

Page 5

Some Definitions

• A buzzword
• "Data whose size is too large and complex to be treated (harvested, stored, analysed, spread) by usual systems"
• "Data can be considered big when traditional analysis methods fail in dealing with it"
• "A collection of data from traditional and digital sources, inside and outside a company, that represents a source for ongoing discovery and analysis"

Page 6

The 3, 4 or 5 "Vs"

• The core three: Volume, Velocity, Variety
• Possibly Veracity and/or Variability as well

Source: http://www.datasciencecentral.com/

Page 7

Digital information and usage: what has changed?

Page 8

Digital Data

[Chart: global information storage capacity, 1986-2007, in exabytes: 3 (1986), 16 (1993), 54 (2000), 295 (2007). The digital share grew from about 1% in 1986 and 3% in 1993 to 25% in 2000 and 94% in 2007, while the analog share fell from 99% to 6%.]

NOTE: Numbers may not sum due to rounding.
SOURCE: Hilbert and Lopez, "The world's technological capacity to store, communicate, and compute information", Science, 2011; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.

Page 9

Storage and Computing

[Chart: world general-purpose computing capacity, 1986-2007, in 10^12 million instructions per second: <0.001 (1986), 0.004 (1993), 0.289 (2000), 6.379 (2007), broken down by device class (percentage shares shown in the original figure).]

NOTE: Numbers may not sum due to rounding.
SOURCE: Hilbert and Lopez, "The world's technological capacity to store, communicate, and compute information", Science, 2011; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.

Page 10

Data Types by Sector

[Table: penetration of data types (video, image, audio, text/numbers) across sectors: insurance, banking, communication and media, construction, education, government, health care. Penetration is rated Low, Medium or High.]

SOURCE: McKinsey Global Institute analysis; J. Manyika et al., Big data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.

Page 11

Big Data from Internet/Web 2.0

R. Kalakota, 2012

Page 12

Internet Users

• World: 2,405,518,376 internet users in June 2012 (population: 7,017,846,922)
• France: 52,228,905 internet users in June 2012
• Indexed Web: more than 8.15 billion pages (Sept. 2012)

Internet World Stats - www.internetworldstats.com/stats.htm

[Chart: internet users vs. population by region: Africa, Asia, Europe, Middle East, North America, Latin America, Oceania.]

Page 13

Social Network Penetration (by European region: North / West / South / East / Europe)

• Know at least one social network: 98% / 97% / 99% / 99% / 98%
• Member of at least one social network: 75% / 66% / 77% / 79% / 73%
• Average number of social networks: 1.8 / 1.5 / 2.2 / 1.9 / 1.9

Source: http://fr.slideshare.net/stevenvanbelleghem/social-media-around-the-world-2011

Page 14

Social Networks

• Facebook users
  – 835,525,280 (March 2012); 50% via mobile
  – 25,000,000 in France (penetration rate 38%)

Page 15

Social Networks

• Twitter users
  – 140,000,000 active users
  – 340,000,000 tweets per day
• Some stats (http://fr.slideshare.net/stevenvanbelleghem)
  – 400 million daily users (Facebook)
  – Average session: Facebook 37 min; Twitter 23 min
  – 50% follow a brand
  – 36% write about a company or a brand
  – 19% write about the company they work for

Page 16

Big Data from Science

• Astronomy
  – THEMIS solar telescope: 20 TB of raw reduced data
  – European Solar Telescope: 1/2 PB/day (2018)
  – CDS: database of all astronomical objects, related papers, ...
• Fluid analysis
  – Particle Image Velocimetry: 1 min; 1,000 images/second; 1 TB (16-bit encoding)
  – Navier-Stokes equations: grid of 10^8 nodes; 10^5 steps; 40 TB of data per variable
• Medicine
  – Radiology, videos of surgeries, reports

Page 17

Harvested by Smart Sensors

• Car industry
• Meters: water, power, ...
• Networks: communications (emails, telephone, ...)

Page 18

Submarine Cable Map

Page 19

Submarine Cable Map

• PRISM
• Who?

Page 20

Data Centers

• "Within four years, two-thirds of all data center traffic across the world, as well as workloads, will be cloud based" (Cisco, 2012)
• Global data center traffic will grow fourfold between 2011 and 2016, reaching a total of 6.6 zettabytes annually
• Global cloud traffic, the fastest-growing component of data center traffic, will grow sixfold, a 44% compound annual growth rate (CAGR), from 683 exabytes of annual traffic in 2011 to 4.3 zettabytes by 2016

Page 21

Data Centers

Global Data Center IP Traffic Growth

http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns1175/Cloud_Index_White_Paper.html

Page 22

Data Centers

Workload Distribution: 2011-2016

http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns1175/Cloud_Index_White_Paper.html

Page 23

Applications

documentation.tice.ac-orleans-tours.fr

Page 24

Applications

• Science monitoring & technical watch
  – Analysis of competitors
  – Trend detection
• Market segmentation / customer micro-segmentation
• Social network analysis
  – Opinion mining
  – Digital identity
• Reaction to failures
• Automatic and smart regulation
  – Water, electricity, ...
• Automatic monitoring of domestic equipment

Page 25

Technologies and Methods

• CS: collect, store, treat, spread
• Aggregate, analyse, visualize
• Multidisciplinary: computer science, mathematics, economics, sociology
• Methods
  – Multidimensional analysis and visualization
  – Machine learning
  – Data mining

Page 26

Methods

• Not necessarily fully new methods, but volume implies new treatment principles:
  – Large processing and storage capacity => cloud computing
  – High-performance and parallel computing => Apache Mahout (scalable machine learning and data mining)
  – Data reduction / dimensionality reduction
  – Distributed data & treatments => NoSQL; MapReduce (Apache Hadoop)

Page 27

Cloud Computing

An emerging computing paradigm where data and services reside in massively scalable data centers and can be ubiquitously accessed from any connected device over the internet.

• 4+ billion phones by 2010 (source: Nokia)
• Web 2.0-enabled PCs, TVs, etc.
• Businesses, from startups to enterprises

Credit: IBM Corp. [F. Desprez, LIP ENS Lyon / INRIA]

Page 28

Cloud Computing

• Two key concepts
  – Processing 1000x more data does not have to be 1000x harder
  – Cycles and bytes, not hardware, are the new commodity
• Cloud computing is
  – Providing services on virtual machines allocated on top of a large physical machine pool
  – A method to address scalability and availability concerns for large-scale applications
  – Democratized distributed computing

[F. Desprez, LIP ENS Lyon / INRIA]

Page 29

Amazon Elastic Compute Cloud

A set of APIs and business models which give developer-level access to Amazon's infrastructure and content:

• Data as a Service
  – Amazon E-Commerce Service
  – Amazon Historical Pricing
• Search as a Service
  – Alexa Web Information Service
  – Alexa Top Sites
  – Alexa Site Thumbnail
  – Alexa Web Search Platform
• Infrastructure as a Service
  – Amazon Simple Queue Service (messaging)
  – Amazon Simple Storage Service (storage)
  – Amazon Elastic Compute Cloud (compute)
• People as a Service
  – Amazon Mechanical Turk

Credits: Jeff Barr, Amazon

Page 30

Before Analysis: HARVESTING

Page 31

World Digital Library

http://www.wdl.org/en/

Page 32

Harvesting

• Technical issues
  – Different APIs to query
  – Different formats
• Other issues
  – Selecting sources
  – Defining queries
  – Need for homogenization

Page 33

Issues and Challenges

• Data heterogeneity
  – Formats (pages vs. tweets vs. videos)
  – Reliability: quality, validity, privacy
  – Accessibility: open data
• Techniques and technologies
  – Hardware (capacity, security)
  – Software
• Organizational
  – Various skills and competences (computer science, mathematics, economics, social science, ...)

Page 34

Agenda

1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases

Page 35

Large-Scale Data Analytics

• MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems
• Many enterprises are turning to Hadoop
  – Especially for applications generating big data
  – Web applications, social networks, scientific applications

Page 36

Why Is Hadoop Able to Compete?

Databases offer:
• Performance (mature indexing, tuning and data organization techniques)
• Features: provenance tracking, annotation management, ...

Hadoop offers:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Commodity, inexpensive hardware
• An efficient and simple fault-tolerance mechanism

Page 37

What Is Hadoop? (1)

• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  – Large datasets: terabytes or petabytes of data
  – Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model; any data will fit

Page 38

What Is Hadoop? (2)

• The Hadoop framework consists of two main layers
  – Distributed file system (HDFS)
  – Execution engine (MapReduce)

Page 39

Hadoop Master/Slave Architecture

• Hadoop is designed as a master-slave, shared-nothing architecture
  – Master node (single node)
  – Many slave nodes

Page 40

Design Principles of Hadoop

• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  – A large number of low-end, cheap machines working in parallel to solve a computing problem
• This is in contrast to parallel DBs
  – A small number of high-end, expensive machines

Page 41

Design Principles of Hadoop

• Automatic parallelization & distribution
  – Hidden from the end-user
• Fault tolerance and automatic recovery
  – Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  – Users only provide two functions, "map" and "reduce"

Page 42

Who Uses MapReduce/Hadoop?

• Google: inventors of the MapReduce computing paradigm
• Yahoo: developed Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs

Page 43

Hadoop: How It Works

Page 44

Hadoop Architecture

• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Master node (single node); many slave nodes

Page 45

Hadoop Distributed File System (HDFS)

• Centralized namenode
  – Maintains metadata info about files
• Many datanodes (1000s)
  – Store the actual data
  – Files are divided into blocks (64 MB)
  – Each block is replicated N times (default N = 3)

Example: file F is divided into blocks 1, 2, 3, 4, 5, spread across the datanodes.

Page 46

Main Properties of HDFS

• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
  – The namenode constantly checks the datanodes
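The block splitting and replication described above can be sketched in a few lines of Python. This is a toy placement model with illustrative names and round-robin placement, not HDFS's actual rack-aware policy:

```python
# Toy model of HDFS-style block splitting and replication.
# Assumptions: 64 MB blocks, replication factor 3, round-robin placement.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size cited above
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size):
    """Return the number of blocks needed for a file of `file_size` bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division

def place_replicas(num_blocks, datanodes):
    """Assign each block's replicas to datanodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5"]
n_blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file needs 5 blocks
layout = place_replicas(n_blocks, nodes)
```

With this layout, losing any single node still leaves two live replicas of every block, which is the point of the replication factor.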

Page 47

MapReduce Execution Engine (Example: Color Count)

• Input blocks on HDFS are fed to the map tasks
• Each map task parses its records and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting based on k: pairs are hashed and routed to the reducers
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1, 1, 1, 1, 1, 1, ...]), and produces (k', v') pairs, e.g. (color, 100)

Users only provide the "Map" and "Reduce" functions.

Page 48

Properties of the MapReduce Engine

• The Job Tracker is the master process (runs with the namenode)
  – Receives the user's job
  – Decides how many tasks will run (number of mappers)
  – Decides where to run each mapper (concept of locality)

Example: a file with 5 blocks leads to 5 map tasks. To run the task reading block 1, the Job Tracker tries the nodes that hold a replica of that block (e.g. Node 1 or Node 3).

Page 49

Properties of the MapReduce Engine (Cont'd)

• The Task Tracker is the slave process (runs on each datanode)
  – Receives tasks from the Job Tracker
  – Runs each task until completion (either a map or a reduce task)
  – Is always in communication with the Job Tracker, reporting progress

In this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.

Page 50

Key-Value Pairs

• Mappers and reducers are users' code (provided functions)
• They just need to obey the key-value pairs interface
• Mappers:
  – Consume <key, value> pairs
  – Produce <key, value> pairs
• Reducers:
  – Consume <key, <list of values>> pairs
  – Produce <key, value> pairs
• Shuffling and sorting:
  – A hidden phase between mappers and reducers
  – Groups all identical keys from all mappers, sorts them, and passes them to a given reducer in the form <key, <list of values>>

Page 51

MapReduce Phases

Deciding what will be the key and what will be the value is the developer's responsibility.

Page 52

Example 1: Word Count

• Job: count the occurrences of each word in a data set

[Figure: input text flows through map tasks, the shuffle phase, and reduce tasks.]
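The word-count job can be sketched in plain Python standing in for Hadoop's Java API (an in-memory simulation under the key-value interface described earlier; the sample input is illustrative):

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle & sort: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return (key, sum(values))

lines = ["big data big clusters", "big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs))
# counts == {"big": 3, "data": 2, "clusters": 1}
```

In real Hadoop the mapper and reducer run on different machines and the shuffle moves data over the network; the interface, however, is exactly this.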

Page 53

Example 2: Color Count

• Job: count the number of each color in a data set
• Input blocks on HDFS feed the map tasks; each map task parses its records and produces (k, v) pairs such as (color, 1)
• Shuffle & sorting based on k routes the pairs (via parse/hash) to the reducers
• Each reducer consumes (k, [v]) pairs such as (color, [1, 1, 1, 1, 1, 1, ...]) and produces (k', v') pairs, e.g. (color, 100)
• The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines

Page 54

Example 3: Color Filter

• Job: select only the blue and the green colors
• Input blocks on HDFS feed the map tasks; each map task selects only the blue or green colors and produces (k, v) pairs
• Each map task writes its output directly to HDFS; no reduce phase is needed
• The output file has 4 parts (Part0001 to Part0004), probably on 4 different machines
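A map-only job like this filter is even simpler to sketch: each map task processes its own block independently and writes its own output part, with no shuffle or reduce. Block contents and part names below are illustrative:

```python
# Sketch of a map-only filtering job: one output part per map task,
# no shuffle or reduce phase involved.

WANTED = {"blue", "green"}

def map_filter(block):
    """Mapper: keep only the records whose color is in the wanted set."""
    return [color for color in block if color in WANTED]

# Four input blocks -> four independent map tasks -> four output parts.
blocks = [
    ["red", "blue", "green"],
    ["yellow", "blue"],
    ["green", "green", "red"],
    ["blue"],
]
parts = {f"Part{i:04d}": map_filter(block) for i, block in enumerate(blocks, 1)}
```

Because the maps never exchange data, the number of output parts equals the number of map tasks, matching the slide's four parts on four machines.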

Page 55

Bigger Picture: Hadoop vs. Other Systems

• Computing model: distributed databases use the notion of transactions (the transaction is the unit of work, with ACID properties and concurrency control); Hadoop uses the notion of jobs (the job is the unit of work, with no concurrency control)
• Data model: distributed databases hold structured data with a known schema, in read/write mode; in Hadoop, any data will fit, in any format ((un)(semi)structured), in read-only mode
• Cost model: expensive servers vs. cheap commodity machines
• Fault tolerance: in distributed databases, failures are rare and handled by recovery mechanisms; in Hadoop, failures are common over thousands of machines, with a simple yet efficient fault-tolerance mechanism
• Key characteristics: efficiency, optimizations and fine-tuning vs. scalability, flexibility and fault tolerance

Page 56

Agenda

1. Big Data
   1. Definition
   2. Applications
2. Hadoop / MapReduce Computing Paradigm
   1. Overview
   2. How it works
3. NoSQL databases

Page 57

Introduction

• Relaxing ACID properties
• NoSQL databases
  – Categories of NoSQL databases
  – Typical NoSQL API
  – Representatives of NoSQL databases
• Conclusions

Page 58

Relaxing ACID Properties

• Atomicity, Consistency, Isolation, Durability
• In cloud computing, ACID is hard to achieve; moreover, it is not always required, e.g. for blogs, status updates, product listings, etc.
• Availability
  – Traditionally thought of as the server/process being available 99.999% of the time
  – For a large-scale node system, there is a high probability that a node is either down or that there is a network partitioning
• Partition tolerance
  – Ensures that write and read operations are redirected to available replicas when segments of the network become disconnected

Page 59

Eventual Consistency

• Eventual consistency
  – When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
  – For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• BASE (Basically Available, Soft state, Eventual consistency) properties, as opposed to ACID
  – Basically available: faults are possible, but not a fault of the whole system
  – Soft state: copies of a data item may be inconsistent
  – Eventually consistent: copies become consistent at some later time if there are no more updates to that data item

Page 60

CAP Theorem

• Consider 3 properties of a system
  – Consistency (all copies have the same value)
  – Availability (the system can run even if parts have failed)
  – Partition tolerance (the network can break into two or more parts, each with active systems that cannot influence the other parts)
• Brewer's CAP "theorem": for any system sharing data, it is impossible to guarantee all three of these properties simultaneously
• Very large systems will partition at some point
  – It is necessary to decide between C and A
  – Traditional DBMSs prefer C over A and P
  – Most Web applications choose A (except in specific applications such as order processing)


CAP Theorem

•  Drop the A or C of ACID:
   –  Relaxing C makes replication easy and facilitates fault tolerance
   –  Relaxing A reduces (or eliminates) the need for distributed concurrency control


NoSQL Databases

•  The name stands for Not Only SQL
•  Common features:
   –  Non-relational
   –  Usually do not require a fixed table schema
   –  Horizontally scalable
   –  Mostly open source
•  More characteristics:
   –  Relax one or more of the ACID properties (see CAP theorem)
   –  Replication support
   –  Easy API (if SQL, then only a very restricted variant)
•  Do not fully support relational features:
   –  No join operations (except within partitions)
   –  No referential integrity constraints across partitions


Categories of NoSQL Databases

•  Key-value stores
•  Column NoSQL databases
•  Document-based databases
•  XML databases (myXMLDB, Tamino, Sedna)
•  Graph databases (Neo4j, InfoGrid)


Key-Value Data Stores

•  Example: SimpleDB
   –  Based on Amazon’s Simple Storage Service (S3)
   –  Items (representing objects) have one or more (name, value) pairs, where name denotes an attribute
   –  An attribute can have multiple values
   –  Items are combined into domains
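That data model (domains of items, each with multi-valued attributes) can be sketched in a few lines. This is a toy in-memory stand-in, not the SimpleDB service itself; the helper names merely echo SimpleDB's PutAttributes/GetAttributes operations:

```python
from collections import defaultdict

# domain -> item name -> attribute name -> set of values
domains = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

def put_attributes(domain, item, pairs):
    """Attach (name, value) pairs to an item; an attribute may hold several values."""
    for name, value in pairs:
        domains[domain][item][name].add(value)

def get_attributes(domain, item):
    """Return all attributes of an item, multi-valued ones as sorted lists."""
    return {name: sorted(values) for name, values in domains[domain][item].items()}

put_attributes("books", "item1",
               [("author", "Knuth"), ("tag", "algorithms"), ("tag", "classic")])
# The multi-valued "tag" attribute holds both values under one name.
assert get_attributes("books", "item1")["tag"] == ["algorithms", "classic"]
```

Nothing here is relational: there is no schema, and two items in the same domain may carry completely different attribute names.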


Column-Oriented*

•  Store data in column order
•  Allow key-value pairs to be stored (and retrieved by key) in a massively parallel system
   –  Data model: families of attributes defined in a schema; new attributes can be added
   –  Storage principle: big hashed distributed tables
   –  Properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application

*  Better term: extensible records


Column-Oriented

•  Example: BigTable
   –  Indexed by row key, column key, and timestamp, i.e. (row: string, column: string, time: int64) → string
   –  Rows are ordered lexicographically by row key
   –  The row range of a table is dynamically partitioned; each row range is called a tablet
   –  Column syntax is family:qualifier

[Figure: row “mff.ksi.www” with a “Contents:” column holding <html> values at timestamps t3, t5, t6, and an “anchor:” column family with columns “anchor:cnnsi.com” (value “MFF” at t9) and “anchor:my.look.ca” (value “MFF.cz” at t8)]


A table representation of a row in BigTable

Row key          Time stamp   Column name   Column family Grandchildren
http://ksi....   t1           "Jack"        "Claire" 7
                 t2           "Jack"        "Claire" 7, "Barbara" 6
                 t3           "Jack"        "Claire" 7, "Barbara" 6, "Magda" 3
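The three-dimensional (row, column, timestamp) map behind this table can be sketched directly. This is a minimal illustration of the versioned-cell idea, not BigTable's actual storage layer; `write` and `read_latest` are invented names:

```python
import bisect

# A BigTable-style cell is addressed by (row key, column key, timestamp).
table = {}  # (row, column) -> sorted list of (timestamp, value)

def write(row, column, ts, value):
    """Append a new version of a cell, kept ordered by timestamp."""
    cell = table.setdefault((row, column), [])
    bisect.insort(cell, (ts, value))

def read_latest(row, column):
    """Return the most recent version of a cell, as a default read would."""
    cell = table.get((row, column), [])
    return cell[-1][1] if cell else None

# Re-creating the slide's example: the cell grows a new version per timestamp.
write("mff.ksi.www", "family:Grandchildren", 1, ["Claire"])
write("mff.ksi.www", "family:Grandchildren", 2, ["Claire", "Barbara"])
write("mff.ksi.www", "family:Grandchildren", 3, ["Claire", "Barbara", "Magda"])

assert read_latest("mff.ksi.www", "family:Grandchildren") == \
    ["Claire", "Barbara", "Magda"]
# Old versions remain addressable by timestamp:
assert table[("mff.ksi.www", "family:Grandchildren")][0] == (1, ["Claire"])
```

Keeping every timestamped version is what lets BigTable answer reads "as of" a past time and garbage-collect old versions lazily.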


Column-Oriented

•  Example: Cassandra
   –  Keyspace: usually the name of the application, e.g. 'Twitter', 'Wordpress'
   –  Column family: structure containing an unlimited number of rows
   –  Column: a tuple with name, value, and timestamp
   –  Key: name of a record
   –  Super column: contains more columns
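The keyspace / column family / row / column nesting can be pictured as nested maps. This is only a mental model of the hierarchy above (with invented data), not Cassandra's API or storage format; per-column timestamps are what Cassandra uses to resolve conflicting writes:

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspaces = {"Twitter": {"Users": {}}}

def insert(keyspace, column_family, key, column, value, ts=None):
    """Write one column into a row; the newer timestamp wins on conflict."""
    row = keyspaces[keyspace][column_family].setdefault(key, {})
    ts = ts if ts is not None else time.time()
    existing = row.get(column)
    if existing is None or ts >= existing[1]:
        row[column] = (value, ts)

insert("Twitter", "Users", "jsmith", "name", "John Smith", ts=1)
insert("Twitter", "Users", "jsmith", "email", "jsmith@example.com", ts=2)
# A stale write (older timestamp) is discarded:
insert("Twitter", "Users", "jsmith", "name", "J. Smith", ts=0)

# A row is simply the map of its columns:
assert keyspaces["Twitter"]["Users"]["jsmith"]["name"][0] == "John Smith"
```

Rows within one column family need not share the same set of columns, which is the "schema-free" aspect of the model.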


Document-Based

•  Based on the JSON format: a data model which supports lists, maps, dates, and Booleans, with nesting
•  Really: indexed semistructured documents
•  Example: MongoDB

   {
     Name: "Jaroslav",
     Address: "Malostranske nám. 25, 118 00 Praha 1",
     Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1"}
   }


Typical NoSQL API

•  Basic API access:
   –  get(key)
      •  Extract the value given a key
   –  put(key, value)
      •  Create or update the value given its key
   –  delete(key)
      •  Remove the key and its associated value
   –  execute(key, operation, parameters)
      •  Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
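The whole API fits in a few lines of Python. This is a toy in-memory sketch of the generic interface, not any particular product's client library; `KVStore` is an invented name:

```python
class KVStore:
    """Toy in-memory store exposing the four basic NoSQL operations."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Extract the value given a key (None if absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value given its key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its associated value."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke an operation on a structured value (list, set, map, ...),
        # e.g. appending to a list without shipping the whole value back
        # and forth between client and server.
        value = self._data[key]
        return getattr(value, operation)(*parameters)

store = KVStore()
store.put("followers:42", [])
store.execute("followers:42", "append", "user7")
assert store.get("followers:42") == ["user7"]
store.delete("followers:42")
assert store.get("followers:42") is None
```

The execute operation is what distinguishes stores like Redis from a plain hash table: the server understands the value's structure, so small mutations stay server-side.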


Representatives of NoSQL Databases: Key-Value

Name      | Producer             | Data model                                                                                                                      | Querying
SimpleDB  | Amazon               | set of couples (key, {attribute}), where attribute is a couple (name, value)                                                    | restricted SQL; select, delete, GetAttributes, and PutAttributes operations
Redis     | Salvatore Sanfilippo | set of couples (key, value), where value is a simply typed value, a list, an ordered (by ranking) or unordered set, or a hash value | primitive operations for each value type
Dynamo    | Amazon               | like SimpleDB                                                                                                                   | simple get operation and put in a context
Voldemort | LinkedIn             | like SimpleDB                                                                                                                   | similar to Dynamo


Representatives of NoSQL Databases: Column-Oriented

Name       | Producer                     | Data model                                                       | Querying
BigTable   | Google                       | set of couples (key, {value})                                    | selection (by combination of row, column, and timestamp ranges)
HBase      | Apache                       | groups of columns (a BigTable clone)                             | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable                   | like BigTable                                                    | HQL (Hypertable Query Language)
Cassandra  | Apache (originally Facebook) | columns, groups of columns corresponding to a key (supercolumns) | simple selections on key, range queries, column or column ranges
PNUTS      | Yahoo                        | (hashed or ordered) tables, typed arrays, flexible schema        | selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)


Representatives of NoSQL Databases: Document-Based

Name      | Producer    | Data model                                                                                       | Querying
MongoDB   | 10gen       | object-structured documents stored in collections; each object has a primary key called ObjectId | manipulations with objects in collections (find object(s) via simple selections and logical expressions, delete, update)
Couchbase | Couchbase¹  | document as a list of named (structured) items (JSON document)                                   | by key and key range, views via JavaScript and MapReduce

¹ after merging Membase and CouchOne


DATAKON 2011 J. Pokorný 75


Conclusions of the NoSQL Section

•  NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
•  Problems with cloud computing:
   –  SaaS applications require enterprise-level functionality, including ACID transactions, security, and other features associated with commercial RDBMS technology, i.e. NoSQL should not be the only option in the cloud
   –  Hybrid solutions:
      •  Voldemort with MySQL as one of its storage backends
      •  treat NoSQL data as semistructured data ⇒ integrate RDBMS and NoSQL via SQL/XML


Conclusions of the NoSQL Section

•  Next generation of highly scalable and elastic RDBMSs: NewSQL databases (term coined in April 2011)
   –  designed to scale out horizontally on shared-nothing machines
   –  still provide ACID guarantees
   –  applications interact with the database primarily using SQL
   –  the system employs a lock-free concurrency control scheme to avoid shutting users down
   –  the system provides higher performance than available from traditional systems
•  Examples: MySQL Cluster (most mature solution), VoltDB, Clustrix, ScalArc, …


Conclusions of the NoSQL Section

•  New buzzword: SPRAIN, 6 key factors for alternative data management:
   –  Scalability
   –  Performance
   –  Relaxed consistency
   –  Agility
   –  Intricacy
   –  Necessity
•  Conclusion in conclusions: these are exciting times for Data Management research and development


References

•  J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, 2011.
•  S. Déjean, J. Mothe, Visual clustering for data analysis and graphical user interfaces, Handbook of Cluster Analysis, 2012.
•  D. Oelke et al., Visual Opinion Analysis of Customer Feedback Data, IEEE Symposium on Visual Analytics Science and Technology, 2009.
•  D. Oelke et al., Visual Readability Analysis: How to Make Your Writings Easier to Read, IEEE Trans. on Visualization and Computer Graphics, 18(5):662-674, 2012.
•  A. Ynnerman, Visualizing the medical data explosion, TED, 2011.
•  N. Kejzar, S. Korenjak-Cerne, and V. Batagelj, Network analysis of works on clustering and classification from Web of Science, in Proceedings of IFCS'09, pages 525-536, Springer, 2010.
•  S. Déjean, P. Martin, A. Baccini, and P. Besse, Clustering time series gene expression data using smoothing spline derivatives, EURASIP Journal on Bioinformatics and Systems Biology, 2007, article ID 70561.
•  B. Dousset, Tetralogie: interactivity for competitive intelligence, 2012. http://atlas.irit.fr/PIE/Outils/Tetralogie.html
•  A. Baccini, S. Déjean, L. Lafage, and J. Mothe, How many performance measures to evaluate information retrieval systems?, Knowledge and Information Systems, pages 693-713, 2011.