Using RDBMS, NoSQL or Hadoop?


Page 1: Using RDBMS, NoSQL or Hadoop?

Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

Using RDBMS, NoSQL or Hadoop?

Jean-Pierre Dijcks, Big Data Product Management, Server Technologies

DOAG Conference 2015

Page 2: Using RDBMS, NoSQL or Hadoop?

Data Ingest

Page 3: Using RDBMS, NoSQL or Hadoop?

Ingest

[Diagram: HDFS ingest. A 256 MB "chunk" is written synchronously across Node 1, Node 2 and Node 3.]
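To make the "chunk" unit concrete, here is a minimal Python sketch (not HDFS code) that splits a file into 256 MB chunks and assigns each chunk to three replica nodes; the node names, the round-robin placement and the 600 MB example file size are illustrative assumptions.

```python
# Minimal sketch (not HDFS code): split a file into 256 MB chunks and assign
# each chunk to three replica nodes, mirroring the synchronous-write picture above.
CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB, the block size assumed on the slide
NODES = ["node1", "node2", "node3"]

def plan_chunks(file_size_bytes, replication=3):
    """Return (offset, length, replica_nodes) for every chunk of the file."""
    plan = []
    offset = 0
    while offset < file_size_bytes:
        length = min(CHUNK_SIZE, file_size_bytes - offset)
        # Round-robin placement for the sketch; real HDFS placement is rack-aware.
        start = (offset // CHUNK_SIZE) % len(NODES)
        replicas = [NODES[(start + i) % len(NODES)] for i in range(replication)]
        plan.append((offset, length, replicas))
        offset += length
    return plan

if __name__ == "__main__":
    for chunk in plan_chunks(600 * 1024 * 1024):  # a 600 MB file becomes 3 chunks
        print(chunk)
```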

Page 4: Using RDBMS, NoSQL or Hadoop?

Ingest

[Diagram: as on the previous slide, HDFS ingest writes a 256 MB "chunk" synchronously across Nodes 1-3. Added here: NoSQL ingest, where <key><value> "records" land on a Master Node and reach Replica Node 1 and Replica Node 2 at t, t + 1, t + 2 (eventually consistent).]
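The eventual-consistency timeline above can be sketched in a few lines of Python; this is a toy simulation, not a real NoSQL client, and the node names, lag values and key are made up.

```python
# Minimal sketch (no real NoSQL client): the master applies a <key><value> write
# immediately; replicas apply it on later "ticks" (t + 1, t + 2), so a read from
# a replica can briefly return stale data, which is the picture on the slide.
class Node:
    def __init__(self, name, lag):
        self.name, self.lag, self.data, self.log = name, lag, {}, []

    def write(self, key, value, write_time):
        # The write becomes visible on this node only after its replication lag.
        self.log.append((write_time + self.lag, key, value))

    def read(self, key, now):
        for apply_time, k, v in self.log:
            if apply_time <= now and k == key:
                self.data[k] = v
        return self.data.get(key)

master = Node("master", lag=0)
replica1 = Node("replica-1", lag=1)
replica2 = Node("replica-2", lag=2)

# Write at t = 0: the master has it at once, replicas catch up at t = 1 and t = 2.
for node in (master, replica1, replica2):
    node.write("user:42", "eventually consistent", write_time=0)

for t in (0, 1, 2):
    print(t, [n.read("user:42", now=t) for n in (master, replica1, replica2)])
```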

Page 5: Using RDBMS, NoSQL or Hadoop?

Ingest

[Diagram: the HDFS and NoSQL ingest paths from the previous slides, plus RDBMS ingest: SQL "transactions" are parsed and validated on the compute tier and written, optimized, to database files (DBF) on ASM storage.]
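A minimal sketch of the "parse / validate" work an RDBMS does on ingest, using an assumed JSON record layout and schema; the point is that bad records are rejected at write time, whereas HDFS and NoSQL would simply store the raw bytes.

```python
# Minimal sketch of "parse / validate on ingest": every incoming record is parsed
# and checked before it is stored, which costs time at write but yields typed,
# queryable data. The schema below is an illustrative assumption.
import json
from datetime import datetime

SCHEMA = {"order_id": int, "amount": float, "ts": str}

def parse_and_validate(raw_line):
    record = json.loads(raw_line)                 # parse
    for field, expected_type in SCHEMA.items():   # validate presence and type
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    datetime.fromisoformat(record["ts"])          # validate the timestamp format
    return record

good = '{"order_id": 1, "amount": 9.99, "ts": "2015-11-17T10:00:00"}'
bad = '{"order_id": "one", "amount": 9.99, "ts": "2015-11-17T10:00:00"}'

print(parse_and_validate(good))
try:
    parse_and_validate(bad)
except ValueError as err:
    print("rejected:", err)   # HDFS or NoSQL would simply have stored the raw bytes
```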

Page 6: Using RDBMS, NoSQL or Hadoop?

High-level Comparison (Ingest)

                              HDFS                    NoSQL                        RDBMS
  Data Type                   Chunk                   Record                       Transaction
  Write Type                  Synchronous             Eventually Consistent        ACID Compliant
  Data Preparation            No Parsing              No Parsing                   Parsing and Validation

Page 7: Using RDBMS, NoSQL or Hadoop?

Disaster Recovery - High Availability Across Geography

Page 8: Using RDBMS, NoSQL or Hadoop?

Disaster Recovery: Hadoop

[Diagram: in Data Center 1, ingest writes the 256 MB "chunk" synchronously between Nodes 1-3. For disaster recovery the data MUST be duplicated to a separate cluster (Nodes 1-3) in Data Center 2.]

Page 9: Using RDBMS, NoSQL or Hadoop?

Disaster Recovery: RDBMS

[Diagram: in Data Center 1, SQL "transactions" are parsed/validated and written, optimized, to database files (DBF) on ASM storage; Data Guard or GoldenGate replicates them to Data Center 2.]

Page 10: Using RDBMS, NoSQL or Hadoop?

Disaster Recovery: NoSQL Option

[Diagram: <key><value> "records" are ingested on the Master Node in Data Center 1 or Region 1 and propagated, eventually consistently, to Replica Node 1 and Replica Node 2 in Data Center 2 or Region 2 at t, t + 1, t + 2.]

Page 11: Using RDBMS, NoSQL or Hadoop?

Note on HBase
•  Don't confuse HBase with NoSQL in terms of geographical distribution options
•  HBase is based on HDFS storage
•  Consequently, HBase cannot span data centers the way other NoSQL systems can

Page 12: Using RDBMS, NoSQL or Hadoop?

Big Data Appliance includes Cloudera BDR
•  Oracle packages this on BDA as a Cloudera option called BDR (Backup and Disaster Recovery)
•  BDR = DistCp, a distributed copy method leveraging parallel compute (MapReduce-like) - see the sketch below
•  BDR is NOT as proven as Data Guard or GoldenGate
   –  Most importantly, think of BDR as a batch process, not a trickle feed
•  BDR is included (no extra charge) on BDA
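As a rough illustration of BDR's batch nature, the sketch below launches a DistCp copy between two clusters from Python; the cluster URIs and paths are placeholders, and on a BDA this scheduling is handled for you by the BDR tooling rather than by hand-written code.

```python
# Minimal sketch: a scheduled, batch-style cluster-to-cluster copy using DistCp,
# which is what BDR builds on. URIs and paths below are illustrative placeholders.
import subprocess

SOURCE = "hdfs://dc1-namenode:8020/data/warehouse"
TARGET = "hdfs://dc2-namenode:8020/data/warehouse"

def run_distcp(source, target):
    # `hadoop distcp` runs a parallel (MapReduce-based) copy job; it moves files
    # in bulk, so replication happens per batch run, not per record.
    subprocess.run(["hadoop", "distcp", "-update", source, target], check=True)

if __name__ == "__main__":
    run_distcp(SOURCE, TARGET)
```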

Page 13: Using RDBMS, NoSQL or Hadoop?

High-level Comparison (Ingest, DR)

                              HDFS                    NoSQL                        RDBMS
  Data Type                   Chunk                   Record                       Transaction
  Write Type                  Synchronous             Eventually Consistent        ACID Compliant
  Data Preparation            No Parsing              No Parsing                   Parsing and Validation
  DR Type                     Second Cluster          Node Replica                 Second RDBMS
  DR Unit                     File                    Record                       Transaction
  DR Timing                   Batch                   Record                       Transaction

Page 14: Using RDBMS, NoSQL or Hadoop?

Access Paths to Data Sets

Page 15: Using RDBMS, NoSQL or Hadoop?

Data Sets and Analytical SQL Queries: Hadoop

[Diagram callouts: a JSON file becomes 256 MB "chunk" blocks, no parse, no UPDATE; Hive uses MapReduce in parallel to achieve scale.]

INSERT
•  Data is NOT parsed on ingest; a JSON document, for example, is not parsed. Original files are loaded and broken into chunks
•  Does not like small files

UPDATE
•  No update allowed (append only)
•  Expensive to update a few KB, because it means replacing a 256 MB chunk (as part of a file replacement)

SELECT
•  Scans ALL data, even for a single-"row" answer
•  The SQL access path is a "full table scan"
•  Hive optimizes this with:
   –  Metadata (partition pruning) - see the sketch below
   –  MapReduce run in parallel to achieve scale
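A minimal Python sketch of the full-table-scan versus partition-pruning idea referenced above; the logs/dt=YYYY-MM-DD directory layout is a hypothetical stand-in for Hive partitions, and this is not Hive code.

```python
# Minimal sketch of "full scan" vs. partition pruning. Files are laid out under
# hypothetical partition directories such as logs/dt=2015-11-17/part-0000.json.
# A full scan reads every file; pruning uses the directory name (metadata) to
# skip whole partitions before any data is read.
import os

def all_files(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

def full_scan(root, wanted_date):
    # Without pruning: open and read every raw file on every query.
    return [f for f in all_files(root) if wanted_date in open(f).read()]

def pruned_scan(root, wanted_date):
    # With partition pruning: only descend into the matching dt=... directory.
    partition = os.path.join(root, f"dt={wanted_date}")
    return list(all_files(partition)) if os.path.isdir(partition) else []
```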

Page 16: Using RDBMS, NoSQL or Hadoop?

Data Sets and Analytical SQL Queries: NoSQL

[Diagram callouts: a JSON document stored as <key><value>; access is an index lookup through the API: Get, Put, Mget, Mput - see the sketch below.]

PUT (Insert)
•  Data is typically not parsed; a JSON document, for example, is loaded as a value with a key

GET (Select)
•  Data retrieval through "index lookup" based on the KEY
•  The only access path is an index (primary or secondary)
•  If retrieving from a replica, the data may not be consistent (yet)
•  Generally you do not issue SQL over NoSQL
•  Not used for large analysis; instead used to receive and provide single records, or small sets based on the same key
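A minimal sketch of the Get/Put/Mget access pattern named in the callout, using a plain in-process dictionary rather than a real NoSQL client; the store class, keys and values are illustrative.

```python
# Minimal sketch of the key/value access pattern (Put, Get, Mget): every read is
# an index lookup by key, and there is no query language to scan or join values.
class TinyKeyValueStore:
    def __init__(self):
        self._index = {}                      # the primary "index": key -> value

    def put(self, key, value):
        self._index[key] = value

    def get(self, key):
        return self._index.get(key)           # single-key index lookup

    def mget(self, keys):
        return {k: self._index.get(k) for k in keys}

store = TinyKeyValueStore()
store.put("user:42", {"name": "J. Doe", "city": "Nuremberg"})   # JSON-ish value
store.put("user:43", {"name": "A. N. Other"})
print(store.get("user:42"))
print(store.mget(["user:42", "user:43", "user:44"]))            # missing key -> None
```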

Page 17: Using RDBMS, NoSQL or Hadoop?

Data Sets and Analytical SQL Queries: RDBMS

[Diagram callouts: TABLE and INDEX structures; data is parsed, metadata is available on write, and auxiliary structures enable a myriad of access paths - see the sketch below.]

INSERT
•  Parse data and optimize it for retrieval
•  Adhere to transaction consistency

UPDATE
•  Enables individual records to be updated
•  With read consistency, row-level locking, etc.

SELECT
•  Read consistency guaranteed
•  SQL optimization:
   –  Choose the best access path based on the question and database statistics
   –  Optimized joins, spills to disk, etc.
•  Supports very complex questions
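A small, runnable illustration of parse-on-insert plus an auxiliary index, using SQLite from Python's standard library as a stand-in for a full RDBMS; the table, constraints and data are made up for the example.

```python
# Minimal sketch: the database parses/validates on INSERT (types, NOT NULL,
# CHECK) and an index gives the optimizer an access path other than a full scan.
# SQLite stands in for a full RDBMS here purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL CHECK (amount > 0)
    )
""")
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer)")

conn.execute("INSERT INTO orders VALUES (1, 'acme', 9.99)")
try:
    conn.execute("INSERT INTO orders VALUES (2, 'acme', -5)")   # rejected on ingest
except sqlite3.IntegrityError as err:
    print("rejected on insert:", err)

# The optimizer can pick the index instead of scanning the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'acme'"
).fetchall()
print(plan)   # typically shows a SEARCH ... USING INDEX idx_orders_customer
```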

Page 18: Using RDBMS, NoSQL or Hadoop?

High-level Comparison (Ingest, DR, Access)

                              HDFS                    NoSQL                        RDBMS
  Data Type                   Chunk                   Record                       Transaction
  Write Type                  Synchronous             Eventually Consistent        ACID Compliant
  Data Preparation            No Parsing              No Parsing                   Parsing and Validation
  DR Type                     Second Cluster          Node Replica                 Second RDBMS
  DR Unit                     File                    Record                       Transaction
  DR Timing                   Batch                   Record                       Transaction
  Complex Analytics?          Yes                     No                           Yes
  Query Speed                 Slow                    Fast for simple questions    Fast
  # of Data Access Methods    One (full table scan)   One (index lookup)           Many (optimized)
  Summary                     Affordable Scale        Low Predictable Latency      Flexible Performance

Page 19: Using RDBMS, NoSQL or Hadoop?

Performance Aspects: Ingest and Query

Page 20: Using RDBMS, NoSQL or Hadoop?

A simple set of criteria

[Radar chart, scale 0 to 5, comparing RDBMS, NoSQL DB and Hadoop on: Concurrency, Complex Query Response Times, Single Record Read/Write Performance, Bulk Write Performance, Privileged User Security, General User Security, Governance Tools, System per TB Cost, Backup per TB Cost, Skills Acquisition Cost.]

Page 21: Using RDBMS, NoSQL or Hadoop?

Ingest Performance Considerations

HDFS
•  No parsing
•  Fastest for ingesting large data sets, like log files
•  Bulk write of sets of data, not individual records
•  Slower write on a record-by-record basis than NoSQL

NoSQL
•  No parsing
•  Fastest for writing individual records to the master
•  Direct writes on a per-record basis
•  Best for millisecond writes

RDBMS
•  The RDBMS does work to the data (parsing) on ingest
•  Ingest speed is slower than NoSQL on a record-by-record basis due to parsing
•  Ingest speed is slower than Hadoop on a file-by-file basis due to parsing
•  The benefits of RDBMS parsing become evident in query performance…

Page 22: Using RDBMS, NoSQL or Hadoop?

Query Performance Considerations

HDFS
•  Requires parsing
•  Slowest on query time
   –  Due to full table scans on top of parsing
   –  No query caching
•  Able to run complex queries
   –  Requires use of MR or Spark programs
   –  SQL immature (SQL-92)

NoSQL
•  No parsing
•  Fastest on query time for "get"
•  Consistently fast is the goal
•  No syntax available to run complex queries

RDBMS
•  Fastest SQL query times because of the parsing work done on ingest
   –  Completely optimized storage formats for I/O
   –  Oracle knows how to pick data out (metadata on the files)
   –  Least amount of I/O to retrieve records
   –  Advanced caching
•  The RDBMS can run more complex SQL queries than NoSQL or Hadoop

Page 23: Using RDBMS, NoSQL or Hadoop?

SQL on Hadoop vs RDBMS

Hadoop
•  Pay on QUERY for gains on INGEST
•  Full table scan
   –  Shorten the full table scan by adding more nodes to scan through the data
•  The problem remains that the data isn't parsed
   –  All raw data must be read for EACH and EVERY read

RDBMS
•  Pay on INGEST (parsing) for gains on QUERY
•  Optimized query path (see the back-of-the-envelope sketch below)
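The back-of-the-envelope sketch referenced above, with entirely made-up numbers: Hadoop pays little at ingest but rescans all raw bytes per query, while the RDBMS pays a parsing cost once and then reads only the fraction of data an access path actually needs.

```python
# Back-of-the-envelope cost model with made-up, illustrative numbers only:
# Hadoop re-reads all raw data per query; the RDBMS pays parsing once at ingest
# and then reads only the (indexed) fraction of data each query actually needs.
DATASET_GB = 1_000
QUERIES = 500
SELECTIVITY = 0.01            # assume each query touches ~1% of the data

hadoop_ingest_gb = DATASET_GB * 1.0          # write raw chunks, no parsing
hadoop_query_gb = QUERIES * DATASET_GB       # full scan per query

rdbms_ingest_gb = DATASET_GB * 3.0           # assume parsing/indexing ~3x the write work
rdbms_query_gb = QUERIES * DATASET_GB * SELECTIVITY   # indexed access path

print("Hadoop total GB processed:", hadoop_ingest_gb + hadoop_query_gb)
print("RDBMS  total GB processed:", rdbms_ingest_gb + rdbms_query_gb)
```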

Page 24: Using RDBMS, NoSQL or Hadoop?

Benefits Compared

Hadoop
•  "Schema on read" flexibility: write the file to disk without error checking
•  Programmatically stitch disparate data together
•  Affordable scale

NoSQL
•  Low latency for simple actions
•  HOWEVER: no API to run analytical queries

RDBMS
•  High performance for lots of workloads without the end user having to know how
•  Parsed data, multiple access paths, advanced query caching, in-memory formats
•  Consistency: the exact same result at the exact same time
•  Traceability of transactions
•  Applications become simpler

Page 25: Using RDBMS, NoSQL or Hadoop?

Concurrency

Page 26: Using RDBMS, NoSQL or Hadoop?

Concurrency Comparisons

HDFS
•  Most Hadoop systems have a small number of users running a large number of jobs
   –  Batch or very frequent micro-batch calculations
•  Customers report that Impala struggles with concurrency (much like Netezza, or worse)
•  Immature resource management; not all solutions work nicely together

NoSQL
•  Very high concurrency
•  Geographically distributed
•  Must deal with consistency issues
•  Great reader farm for publishing data to, for example, web apps
•  Load balancing built in

RDBMS
•  Hundreds of users, hundreds of queries running at the same time
•  Queries are balanced over the system
•  Resource management able to control the whole system
•  Proven in many large organizations

Page 27: Using RDBMS, NoSQL or Hadoop?

Cost

Page 28: Using RDBMS, NoSQL or Hadoop?

Cost
•  Customers want to reduce cost by moving from the RDBMS to Hadoop
   –  Hadoop delivers the lowest cost per TB
•  But which workloads can move from the RDBMS to Hadoop?
   –  Dealing with transactions? Stay in the RDBMS
   –  Running advanced SQL queries? Stay in the RDBMS
   –  Large numbers of concurrent users? Stay in the RDBMS
•  You will hit a high performance penalty for moving a DW to Hadoop because of full table scans

Page 29: Using RDBMS, NoSQL or Hadoop?

Can SQL on Hadoop solve the problems without sacrifice?

Page 30: Using RDBMS, NoSQL or Hadoop?

No... 1½ Attempts

Impala
•  A solution to speed up Hive and MR
   –  Wrote its own optimizer and query engine
•  Large speedup, but at the cost of:
   –  No MR, so it loses scalability
   –  Fewer SQL capabilities, so it loses BI transparency
   –  System cost increases to run Impala (needs additional memory), so it loses some of the cost advantage

Impala with Parquet
•  Convert files into columnar data blocks (the Parquet file format)
•  Speedup because of columnar formats, etc.
•  But at a cost:
   –  Must run ETL to Parquet (comparable to ingest into an RDBMS) - see the sketch below
   –  Lose schema-on-read flexibility
   –  Double the data
•  Essentially a new columnar DB on HDFS
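A minimal sketch of the "ETL to Parquet" step referenced above, using pandas with a Parquet engine such as pyarrow; the input file name and columns are illustrative, and the point is that the raw data is parsed and rewritten into a second, columnar copy.

```python
# Minimal sketch of the "ETL to Parquet" step: raw JSON-lines are parsed into a
# typed, columnar layout and written back out, so the data now exists twice and
# the schema-on-read flexibility of the raw files is gone. Requires pandas plus
# a Parquet engine (e.g. pyarrow); file names and columns are illustrative.
import pandas as pd

raw = pd.read_json("events.json", lines=True)          # parse the raw records
raw["amount"] = raw["amount"].astype("float64")        # enforce a column type
raw.to_parquet("events.parquet", index=False)          # write the columnar copy

print(pd.read_parquet("events.parquet").head())        # query-friendly, but a second copy
```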

Page 31: Using RDBMS, NoSQL or Hadoop?

Conclusion: Choose the Right Tool for the Right Job
Don't use a tool for something it isn't meant to do

Page 32: Using RDBMS, NoSQL or Hadoop?