performance evaluation of a mongodb and hadoop platform...

27
1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Elif Dede, Madhusudhan Govindaraju Lavanya Ramakrishnan, Dan Gunter, Shane Canon Department of Computer Science, Binghamton University (SUNY) Lawrence Berkeley National Laboratory

Upload: others

Post on 14-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

1

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis

Elif Dede, Madhusudhan Govindaraju

Lavanya Ramakrishnan, Dan Gunter, Shane Canon

Department of Computer Science, Binghamton University (SUNY) Lawrence Berkeley National Laboratory

Page 2: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Computa(on  and  Data  are  cri(cal  parts  of  the  scien(fic  process  

Experiment

Theory

Computation

Data (Fourth Paradigm)

Advance  Light  Source  Data  Rates  

2009          65  TB/yr  

2011      312  TB/yr  

2013   1900  TB/yr  

Three Pillars of Science

Page 3: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Materials  Project  

3

Schemaless database

manager.x manager.x manager.x

Brain

www.materialsproject.org Source: Michael Kocher, Daniel Gunter

Page 4: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Data  is  “Big”  

4

Page 5: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Processing  “Big  Data”:  MapReduce  

5

• Introduced in OSDI 2004 by Dean and Ghemawat from Google

• Programming model for processing large data sets

• Exploits large a set of commodity machines

• Characteristics of the model: •  Relaxed synchronization

constraints •  Locality optimization •  Fault-tolerance •  Load balancing OSDI  2004  

Page 6: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Map  and  Reduce    Map/Reduce:

•  The map() function is called on every item in the input set and emits a series of intermediate key/value pairs

•  All values associated with a given intermediate key are grouped together

•  The reduce() function is called on every unique intermediate key, and its value list, and emits a final output value

6

Page 7: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Apache  Hadoop  •   Open-­‐source  MapReduce  implementa;on  in  Java  •   Easy  scalability    •   Built-­‐in  I/O  management  

•   Hadoop  Distributed  File  System(HDFS)  • Data  distribu(on,  management  and  replica(on  

•   Load  balancing  •   Handles  stragglers  

•   Fault  tolerance  •   Commodity  hardware  •   Heartbeats  •   Specula(ve  execu(on  and  data  replica(on  

• Hadoop  Streaming  • Create  and  run  MapReduce  jobs  with  any  executable  or  script  as  the  mapper  and/or  the  reducer  

7

Page 8: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop    Hadoop  provides:  •  Data  Flow  Parallelism    

•  Data  goes  through  different  steps  of  processing  •  Similar  Job  Phases  

•  Data  prepara(on,  transforma(on  and  reduc(on    •  MapReduce:  maps  (transforma(on)  and  reduces  

(reduc(on)  •  Number  of  maps  >>>  Number  of  reduce  

•  Data  transforma(on  is  typically  more  parallel    than  data  reduc(on    

•  Fault  Tolerance  and  Data  Locality  •  Data  intensive  loads  •  Long  running  scien(fic  jobs  

8

Page 9: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop  (Cont.)    

Hadoop  does  not  provide:  •  Java  implementa(on  

•  Legacy  scien(fic  code  mostly  is  not  in  java  and  hard  to  rewrite  as  map  and  reduce  func(ons  

•  Hadoop  Streaming  allows  other  modes    •  HDFS  is  a  non-­‐POSIX  file  system  

•  HDFS  java  library  calls  needed  to  create,  read  and  write  files  •  HDFS  data  locality  good  but  does  not  handle  applica(ons  that  

might  have  mul(ple  data  sets  •  Scien(fic  data  formats  do  not  fit  in  the  line/block  oriented  inputs  of  

typical  Hadoop  jobs  •  Scien(fic  applica(ons  o]en  work  with  files  where  the  logical  

division  of  work  is  per  file  •  New  file  formats  require  addi(onal  java  programming  to  define  

the  format,  appropriate  split  for  a  single  map  task  

9

Page 10: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop  (Cont.)    

Hadoop  does  not  provide:  

•  Maps  and  reduces  are  considered  iden(cal  (executables/arguments)  

•  Implemen(ng  different  tasks  requires  logic  in  the    tasks  that  differen(ate  the  func(onality  

•  This  can  cause  worker  processing  (mes  to  vary  widely  an  lead  to  (meouts  and  restarted  tasks  due  to  the  specula(ve  execu(on  in  Hadoop  

•  No  built-­‐in  dynamic  and  itera(ve  applica(on  support  

10

Page 11: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

New  Genera;on  Data  

11

•  Dynamic Data •  Size and Content

•  Structured? •  Semi structured,

unstructured

•  Relational? • Not always

Page 12: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

NoSQL  

12

A broad class of data management systems where the data is partitioned across a set of servers, where no server plays a privileged role

• NoSQL has emerged as an alternative model for this new non-relational data model.

• Address the ``Big Data'' challenge by providing horizontal scalability.

• Lower maintenance costs and flexibility.

• There are various data models that are represented under NoSQL including key-value, column-oriented and document-oriented stores.

• Each of these models has its own interpretation of data storage and makes different tradeoffs within the Consistency, Availability and Performance

Page 13: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

What  is  MongoDB?  

13

• Open source document-oriented database

•  Data is not in tables with rows and columns

• Data is stored as “documents”, each of which is a associative array of scalar values, or nested associative arrays

•  Javascript Object Notation (JSON) format

•  Stored as BSON

• MongoDB uses sharding to split the data evenly across the cluster to parallelize access.

• This is done through front-end “routing servers” and back-end “data servers”

• Provides a built-in MapReduce • Drawbacks • The MapReduce scripts should be written in JavaScript •  Slow and poor analytics libraries • The JavaScript implementation used by the MongoDB is not thread safe

Page 14: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Why  MongoDB?  

14

Materials Project: •  A community accessible data store of calculated materials.

• Data store is complex with hundreds of attributes and constantly evolving.

• MongoDB provides an appropriate data model and query language.

• The project also needs to perform complex statistical data mining to discover patterns in materials and validate/verify correctness.

• These task are difficult with MongoDB but natural for MapReduce

ALS: • Advanced Light Source’s Tomogropy beamline uses MongoDB to store metadata from experiments (Summer’12, LBNL)

Schemaless database

manager.x manager.x manager.x

Brain  

www.materialsproject.org  Source:  Michael  Kocher,  Daniel  Gunter  (LBNL)  

Page 15: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB  Connector  

15

• Input splits are retrieved from a MongoDB server(s)

• Each mapper can read its splits in parallel

• Results are written back to MongoDB by the Hadoop reducer(s)

• It works with single MongoDB server or with a sharding setup

• User determines the split size

Page 16: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB:  Overhead  of  mul;ple  connec;ons  

16

• Test ability to handle large number of simultaneous connections •  768 tasks with different checkpoint intervals compared to when there is no checkpoint

•  overhead

•  Connections increased from 154 to 768 per second, write volume increased to 768 MBs/.

Page 17: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB:  Overhead,  when  using  more  nodes,  tasks  

17

• 10 min per task, •  All tasks run in parallel •  10 sec checkpoint interval •  Overhead observed after 1000 parallel tasks

• Large number of connections is the bottleneck

•  More than the data volume

Page 18: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB  MapReduce  vs.  Hadoop-­‐MongoDB    Read/Write  Performance  Comparison  

18

• Data is stored on a single MongoDB server

• Hadoop cluster consists of 2 worker nodes

• The mongo-hadoop plug-in provides roughly five times better performance.

Page 19: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Choosing  the  Split  Size    

19

• Processing 9.3 million input records with Hadoop

• Each mapper reads an input split from the MongoDB server, does processing and sends its intermediate output to the reducer

• Split size varies:16, 32, 64, 128, 254 MBs

•  sweet spot: 128 MB

•  With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size, this number drops around to 40

Page 20: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Increasing  Data    

20

• For 4.6 million input records, HDFS Hadoop is two times better than MongoDB, and at 37.2 million records it is five times

• At 37.2 million input records mongo-hadoop is more than 3 times slower in reading and more than nine times in writing than Hadoop-HDFS.

• In a sharded setup, mongo-hadoop reading times improve considerably.

2-node Hadoop Cluster and 2 Mongo-DB servers.

Page 21: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Sharding  and  processing  on  local  nodes  vs  different  nodes    

21

• The performance slightly worsened compared to running the servers on different machines.

•  MongoDB uses mmap to aggressively cache data from disk into memory

•  With increasing input size growing memory and CPU usage is observed on the worker/server nodes

• This effects the performance of the MapReduce job

Performance bottleneck is due to memory Contention. Locality has minimal effect.

Page 22: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Increasing  #Workers    

22

•  The performance over increasing cluster sizes from 16 to 64 cores

•  Single to two MongoDB sharded servers

• The write time is bound by the reduce phase for this MapReduce job

• Number of mappers >> number of reducers

• The write performance of MongoDB still remains to be a bottleneck along with the overhead of routing data to be written between sharding servers.

Write performance of MongoDB is a bottleneck.

Page 23: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Different  Setups    (given  that  the  data  is  in  MongoDB)  

23

• Best performance achieved reading from MongoDB and writing the output to HDFS

• Downloading the data to HDFS before running the analysis is the slowest .

Hadoop-HDFS provides the best peformance.

Page 24: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Different  Setups    

24

•  Increasing cluster size (from 8 cores to 64) for 37.2 million input records

• With an increasing number of worker nodes the concurrency of the map phase increases

• The map times get considerably faster

Page 25: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Fault  Tolerance    

25

• 32 node Hadoop cluster processing ~37 million input records

• After eight faulted worker nodes Hadoop-HDFS loses too many data nodes and fails to complete the MapReduce job

•  Mongo-hadoop gets the input splits from the MongoDB server therefore losing worker nodes does not lead to loss of input data

Page 26: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Conclusions  •  Sharding  helps  to  improve  MongoDB’s  performance  especially  for  reads.    

•  In  a  sharded  setup,  mongo-­‐hadoop  reading  (mes  improve  considerably,  as  there  are  mul(ple  servers  to  respond  to  parallel  worker  requests  

•  In  cases  where  data  is  stored  in  MongoDB  and  needs  to  be  analyzed,  the  mongo-­‐hadoop  connector  is  a  convenient  way  to  use  Hadoop.    

•  Performance  improves    when  output  is  wriben  to  HDFS  

•  MongoDB  performance  degrada(on  observed  with  the  increasing  number  of  connec(ons,  increasing  write  requests  per  second,  as  well  as  the  increase  in  total  write  volume  

•  The  mongo-­‐hadoop  plug-­‐in  provides  roughly  five  (mes  beber  performance  compared  to  using  MongoDB’s  na(ve  MapReduce  implementa(on.    

•  The  performance  gain  from  using  mongo-­‐hadoop  increases  linearly  with  input  size.  

26

Page 27: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

27

Contact

Madhu Govindaraju [email protected] Binghamton University State University of New York (SUNY)

Dan Gunter [email protected] Lawrence Berkeley National Laboratory