Post on 18-Jul-2015
EGI-InSPIRE RI-261323
Hadoop analytics provisioning based on a virtual infrastructure
J. López Cacheiro, A. Simón, J. Villasuso, E. Freire,
Iván Díaz, C. Fernández, and B. Parak
EGI-InSPIRE
Introduction
● Federation of resources poses serious challenges both from the cloud and Big Data perspectives.
● FedCloud has been in production since mid-May 2014
● Is it suitable for performing Big Data analytics using Hadoop?
Introduction: FedCloud
● A federated cloud infrastructure developed within EGI
● The EGI Federated Cloud Task Force started its activity in September 2011
● FedCloud has been in production since mid-May 2014
● During the last year it has been available for experimentation by the different user communities
Introduction: Big Data
● Big Data is an emerging field
● It can benefit from existing developments in cloud technologies
● One of the most popular solutions is Apache Hadoop
Introduction: Hadoop
● A distributed filesystem (HDFS)
● A framework for job scheduling
● A MapReduce implementation for parallel processing of large data sets
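The MapReduce model named above can be sketched in plain Python as a toy simulation (this is not the Hadoop API): a mapper emits (word, 1) pairs, a shuffle step groups them by key, and a reducer sums the counts per key.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Sum the counts for a single word.
    return (key, sum(values))

def wordcount(lines):
    intermediate = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(intermediate))

print(wordcount(["big data big"]))  # {'big': 2, 'data': 1}
```

In Hadoop the map and reduce phases run in parallel across the slave nodes, with HDFS holding the input and output data.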
Leveraging FedCloud for Hadoop
The aim of our work is to make an initial assessment of the suitability of FedCloud for running Hadoop jobs through a series of real-world benchmarks.
Methodology: Cluster startup
● A Hadoop cluster consists of one master node and a variable number of slave nodes, depending on the size of the cluster
● The master node runs three services: namenode, secondarynamenode, and jobtracker
● The slaves run two services: tasktracker and datanode
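The service split above can be captured in a small illustrative sketch (the daemon names are the Hadoop 1.x ones listed in the slide; the node-naming scheme is an assumption):

```python
# Hadoop 1.x daemons per node role, as described above (illustrative sketch).
SERVICES = {
    "master": ["namenode", "secondarynamenode", "jobtracker"],
    "slave": ["tasktracker", "datanode"],
}

def cluster_layout(n_slaves):
    """Return the service assignment for one master plus n_slaves slaves."""
    layout = {"master": SERVICES["master"]}
    for i in range(n_slaves):
        layout[f"slave{i:03d}"] = SERVICES["slave"]
    return layout

layout = cluster_layout(100)  # the 101-node cluster used in the benchmarks
print(len(layout))            # 101
```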
Methodology: Cluster startup
● A customized VM image was created, including different versions of Hadoop (up to 1.2.1) and the Oracle Java JDK
● This image was registered in the EGI Marketplace
● Each RP manually downloaded the image to the local site in order to make it available at its endpoint
● To instantiate the VMs that form the Hadoop cluster we used rOCCI
● FedCloud does not have a WMS (workload management system), so each VM creation request must be sent to the appropriate endpoint
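A VM creation request per node could look roughly as follows. This is a sketch that builds an rOCCI-client invocation; the flag names are assumptions based on common rOCCI client usage, and the endpoint and template identifiers are hypothetical placeholders (real ones are site-specific).

```python
def rocci_create_cmd(endpoint, os_tpl, resource_tpl, title):
    """Build an rOCCI client 'create compute' command line.
    Flag syntax is an assumption; check your rOCCI client version."""
    return [
        "occi",
        "--endpoint", endpoint,                     # the chosen site's endpoint (no central WMS)
        "--action", "create",
        "--resource", "compute",
        "--mixin", f"os_tpl#{os_tpl}",              # the registered Hadoop VM image
        "--mixin", f"resource_tpl#{resource_tpl}",  # flavour (CPU/RAM)
        "--attribute", f"occi.core.title={title}",
    ]

# One request must be issued per VM in the cluster:
cmd = rocci_create_cmd("https://site.example.org:11443/", "hadoop-1.2.1", "medium", "hadoop-slave-001")
print(" ".join(cmd))
```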
Methodology: Benchmarks
● Two standard Big Data benchmarks: TeraGen and TeraSort
● Two use cases:
● Representing typical small and medium-size jobs
● Using real data sets and common MapReduce operations
Methodology: TeraGen/TeraSort
● The “equivalent” of Linpack in HPC
● The Sort Benchmark list: the “equivalent” of the Top500
● TeraGen generates an input data set in parallel
● TeraSort is a standard MapReduce sort job
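TeraGen generates fixed-size 100-byte rows, so the row count for a target data size is size/100. The sketch below builds the two benchmark invocations; the examples jar name varies by Hadoop distribution, so it is a parameter here rather than a fixed path.

```python
def teragen_cmd(target_bytes, out_dir, examples_jar="hadoop-examples.jar"):
    """TeraGen takes the number of 100-byte rows to generate.
    The jar name is distribution-dependent (assumption here)."""
    rows = target_bytes // 100
    return ["hadoop", "jar", examples_jar, "teragen", str(rows), out_dir]

def terasort_cmd(in_dir, out_dir, examples_jar="hadoop-examples.jar"):
    """TeraSort reads the TeraGen output and writes the sorted result."""
    return ["hadoop", "jar", examples_jar, "terasort", in_dir, out_dir]

# A 1 TB run needs 10 billion rows:
print(teragen_cmd(10**12, "/terasort-input")[4])  # 10000000000
```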
Methodology: Use cases
● Small-size jobs: Encyclopaedia Britannica (176 MB)
● Medium-size jobs: Wikipedia (41 GB)
Methodology: Use cases
● Both use cases were run for different cluster sizes ranging from 10 to 101 nodes
● In all the executions we measured two parameters:
● The put time (loading the data set into HDFS)
● The wordcount time
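The two measurements could be taken with a simple wall-clock wrapper like the one below. The `hadoop fs -put` and `hadoop jar … wordcount` invocations match Hadoop 1.x usage, but the directory paths and jar name are illustrative assumptions.

```python
import subprocess
import time

def timed(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def measure(dataset, hdfs_dir, out_dir, examples_jar="hadoop-examples.jar"):
    # put time: loading the data set into HDFS
    put_time = timed(["hadoop", "fs", "-put", dataset, hdfs_dir])
    # wordcount time: running the MapReduce job itself
    wc_time = timed(["hadoop", "jar", examples_jar, "wordcount", hdfs_dir, out_dir])
    return put_time, wc_time
```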
Results: Cluster startup
● Startup times for each cluster deployment varied
● The most representative are those obtained for the largest deployment: the 101-node cluster
● The time needed by the rOCCI client to return all the resource endpoints ranged from 71 to 86 minutes
● Having all the VMs running took around 80 minutes more
● Total cluster startup time ranged from 2.5 to 3 hours
OpenNebula Frontend Load
More than 20 simultaneous scp processes considerably affect system performance
Results: Startup Optimizations
● We reduced the size of the VM image to 4 GB
● An additional disk of 70 GB was created on-the-fly
● We modified the datastore used in OpenNebula, so that the transfer manager used is shared instead of ssh
● Instances are declared as non-persistent
● The image copy process is no longer done in a centralized way from the front-end node, but in parallel from each node using a simple copy operation from the shared storage
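The transfer-manager change can be expressed in the OpenNebula datastore template. A minimal sketch, assuming the standard `TM_MAD` attribute (the datastore name and any other attributes are site-specific placeholders):

```
# OpenNebula datastore template (sketch; names and paths are site-specific)
NAME   = "hadoop-images"
TM_MAD = "shared"   # was "ssh": images are now read in parallel from shared storage
```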
Results: Optimized Startup Times
Cluster size (#)   Startup time (s)
10                 249
21                 269
51                 822
101                1096
Results: Startup Amazon EC2
Cluster size (#)   Startup time (s)   Comments
10                 191
21                 178
51                 583*               2 nodes not working
101                276*               16 nodes not working
101                297*               1 node not working
Conclusions
● FedCloud is especially suitable for small and medium-size Hadoop jobs where the data set is already pre-deployed in HDFS
● Optimized startup times are very close to those obtained in Amazon EC2
● The initial data set deployment (put time) imposes a large overhead
● Scalability: we were able to run the TeraGen benchmark up to 2TB and the TeraSort benchmark up to 500GB
Future work: FedCloud Wish List
● Adding a central workload management system
● An automated way to distribute and synchronize the image between all sites
● A mechanism to query the resources available at a given site