hadoop analytics provisioning based on a virtual infrastructure

28
EGI-InSPIRE RI-261323 Hadoop analytics provisioning based on a virtual infrastructure J. López Cacheiro, A. Simón, J. Villasuso, E. Freire, Iván Díaz , C. Fernández, and B. Parak 1 EGI-InSPIRE

Upload: cesga-foundation

Post on 18-Jul-2015

51 views

Category:

Internet


4 download

TRANSCRIPT

EGI-InSPIRE RI-261323

Hadoop analytics provisioning based on a virtual infrastructure

J. López Cacheiro, A. Simón, J. Villasuso, E. Freire,

Iván Díaz, C. Fernández, and B. Parak

1

EGI-InSPIRE

EGI-InSPIRE RI-261323

Outline

● Introduction● Methodology● Results● Conclusions● Future work

EGI-InSPIRE RI-261323

Introduction

● Federation of resources poses serious challenges both from the cloud and Big Data perspectives.

● FedCloud is in production since mid May 2014● Is it suitable to perform Big Data analytics using Hadoop?

EGI-InSPIRE RI-261323

Introduction: FedCloud

● A federated cloud infrastructure developed inside EGI● EGI Federated Cloud Task Force started its activity in

September 2011● FedCloud is in production since mid May 2014● During the last year it has been already available for

experimentation by the different user communities

EGI-InSPIRE RI-261323

Introduction: Big Data

● Big Data is an emerging field ● It can benefit from existing developments in cloud

technologies● One of the most popular solutions is Apache Hadoop

EGI-InSPIRE RI-261323

Introduction: FedCloud

● A federated cloud infrastructure developed inside EGI● EGI Federated Cloud Task Force started its activity in

September 2011● FedCloud is in production since mid May 2014● During the last year it has been already available for

experimentation by the different user communities

EGI-InSPIRE RI-261323

Introduction: Hadoop

● A distributed filesystem (HDFS)● A framework for job scheduling● A MapReduce implementation for parallel processing of

large data sets

EGI-InSPIRE RI-261323

Leveraging FedCloud for Hadoop

The aim of our work is to do an initial assessment of the suitability of FedCloud to run Hadoop jobs through a series of real-world benchmarks.

EGI-InSPIRE RI-261323

Methodology

● Cluster startup● Benchmarks

EGI-InSPIRE RI-261323

Methodology: Cluster startup

● Hadoop clusters consists of one master node and a variable number of slave nodes depending on the size of the cluster

● The master node will run three services● namenode● secondarynamenode● jobtracker

● The slaves will run two services:● tasktracker ● datanode

EGI-InSPIRE RI-261323

Methodology: Cluster startup

● A customized VM image was created including different versions of Hadoop (up to 1.2.1) and Oracle Java JDK

● This image was registered in the EGI Marketplace ● Each RP manually downloaded the image to the local site in

order to make it available at its endpoint● To instantiate the VM that will form the Hadoop cluster we

used rOCCI● FedCloud does not count with a WMS, so each VM creation

request must be sent to the appropriate endpoint

EGI-InSPIRE RI-261323

Methodology: Cluster startup

EGI-InSPIRE RI-261323

Methodology: Benchmarks

● Two standard Big Data benchmarks: ● TeraGen and TeraSort

● Two use cases:● Representing typical small and medium-size jobs● Using real data sets and common MapReduce operations

EGI-InSPIRE RI-261323

Methodology: Teragen/Terasort

● The “equivalent” of Linpack in HPC● The Sort Benchmarks list: “equivalent” to Top500● TeraGen generates a input data set in parallel● TeraSort is a standard MapReduce sort job

EGI-InSPIRE RI-261323

Methodology: Use cases

● Small-size jobs:● Encyclopaedia Britannica: 176 MB

● Medium-size jobs:● Wikipedia: 41 GB

EGI-InSPIRE RI-261323

Methodology: Use cases

● Both use cases were run for different cluster sizes ranging from 10 to 101

● In all the executions we measured two parameters● The put time● The wordcount time

EGI-InSPIRE RI-261323

Results

● Cluster startup: including comparison with Amazon EC2● Benchmarks

EGI-InSPIRE RI-261323

Results: Cluster startup

● Startup times for each cluster deployment varied● The most representative ones are those obtained for the

larger deployment: the 101-node cluster● The time needed by the rOCCI client to return all the

resource endpoints ranged from 71 to 86 minutes● To have all the VMs running took around 80 minutes more● Total cluster startup time ranged from 2.5 to 3 hours

EGI-InSPIRE RI-261323

OpenNebula Frontend Load

More than 20 simultaneous scp processesaffect considerably the system performance

EGI-InSPIRE RI-261323

Results: Startup Optimizations

● We reduced the size of the VM image to 4GB● An additional disk of 70GB was created on-the-fly● We modified the datastore used in OpenNebula, so that the

transfer manager used is shared instead of ssh● Instances are declared as non-persistent● The image copy process is no longer done in a centralized

way from the front-end node but in parallel from each node using a simple copy operation from the shared storage

EGI-InSPIRE RI-261323

Results: Optimized Startup Times

Cluster size(#)

Startup time (s)

10 249

21 269

51 822

101 1096

EGI-InSPIRE RI-261323

Results: Startup Amazon EC2

Cluster size(#)

Startup time(s)

Comments

10 191

21 178

51 583* 2 nodes not 2 nodes not workingworking

101 276* 16 nodes not working

101 297* 1 node not working

EGI-InSPIRE RI-261323

Results: Benchmarks Teragen/Terasort

EGI-InSPIRE RI-261323

Results: Enciclopaedia Britannica

EGI-InSPIRE RI-261323

Results: Wikipedia

EGI-InSPIRE RI-261323

Conclusions

● FedCloud is especially suitable for:● small and medium-size Hadoop jobs ● where the data set is already pre-deployed in HDFS

● Optimized startup times are very close to those obtained in Amazon EC2

● The initial data set deployment (put time) imposes a large overhead

● Scalability: we were able to run the TeraGen benchmark up to 2TB and the TeraSort benchmark up to 500GB

EGI-InSPIRE RI-261323

Future work: FedCloud Wish List

● Adding a central workload management system● An automated way to distribute and synchronize the image

between all sites● A mechanism to query the resources available at a given site

EGI-InSPIRE RI-261323

Thanks For Your Attention!Questions?