Post on 18-Jul-2015
EGI-InSPIRE RI-261323
Hadoop analytics provisioning based on a virtual infrastructure
J. López Cacheiro, A. Simón, J. Villasuso, E. Freire,
Iván Díaz, C. Fernández, and B. Parak
EGI-InSPIRE
Introduction
● Federation of resources poses serious challenges both from the cloud and Big Data perspectives.
● FedCloud has been in production since mid-May 2014
● Is it suitable for performing Big Data analytics using Hadoop?
Introduction: FedCloud
● A federated cloud infrastructure developed within EGI
● The EGI Federated Cloud Task Force started its activity in September 2011
● FedCloud has been in production since mid-May 2014
● During the last year it has been available for experimentation by the different user communities
Introduction: Big Data
● Big Data is an emerging field
● It can benefit from existing developments in cloud technologies
● One of the most popular solutions is Apache Hadoop
Introduction: Hadoop
● A distributed filesystem (HDFS)
● A framework for job scheduling
● A MapReduce implementation for parallel processing of large data sets
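The MapReduce model named above can be sketched in plain Python as a toy simulation (this is not the Hadoop API): a mapper emits (word, 1) pairs, a shuffle step groups them by key, and a reducer sums the counts per key.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Sum the counts for a single word.
    return (key, sum(values))

def wordcount(lines):
    intermediate = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(intermediate))

print(wordcount(["big data big"]))  # {'big': 2, 'data': 1}
```

In Hadoop the map and reduce phases run in parallel across the slave nodes, with HDFS holding the input and output data.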
Leveraging FedCloud for Hadoop
The aim of our work is to make an initial assessment of the suitability of FedCloud for running Hadoop jobs through a series of real-world benchmarks.
Methodology: Cluster startup
● A Hadoop cluster consists of one master node and a variable number of slave nodes, depending on the size of the cluster
● The master node runs three services: namenode, secondarynamenode, and jobtracker
● The slaves run two services: tasktracker and datanode
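The service split above can be captured in a small illustrative sketch (the daemon names are the Hadoop 1.x ones listed in the slide; the node-naming scheme is an assumption):

```python
# Hadoop 1.x daemons per node role, as described above (illustrative sketch).
SERVICES = {
    "master": ["namenode", "secondarynamenode", "jobtracker"],
    "slave": ["tasktracker", "datanode"],
}

def cluster_layout(n_slaves):
    """Return the service assignment for one master plus n_slaves slaves."""
    layout = {"master": SERVICES["master"]}
    for i in range(n_slaves):
        layout[f"slave{i:03d}"] = SERVICES["slave"]
    return layout

layout = cluster_layout(100)  # the 101-node cluster used in the benchmarks
print(len(layout))            # 101
```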
Methodology: Cluster startup
● A customized VM image was created, including different versions of Hadoop (up to 1.2.1) and the Oracle Java JDK
● This image was registered in the EGI Marketplace
● Each RP manually downloaded the image to the local site in order to make it available at its endpoint
● To instantiate the VMs that form the Hadoop cluster we used rOCCI
● FedCloud does not have a WMS (workload management system), so each VM creation request must be sent to the appropriate endpoint
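A VM creation request per node could look roughly as follows. This is a sketch that builds an rOCCI-client invocation; the flag names are assumptions based on common rOCCI client usage, and the endpoint and template identifiers are hypothetical placeholders (real ones are site-specific).

```python
def rocci_create_cmd(endpoint, os_tpl, resource_tpl, title):
    """Build an rOCCI client 'create compute' command line.
    Flag syntax is an assumption; check your rOCCI client version."""
    return [
        "occi",
        "--endpoint", endpoint,                     # the chosen site's endpoint (no central WMS)
        "--action", "create",
        "--resource", "compute",
        "--mixin", f"os_tpl#{os_tpl}",              # the registered Hadoop VM image
        "--mixin", f"resource_tpl#{resource_tpl}",  # flavour (CPU/RAM)
        "--attribute", f"occi.core.title={title}",
    ]

# One request must be issued per VM in the cluster:
cmd = rocci_create_cmd("https://site.example.org:11443/", "hadoop-1.2.1", "medium", "hadoop-slave-001")
print(" ".join(cmd))
```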
Methodology: Benchmarks
● Two standard Big Data benchmarks: TeraGen and TeraSort
● Two use cases:
● Representing typical small and medium-size jobs
● Using real data sets and common MapReduce operations
Methodology: TeraGen/TeraSort
● The “equivalent” of Linpack in HPC
● The Sort Benchmark list: the “equivalent” of the Top500
● TeraGen generates an input data set in parallel
● TeraSort is a standard MapReduce sort job
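TeraGen generates fixed-size 100-byte rows, so the row count for a target data size is size/100. The sketch below builds the two benchmark invocations; the examples jar name varies by Hadoop distribution, so it is a parameter here rather than a fixed path.

```python
def teragen_cmd(target_bytes, out_dir, examples_jar="hadoop-examples.jar"):
    """TeraGen takes the number of 100-byte rows to generate.
    The jar name is distribution-dependent (assumption here)."""
    rows = target_bytes // 100
    return ["hadoop", "jar", examples_jar, "teragen", str(rows), out_dir]

def terasort_cmd(in_dir, out_dir, examples_jar="hadoop-examples.jar"):
    """TeraSort reads the TeraGen output and writes the sorted result."""
    return ["hadoop", "jar", examples_jar, "terasort", in_dir, out_dir]

# A 1 TB run needs 10 billion rows:
print(teragen_cmd(10**12, "/terasort-input")[4])  # 10000000000
```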
Methodology: Use cases
● Small-size jobs: Encyclopaedia Britannica (176 MB)
● Medium-size jobs: Wikipedia (41 GB)
Methodology: Use cases
● Both use cases were run for different cluster sizes ranging from 10 to 101 nodes
● In all the executions we measured two parameters:
● The put time (loading the data set into HDFS)
● The wordcount time
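The two measurements could be taken with a simple wall-clock wrapper like the one below. The `hadoop fs -put` and `hadoop jar … wordcount` invocations match Hadoop 1.x usage, but the directory paths and jar name are illustrative assumptions.

```python
import subprocess
import time

def timed(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def measure(dataset, hdfs_dir, out_dir, examples_jar="hadoop-examples.jar"):
    # put time: loading the data set into HDFS
    put_time = timed(["hadoop", "fs", "-put", dataset, hdfs_dir])
    # wordcount time: running the MapReduce job itself
    wc_time = timed(["hadoop", "jar", examples_jar, "wordcount", hdfs_dir, out_dir])
    return put_time, wc_time
```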
Results: Cluster startup
● Startup times for each cluster deployment varied
● The most representative are those obtained for the largest deployment: the 101-node cluster
● The time needed by the rOCCI client to return all the resource endpoints ranged from 71 to 86 minutes
● Having all the VMs running took around 80 minutes more
● Total cluster startup time ranged from 2.5 to 3 hours
OpenNebula Frontend Load
More than 20 simultaneous scp processes considerably affect system performance
Results: Startup Optimizations
● We reduced the size of the VM image to 4 GB
● An additional disk of 70 GB was created on-the-fly
● We modified the datastore used in OpenNebula, so that the transfer manager used is shared instead of ssh
● Instances are declared as non-persistent
● The image copy process is no longer done in a centralized way from the front-end node, but in parallel from each node using a simple copy operation from the shared storage
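The transfer-manager change can be expressed in the OpenNebula datastore template. A minimal sketch, assuming the standard `TM_MAD` attribute (the datastore name and any other attributes are site-specific placeholders):

```
# OpenNebula datastore template (sketch; names and paths are site-specific)
NAME   = "hadoop-images"
TM_MAD = "shared"   # was "ssh": images are now read in parallel from shared storage
```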
Results: Optimized Startup Times
Cluster size (#)   Startup time (s)
10                 249
21                 269
51                 822
101                1096
Results: Startup Amazon EC2
Cluster size (#)   Startup time (s)   Comments
10                 191
21                 178
51                 583*               2 nodes not working
101                276*               16 nodes not working
101                297*               1 node not working
Conclusions
● FedCloud is especially suitable for small and medium-size Hadoop jobs where the data set is already pre-deployed in HDFS
● Optimized startup times are very close to those obtained in Amazon EC2
● The initial data set deployment (put time) imposes a large overhead
● Scalability: we were able to run the TeraGen benchmark up to 2TB and the TeraSort benchmark up to 500GB
Future work: FedCloud Wish List
● Adding a central workload management system
● An automated way to distribute and synchronize the image between all sites
● A mechanism to query the resources available at a given site