Supporting Big Data Processing via Science Gateways
EGI CF 2015, 10-13 November, Bari, Italy
Dr Tamas Kiss, CloudSME Project Director
University of Westminster, London, [email protected]
Authors: Tamas Kiss, Shashank Gugnani, Gabor Terstyanszky, Peter Kacsuk, Carlos Blanco, Giuliano Castelli
MapReduce/Hadoop
* MapReduce: processes large datasets in parallel on thousands of nodes in a reliable and fault-tolerant manner
* Map: input data is divided into chunks and analysed on different nodes in parallel
* Reduce: collates the work and combines the results into a single value
* Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
* Originally designed for bare-metal clusters – popularity in the cloud is growing
* Hadoop: open source implementation of the MapReduce framework, which was introduced by Google in 2004
Introduction: MapReduce and big data
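The map and reduce roles above can be illustrated with word count, the standard Hadoop example used later in the experiments. This is a minimal pure-Python sketch of the two phases, not the deck's actual Hadoop code; the function names are illustrative only.

```python
from collections import defaultdict

def mapper(lines):
    """Map phase: split each input chunk and emit a (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: collate the emitted pairs into a single count per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# In real Hadoop the framework shuffles the pairs between the phases and
# re-executes failed tasks; here the two phases simply run back to back.
sample = ["the quick brown fox", "the lazy dog"]
print(reducer(mapper(sample)))
```

In Hadoop the same two functions would run on many nodes at once, with the framework handling partitioning and fault tolerance.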
Motivation
* Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) are parallelized using the MapReduce framework
* Installation and configuration of a Hadoop cluster is well beyond the capabilities of most domain scientists

Aim
* Integration of Hadoop with workflow systems and science gateways
* Automatic setup of the Hadoop software and infrastructure
* Utilization of the power of cloud computing
Motivations
CloudSME project
• To develop a cloud-based simulation platform for manufacturing and engineering
• Funded by the European Commission FP7 programme, FoF: Factories of the Future
• July 2013 – March 2016
• EUR 4.5 million overall funding
• Coordinated by the University of Westminster
• 29 project partners from 8 European countries
• 24 companies (all SMEs) and 5 academic/research institutions
• Spin-off company established – CloudSME UG
• One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation
Motivations
* Set up a disposable cluster in the cloud, execute the Hadoop job and destroy the cluster
* Cluster-related parameters and input files are provided by the user
* The workflow node executable would be a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job
* Two methods proposed:
  * Single Node Method
  * Three Node Method
Approach
* Aim:
  * execute MapReduce jobs on cloud resources
  * automatically set up and destroy the execution environment in the cloud
* Infrastructure-aware workflow:
  * the necessary execution environment is transparently set up before and destroyed after execution
  * carried out from the workflow without further user intervention
* Steps:
  1. the execution environment is created dynamically in the cloud
  2. the workflow tasks are executed
  3. the infrastructure is torn down, releasing its resources

Approach: Infrastructure-aware workflow
* Connect to the cloud and launch servers
* Connect to the master node server and set up the cluster configuration
* Transfer input files and the job executable to the master node
* Start the Hadoop job by running a script on the master node
* When the job is finished, delete the servers from the cloud and retrieve the output if the job was successful

Approach: Single node method
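A workflow node implementing the single node method could follow the control flow sketched below. All cloud and SSH operations are stubbed out with hypothetical placeholder functions; the real executable performs them through the CloudBroker platform.

```python
def launch_servers(n):
    """Placeholder: start n cloud instances and return their addresses
    (hypothetical stub; the real node launches them through CloudBroker)."""
    return [f"10.0.0.{i}" for i in range(1, n + 1)]

def single_node_method(num_nodes, input_files, job_script):
    """One workflow-node executable performs every phase of the method."""
    log = []
    nodes = launch_servers(num_nodes)          # connect to cloud, launch servers
    master = nodes[0]
    log.append(("configure-cluster", master))  # set up Hadoop on the master node
    for f in input_files + [job_script]:
        log.append(("upload", f))              # transfer inputs and the executable
    log.append(("run-job", job_script))        # start the Hadoop job via a script
    log.append(("download-output", master))    # retrieve output if successful
    log.append(("destroy", tuple(nodes)))      # delete the servers from the cloud
    return log

steps = single_node_method(3, ["input.txt"], "run_hadoop.sh")
print([s[0] for s in steps])
```

The essential property is that the entire lifecycle, from launch to destruction, happens inside one workflow node, so every execution pays the full cluster setup cost.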
* Stage 1 (Deploy Hadoop node): launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
* Stage 2 (Execute node): upload input files and the job executable to the master node, execute the job and get the results back
* Stage 3 (Destroy Hadoop node): destroy the cluster to free up resources

Approach: Three node method
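Because the three stages run as separate workflow nodes, Stage 1 has to persist the cluster configuration so the later nodes can find the same cluster. A sketch using a JSON descriptor file (the format and addresses are hypothetical; the descriptor would travel between nodes as an ordinary workflow output/input, and the launch itself is stubbed):

```python
import json

def deploy_stage(num_nodes, config_path):
    """Stage 1: launch the cluster (stubbed here) and save its configuration."""
    cluster = {"master": "10.0.0.1",
               "workers": [f"10.0.0.{i}" for i in range(2, num_nodes + 1)]}
    with open(config_path, "w") as f:
        json.dump(cluster, f)
    return cluster

def execute_stage(config_path, job_script):
    """Stage 2: reload the configuration and submit a Hadoop job to the cluster."""
    with open(config_path) as f:
        cluster = json.load(f)
    return f"submitting {job_script} to master {cluster['master']}"

def destroy_stage(config_path):
    """Stage 3: reload the configuration and release every server."""
    with open(config_path) as f:
        cluster = json.load(f)
    return [cluster["master"], *cluster["workers"]]
```

Several execute stages can run between one deploy/destroy pair, which is what makes the three node method cheaper when many jobs share a cluster.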
Implementation
[Architecture diagram (© CloudBroker GmbH, all rights reserved): end users, software vendors and resource providers use tools such as a web browser UI, a Java client library and a CLI to reach chemistry, biology, pharma, engineering and other applications through the CloudBroker Platform's REST web service API, which dispatches jobs to Amazon, CloudSigma, OpenStack, OpenNebula, Eucalyptus and other clouds.]
• Seamless access to heterogeneous cloud resources – high level interoperability
Implementation: CloudBroker platform
• General purpose, workflow-oriented gateway framework
• Supports the development and execution of workflow-based applications
• Enables the multi-cloud and multi-grid execution of any workflow
• Supports the fast development of gateway instances by a customization technology
Implementation: WS-PGRADE/gUSE
* Each box describes a task
* Each arrow describes information flow such as input files and output files
* A special node describes parameter sweeps

Implementation: WS-PGRADE/gUSE
Implementation: SHIWA workflow repository
• Workflow repository to store directly executable workflows
• Supports various workflow systems, including WS-PGRADE, Taverna, MOTEUR, Galaxy, etc.
• Fully integrated with WS-PGRADE/gUSE
Implementation: Supported storage solutions
Local (user’s machine):
* Bottleneck for large files
* Multiple file transfers: local machine – WS-PGRADE – CloudBroker – bootstrap node – master node – HDFS

Swift:
* Two file transfers: Swift – master node – HDFS

Amazon S3:
* Direct transfer from S3 to HDFS using Hadoop’s distributed copy application (DistCp)
Input/output locations can be mixed and matched in one workflow
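The direct S3-to-HDFS path relies on Hadoop's DistCp tool. A minimal sketch of how a workflow node might build that invocation (the bucket name and paths are made up, and the URI scheme depends on the deployed Hadoop version, e.g. `s3n://` on Hadoop 2.5.x versus `s3a://` on newer releases):

```python
def distcp_command(src, dst):
    """Build the `hadoop distcp` call that copies data between storage
    systems (e.g. S3 -> HDFS) without staging it on a local machine."""
    return ["hadoop", "distcp", src, dst]

# Hypothetical source bucket and HDFS destination; on the master node the
# command list would be handed to subprocess.run(cmd, check=True).
cmd = distcp_command("s3n://example-bucket/input", "hdfs:///user/hadoop/input")
print(" ".join(cmd))
```

Because DistCp runs as a MapReduce job itself, the copy is performed by the cluster nodes in parallel rather than funnelled through one machine.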
Experiments and results
Initial testbed
* CloudSME production gUSE (v3.6.6) portal
* Jobs submitted using the CloudSME CloudBroker platform
* All jobs submitted to the University of Westminster OpenStack cloud
* Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers
Experiments and results
Hadoop applications used for experiments
* WordCount – the standard Hadoop example
* Rule Based Classification – a classification algorithm adapted for MapReduce
* PrefixSpan – a MapReduce version of the popular sequential pattern mining algorithm
Experiments and results
Single node: the Hadoop cluster is created and destroyed multiple times
Three node: multiple Hadoop jobs run between a single pair of create and destroy nodes
Experiments and results
Experiments and results
Single node method: 5 jobs on a 5 node cluster each, using the WS-PGRADE parameter sweep feature
Single node method: single Hadoop jobs on a 5 node cluster
* The solution works for any Hadoop application
* The proposed approach is generic and can be used in any gateway environment and cloud
* The user can choose the appropriate method (Single or Three Node) according to the application
* The parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
* Can be used for large scale scientific simulations
* EGI Federated Cloud integration:
  * already runs on some EGI FedCloud resources: SZTAKI, BIFI
  * WS-PGRADE is fully integrated with EGI FedCloud
  * CloudBroker does not currently support EGI FedCloud directly

Conclusion
Any questions?
http://cloudsme.eu
http://www.cloudsme-apps.com/