BigData Meetup - OpenStack Sahara
Sergey Lukjanov, Andrew Lazarev
The State of OpenStack Data Processing: Sahara
Agenda
• Sahara overview
• Status & Roadmap
• EDP Technical Concepts
• Live demo
What is OpenStack?
https://www.openstack.org/software/
OpenStack Data Processing: Sahara
Mission: To provide a scalable data processing stack and associated management interfaces.
• provision and operate data processing clusters
• schedule and operate data processing jobs
Data processing in Sahara == Hadoop, Spark, etc.
Hadoop - Big Data Platform
© http://hortonworks.com/hadoop/yarn/
Trends
http://www.google.com/trends/
Use cases
• Self-service provisioning of Hadoop clusters
• Utilization of unused compute capacity for bursty workloads
• Dev -> Stage -> Prod lifecycle
• Run Hadoop workloads in a few clicks without expertise in Hadoop ops
Contributors
Architecture overview
[Diagram: Sahara architecture. Components shown: Horizon, Savanna Pages, Savanna Python Client, REST API, Auth, Data Access Layer, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Data Sources, Job Sources, Vendors Plugins, Hadoop VMs. OpenStack services shown: Keystone, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove DB.]
Current Status
● Part of Mirantis OpenStack
● Part of the OpenStack integrated release since Juno
● Launchpad home page: https://launchpad.net/sahara
● Integrated with OpenStack CI/CD
  ○ https://github.com/openstack/sahara
● Features: cluster provisioning and basic EDP
● Active contributors: Red Hat and Hortonworks
● Supported Hadoop distros:
  ○ Vanilla Apache Hadoop 1.2.1, 2.3.0 and 2.4.1
  ○ Hortonworks Data Platform 1.3.2 and 2.0.6
  ○ Cloudera CDH5
  ○ Spark 0.9.1 and 1.0.0
Features - Cluster Ops
● Hadoop cluster provisioning and operation
  ○ Templates for Hadoop cluster configuration
  ○ REST API for cluster startup and operations
  ○ Manual cluster scaling (add/remove nodes)
  ○ Data node anti-affinity
  ○ Swift integration
● UI integrated into Horizon
● Plugin mechanism for integration with different Hadoop distributions: Vanilla Apache, Hortonworks, Cloudera, Spark
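To make the template and manual-scaling bullets more concrete, here is a minimal Python sketch of a template-like cluster request and a scaling step. The field names (`plugin_name`, `hadoop_version`, `node_groups`, `anti_affinity`) and the helper functions are assumptions for illustration only; they are not taken from the slides and should be checked against the Sahara REST API reference.

```python
import copy
import json


def build_cluster_request(name, plugin, version, workers):
    """Assemble a hypothetical cluster-creation body from template-like settings."""
    if workers < 1:
        raise ValueError("a cluster needs at least one worker node")
    return {
        "name": name,
        "plugin_name": plugin,          # assumed field name
        "hadoop_version": version,      # assumed field name
        "anti_affinity": ["datanode"],  # assumed: processes to spread across hosts
        "node_groups": [
            {"name": "master", "count": 1},
            {"name": "worker", "count": workers},
        ],
    }


def scale_workers(request, new_count):
    """Return a copy of the request with the worker group resized."""
    if new_count < 1:
        raise ValueError("a cluster needs at least one worker node")
    scaled = copy.deepcopy(request)
    for group in scaled["node_groups"]:
        if group["name"] == "worker":
            group["count"] = new_count
    return scaled


request = build_cluster_request("demo", "vanilla", "2.4.1", workers=3)
print(json.dumps(scale_workers(request, 5), indent=2))
```

The scaling helper returns a new dictionary instead of mutating the original, so the stored template stays reusable for other clusters.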
Features - Jobs Ops
● EDP - API to execute MapReduce jobs without exposing details of the underlying infrastructure (similar to AWS EMR)
  ○ Pluggable workflow engine: Oozie, Spark
  ○ Pluggable data sources: Swift, HDFS, Ceph
  ○ Supported job types: Jar, Pig, Hive
● User-friendly UI for ad-hoc analytics queries based on Hive or Pig
● Transient cluster creation for a single job
Features - OpenStack Integration
● Neutron and Nova networking support
● Keystone trust model for async operations
● Full support of data locality: rack and 4-level awareness for HDFS and Swift
● Python client
● Integration with OpenStack ecosystem: Heat, Tempest, Devstack, Ceilometer
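Rack awareness in Hadoop is commonly driven by a topology script that receives node addresses and returns one topology path per address. The sketch below shows only that lookup logic; it is not Sahara's implementation, and the addresses and rack names are made up for illustration.

```python
# Hypothetical mapping from node address to topology path.
TOPOLOGY = {
    "10.0.0.11": "/rack1",
    "10.0.0.12": "/rack1",
    "10.0.0.21": "/rack2",
}

# Fallback path for nodes missing from the mapping.
DEFAULT_RACK = "/default-rack"


def resolve(addresses, topology=TOPOLOGY):
    """Return one topology path per address, in the same order."""
    return [topology.get(address, DEFAULT_RACK) for address in addresses]


print(resolve(["10.0.0.11", "10.0.0.21", "10.0.0.99"]))
```

Deeper hierarchies can be expressed by using longer paths as the mapping values; the lookup itself does not change.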
Kilo Tentative Plans
● Support for more distributions
  ○ MapR plugin (in review now)
  ○ Storm plugin (work in progress)
● Native Ceph support
● Ironic integration (bare metal provisioning)
● Complete work on the distributed Sahara engine
Elastic Data Processing
● EDP - API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)
  ○ Supported data sources: Swift, HDFS, Ceph
  ○ Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive
  ○ Pluggable workflow management engine: Oozie, Spark
● Supports both Hadoop 1 & 2
● Job executions on transient clusters
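The idea of running a job on a transient cluster (provision, run once, tear down) can be sketched with a Python context manager. The `provision`, `terminate`, and `run_job` callables are hypothetical stand-ins supplied by the caller; nothing here is Sahara API.

```python
from contextlib import contextmanager


@contextmanager
def transient_cluster(provision, terminate):
    """Provision a cluster, hand it to the caller, and always tear it down."""
    cluster = provision()
    try:
        yield cluster
    finally:
        # Runs whether the job succeeded or raised.
        terminate(cluster)


def run_job_on_transient_cluster(provision, terminate, run_job):
    """Run a single job on a cluster that exists only for that job."""
    with transient_cluster(provision, terminate) as cluster:
        return run_job(cluster)


# Fake lifecycle functions that only record what happened.
events = []


def fake_provision():
    events.append("provision")
    return "cluster-1"


def fake_terminate(cluster):
    events.append("terminate " + cluster)


result = run_job_on_transient_cluster(
    fake_provision, fake_terminate, lambda cluster: "ran on " + cluster
)
print(result, events)
```

Putting the teardown in `finally` means a failed job does not leave the cluster running and holding capacity.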
EDP Use Cases
● Simplified task execution. You don’t need to know Hadoop!
● Bursty workloads: ad-hoc queries requiring significant resources for only a short period of time
● Utilization of free IaaS capacity for Hadoop tasks
EDP - Data Sources
[Diagram: Sahara EDP runs the job on Hadoop VMs, with INPUT in Swift at swift://some_container/INPUT and OUTPUT in Swift at swift://some_container/OUTPUT.]
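The `swift://container/object` URLs on this slide can be split into their parts with the standard library. The following is a minimal sketch; the `SwiftDataSource` class is a hypothetical helper for illustration, not part of Sahara or its client.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SwiftDataSource:
    """Hypothetical value object for a Swift-backed data source."""
    container: str
    path: str

    @classmethod
    def from_url(cls, url):
        parsed = urlparse(url)
        if parsed.scheme != "swift":
            raise ValueError("not a swift:// URL: " + url)
        container = parsed.netloc
        path = parsed.path.lstrip("/")
        if not container or not path:
            raise ValueError("expected swift://<container>/<object>: " + url)
        return cls(container, path)

    def to_url(self):
        return "swift://{}/{}".format(self.container, self.path)


source = SwiftDataSource.from_url("swift://some_container/INPUT")
print(source.container, source.path)
```

Validating the URL before a job is submitted lets a malformed data source fail early instead of on the cluster.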
EDP - Job Binaries
[Diagram: Sahara EDP with job binaries stored in the Sahara DB (internal-db://script.pig) and in Swift (swift://some_container/mapreduce.jar).]
1. Pig, Hive scripts
2. Executable Jar files
3. Pluggable binaries and libraries
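Since job binaries can live in more than one store, a loader has to choose a backend from the URL scheme. Below is a minimal sketch of that dispatch using in-memory dictionaries as stand-ins for the Sahara DB and Swift; the function names and stored contents are hypothetical.

```python
from urllib.parse import urlparse

# In-memory stand-ins for the two stores on the slide (contents are made up).
SAHARA_DB = {"script.pig": b"-- pig script placeholder"}
SWIFT = {("some_container", "mapreduce.jar"): b"jar placeholder"}


def fetch_from_db(parsed):
    """internal-db://<name>: the name is the part after the scheme."""
    return SAHARA_DB[parsed.netloc]


def fetch_from_swift(parsed):
    """swift://<container>/<object>."""
    return SWIFT[(parsed.netloc, parsed.path.lstrip("/"))]


# One fetcher per supported URL scheme.
FETCHERS = {"internal-db": fetch_from_db, "swift": fetch_from_swift}


def fetch_binary(url):
    """Pick a fetcher by URL scheme and return the binary's bytes."""
    parsed = urlparse(url)
    fetcher = FETCHERS.get(parsed.scheme)
    if fetcher is None:
        raise ValueError("unsupported job binary scheme: " + repr(parsed.scheme))
    return fetcher(parsed)


print(len(fetch_binary("internal-db://script.pig")))
print(len(fetch_binary("swift://some_container/mapreduce.jar")))
```

Keeping the fetchers in a dictionary means another store can be supported by registering one more entry.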
EDP - Job Execution. Step 1
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig]
EDP - Job Execution. Step 2
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VMs; JobTracker; Oozie]
EDP - Job Execution. Step 3
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VMs; JobTracker; Oozie: Execute a job]
EDP - Job Execution. Step 4
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VMs; JobTracker; Oozie]
EDP - Job Execution. Step 5
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VMs; JobTracker; Oozie; workflow.xml]
workflow.xml contains:
1. Job-specific configurations
2. URLs to binaries
3. URLs for data sources
4. Credentials
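To show how the four kinds of information listed for workflow.xml could fit into one document, here is a simplified sketch that builds an Oozie-style Pig workflow with `xml.etree.ElementTree`. It is not Sahara's generator: the function name, the workflow and node names, and the configuration keys are assumptions, and the XML namespace and the job-tracker and name-node elements of a real Oozie workflow are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_pig_workflow(script, input_url, output_url, configs):
    """Build a simplified, Oozie-style workflow for a single Pig action."""
    app = ET.Element("workflow-app", {"name": "job-wf"})
    ET.SubElement(app, "start", {"to": "job-node"})
    action = ET.SubElement(app, "action", {"name": "job-node"})
    pig = ET.SubElement(action, "pig")

    # Job-specific configurations and credentials go in as properties.
    configuration = ET.SubElement(pig, "configuration")
    for key, value in sorted(configs.items()):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value

    # Reference to the binary, then the data source URLs as parameters.
    ET.SubElement(pig, "script").text = script
    ET.SubElement(pig, "param").text = "INPUT=" + input_url
    ET.SubElement(pig, "param").text = "OUTPUT=" + output_url

    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Job failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


workflow_xml = build_pig_workflow(
    "script.pig",
    "swift://some_container/INPUT",
    "swift://some_container/OUTPUT",
    {"example.swift.username": "demo", "example.swift.password": "secret"},  # hypothetical keys
)
print(workflow_xml)
```

Because credentials end up inside the generated document, the file should be treated as sensitive wherever it is stored.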
EDP - Job Execution. Step 6
[Diagram labels: Sahara EDP; Swift: INPUT, OUTPUT; DB: Jar, Pig; Jar, Pig; Hadoop VMs: Data Processing; JobTracker; Oozie; workflow.xml]
workflow.xml contains:
1. Job-specific configurations
2. URLs to binaries
3. URLs for data sources
4. Credentials
Q&A