Download - BigData Meetup - OpenStack Sahara

Sergey Lukjanov, Andrew Lazarev

The State of OpenStack Data Processing: Sahara

Agenda

• Sahara overview
• Status & Roadmap
• EDP Technical Concepts
• Live demo

What is OpenStack?

https://www.openstack.org/software/

OpenStack Data Processing: Sahara

Mission: To provide a scalable data processing stack and associated management interfaces.

• provision and operate data processing clusters
• schedule and operate data processing jobs

Data processing in Sahara == Hadoop, Spark, etc.

Hadoop - Big Data Platform

© http://hortonworks.com/hadoop/yarn/

Trends

http://www.google.com/trends/

Use cases

• Self-service provisioning of Hadoop clusters
• Utilization of unused compute capacity for bursty workloads
• Dev -> Stage -> Prod lifecycle
• Run Hadoop workloads in a few clicks without expertise in Hadoop ops

Contributors

Architecture overview

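[Diagram: Sahara architecture. Labels shown (the diagram still uses the project's earlier name, Savanna):]

• Entry points: Horizon (Savanna Pages), Savanna Python Client, REST API
• Auth: Keystone
• Internal components: Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Vendors Plugins, Data Access Layer
• Provisioned resources: Hadoop VMs
• OpenStack services: Swift, Heat, Nova, Glance, Cinder, Neutron, Trove
• DB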

Current Status

● Part of Mirantis OpenStack
● Part of OpenStack Integrated release from Juno
● Launchpad home page: https://launchpad.net/sahara
● Integrated with OpenStack CI/CD
  ○ https://github.com/openstack/sahara
● Features: cluster provisioning and basic EDP
● Active contributors: Red Hat and Hortonworks
● Supported Hadoop distros:
  ○ Vanilla Apache Hadoop 1.2.1, 2.3.0 and 2.4.1
  ○ Hortonworks Data Platform 1.3.2 and 2.0.6
  ○ Cloudera CDH5
  ○ Spark 0.9.1 and 1.0.0

Features - Cluster Ops

● Hadoop cluster operation and provisioning
  ○ Templates for Hadoop cluster configuration
  ○ REST API for cluster startup and operations
  ○ Manual cluster scaling (add/remove nodes)
  ○ Data node anti-affinity
  ○ Swift integration
● UI integrated into Horizon
● Plugin mechanism for integration with different Hadoop distributions - Vanilla Apache, Hortonworks, Cloudera, Spark
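
The cluster templates, REST startup and manual scaling listed for cluster operations can be illustrated with a minimal sketch of the JSON bodies a client might send. The field names are modeled on the general shape of Sahara's REST API, but they, the helper functions, the flavor ID and the process names are assumptions for illustration; check the API reference for the release in use. No request is actually sent.

```python
import json


def node_group(name, flavor_id, node_processes, count):
    """One group of identical nodes and the Hadoop processes they run."""
    return {
        "name": name,
        "flavor_id": flavor_id,  # hypothetical flavor ID
        "node_processes": list(node_processes),
        "count": count,
    }


def cluster_template(name, plugin_name, hadoop_version, node_groups):
    """A reusable cluster layout tied to one plugin and Hadoop version."""
    return {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "node_groups": list(node_groups),
    }


def scale_request(group_name, new_count):
    """Body for manually resizing one node group (field name assumed)."""
    return {"resize_node_groups": [{"name": group_name, "count": new_count}]}


# Illustrative layout: one master and three workers on Vanilla Apache Hadoop 2.4.1.
template = cluster_template(
    "demo-template",
    "vanilla",
    "2.4.1",
    [
        node_group("master", "2", ["namenode", "resourcemanager"], 1),
        node_group("worker", "2", ["datanode", "nodemanager"], 3),
    ],
)

print(json.dumps(template, indent=2))
print(json.dumps(scale_request("worker", 5)))
```

Keeping the layout in a template separates "what a cluster looks like" from "start one now", which is what makes self-service provisioning and repeated Dev/Stage/Prod launches practical.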

Features - Jobs Ops

● EDP - API to execute MapReduce jobs without exposing details of the underlying infrastructure (similar to AWS EMR)
  ○ Pluggable workflow engine: Oozie, Spark
  ○ Pluggable data sources: Swift, HDFS, Ceph
  ○ Supported job types: Jar, Pig, Hive
● User-friendly UI for ad-hoc analytics queries based on Hive or Pig
● Transient cluster creation for a single job

Features - OpenStack Integration

● Neutron and Nova networking support
● Keystone trust model for async operations
● Full support of data locality - rack and 4-level awareness for HDFS and Swift
● Python client
● Integration with OpenStack ecosystem: Heat, Tempest, Devstack, Ceilometer

Kilo Tentative Plans

● Support for more distributions
  ○ MapR plugin (on review now)
  ○ Storm plugin (work in progress)
● Native Ceph support
● Ironic integration (bare metal provisioning)
● Complete work on distributed Sahara engine

Elastic Data Processing

● EDP - API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)
  ○ Supported data sources: Swift, HDFS, Ceph
  ○ Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive
  ○ Pluggable workflow management engine: Oozie, Spark
● Supports both Hadoop 1 & 2
● Job executions on transient clusters
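
As a rough illustration of the EDP API described above, the sketch below builds the body of a job execution request. The job type names come from the slide; the function, the field names (`cluster_id`, `input_id`, `output_id`, `job_configs`) and all IDs are assumptions modeled on the general shape of Sahara's REST API, not a definitive description of it.

```python
# Job types as listed on the slide.
SUPPORTED_JOB_TYPES = {"Java", "MapReduce", "MapReduce.Streaming", "Pig", "Hive"}


def job_execution_request(job_type, cluster_id, input_id, output_id,
                          configs=None, params=None, args=None):
    """Build a request body that runs a job on a cluster (field names assumed)."""
    if job_type not in SUPPORTED_JOB_TYPES:
        raise ValueError(f"unsupported job type: {job_type!r}")
    return {
        "cluster_id": cluster_id,
        "input_id": input_id,    # ID of a registered input data source
        "output_id": output_id,  # ID of a registered output data source
        "job_configs": {
            "configs": dict(configs or {}),
            "params": dict(params or {}),
            "args": list(args or []),
        },
    }


# Hypothetical IDs, for illustration only.
request = job_execution_request(
    "Pig",
    cluster_id="cluster-1",
    input_id="ds-input",
    output_id="ds-output",
    params={"THRESHOLD": "10"},
)
print(request)
```

The caller names a cluster and two data sources and never touches Oozie or Hadoop directly, which is the "without exposing details of underlying infrastructure" point.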

EDP Use Cases

● Simplified task executions. You don’t need to know Hadoop!

● Bursty workloads: ad-hoc queries that need significant resources only for a short period of time

● Utilization of free IaaS capacity for Hadoop tasks

EDP - Data Sources

[Diagram: Sahara EDP runs a job on a cluster of Hadoop VMs, with INPUT and OUTPUT stored in Swift.]

● INPUT: swift://some_container/INPUT
● OUTPUT: swift://some_container/OUTPUT
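
The `swift://` data source URLs above name a container and an object path. A minimal sketch of splitting such a URL, using only the Python standard library (the `SwiftLocation` type and `parse_swift_url` helper are hypothetical names, not part of Sahara):

```python
from typing import NamedTuple
from urllib.parse import urlparse


class SwiftLocation(NamedTuple):
    container: str
    object_path: str


def parse_swift_url(url):
    """Split swift://<container>/<object path> into its two parts."""
    parts = urlparse(url)
    if parts.scheme != "swift" or not parts.netloc:
        raise ValueError(f"not a swift:// URL: {url!r}")
    return SwiftLocation(parts.netloc, parts.path.lstrip("/"))


source = parse_swift_url("swift://some_container/INPUT")
sink = parse_swift_url("swift://some_container/OUTPUT")
print(source, sink)
```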

EDP - Job Binaries

[Diagram: Sahara EDP takes job binaries from the Sahara DB and from Swift.]

● Sahara DB: internal-db://script.pig
● Swift: swift://some_container/mapreduce.jar

1. Pig, Hive scripts
2. Executable Jar files
3. Pluggable binaries and libraries
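
Job binaries are addressed by URL, and the scheme says where the bytes live: `internal-db://` for the Sahara DB, `swift://` for Swift. The sketch below shows one way to dispatch on the scheme. The in-memory dictionaries stand in for the two stores and, like the function name, are assumptions for illustration only.

```python
from urllib.parse import urlparse

# Stand-ins for the real stores; contents are made up for the example.
INTERNAL_DB = {"script.pig": b"A = LOAD '$INPUT';"}
SWIFT = {("some_container", "mapreduce.jar"): b"jar-bytes"}


def fetch_binary(url):
    """Return the bytes of a job binary, choosing the store by URL scheme."""
    parts = urlparse(url)
    if parts.scheme == "internal-db":
        # internal-db://<name>: the name is parsed as the network location
        return INTERNAL_DB[parts.netloc]
    if parts.scheme == "swift":
        # swift://<container>/<object>
        return SWIFT[(parts.netloc, parts.path.lstrip("/"))]
    raise ValueError(f"unsupported job binary URL: {url!r}")


pig_script = fetch_binary("internal-db://script.pig")
jar = fetch_binary("swift://some_container/mapreduce.jar")
print(len(pig_script), len(jar))
```

Small scripts fit naturally in the database, while larger Jar files are better kept in object storage; the URL scheme lets the job definition stay the same either way.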

EDP - Job Execution

[Diagram sequence, starting at Step 1. Labels recoverable from the frames:]

● Sahara EDP, Swift holding INPUT, and Sahara DB holding the Jar and Pig binaries
● Hadoop VMs running JobTracker and Oozie
● Jar and Pig binaries placed on the cluster
● "Execute a job" sent to Oozie
● workflow.xml, containing:
  1. Job-specific configurations
  2. URLs to binaries
  3. URLs for data sources
  4. Credentials
● Data Processing on the Hadoop VMs, producing OUTPUT
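
The four items carried by workflow.xml (job-specific configurations, URLs to binaries, URLs for data sources, credentials) can be sketched as a small generator for an Oozie Pig workflow. This is a simplified illustration built with the standard library, not the file Sahara actually generates: the function name, host names, credential values and the `fs.swift.service.sahara.*` property keys are assumptions.

```python
import xml.etree.ElementTree as ET

OOZIE_NS = "uri:oozie:workflow:0.2"


def build_pig_workflow(name, script, job_tracker, name_node, configs, params):
    """Render a one-action Oozie workflow that runs a Pig script."""
    app = ET.Element("workflow-app", {"xmlns": OOZIE_NS, "name": name})
    ET.SubElement(app, "start", {"to": "job-node"})
    action = ET.SubElement(app, "action", {"name": "job-node"})
    pig = ET.SubElement(action, "pig")
    ET.SubElement(pig, "job-tracker").text = job_tracker
    ET.SubElement(pig, "name-node").text = name_node
    # Items 1 and 4: job-specific configuration and credentials as properties.
    configuration = ET.SubElement(pig, "configuration")
    for key, value in sorted(configs.items()):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    # Item 2: the binary to run.
    ET.SubElement(pig, "script").text = script
    # Item 3: data source URLs passed to the script as parameters.
    for key, value in sorted(params.items()):
        ET.SubElement(pig, "param").text = f"{key}={value}"
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


workflow_xml = build_pig_workflow(
    name="demo-job",
    script="script.pig",
    job_tracker="master:8032",      # hypothetical address
    name_node="hdfs://master:8020",  # hypothetical address
    configs={
        "fs.swift.service.sahara.username": "demo-user",      # assumed key, placeholder value
        "fs.swift.service.sahara.password": "demo-password",  # assumed key, placeholder value
    },
    params={
        "INPUT": "swift://some_container/INPUT",
        "OUTPUT": "swift://some_container/OUTPUT",
    },
)
print(workflow_xml)
```

Because the credentials travel inside the workflow definition, the Hadoop nodes can read and write Swift objects on the user's behalf without the user logging in to the cluster.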
