bigdata meetup - openstack sahara

Sergey Lukjanov, Andrew Lazarev

The State of OpenStack Data Processing: Sahara

Agenda

• Sahara overview
• Status & Roadmap
• EDP Technical Concepts
• Live demo

What is OpenStack?

https://www.openstack.org/software/

OpenStack Data Processing: Sahara

Mission: To provide a scalable data processing stack and associated management interfaces.

• provision and operate data processing clusters
• schedule and operate data processing jobs

Data processing in Sahara == Hadoop, Spark, etc.

Hadoop - Big Data Platform

Trends

http://www.google.com/trends/

Use cases

• Self-service provisioning of Hadoop clusters
• Utilization of unused compute capacity for bursty workloads
• Dev -> Stage -> Prod lifecycle
• Run Hadoop workloads in a few clicks without expertise in Hadoop ops

Contributors

Architecture overview

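[Architecture diagram. Labels that survive extraction (the diagram still uses the project's former name, Savanna): Horizon with Savanna Pages; Savanna Python Client; "RE" (truncated label); Data Access Layer; Cluster Configuration Manager; Resources Orchestration Manager; Job Manager; Job Sources; Data Sources; Vendors Plugins; Hadoop VMs; and the OpenStack services Keystone, Glance, Cinder, Neutron, and Trove DB.]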

Current Status

● Part of Mirantis OpenStack
● Part of the OpenStack integrated release since Juno
● Launchpad home page: https://launchpad.net/sahara
● Integrated with OpenStack CI/CD
○ https://github.com/openstack/sahara
● Features: cluster provisioning and basic EDP
● Active contributors: Red Hat and Hortonworks
● Supported Hadoop distros:
○ Vanilla Apache Hadoop 1.2.1, 2.3.0 and 2.4.1
○ Hortonworks Data Platform 1.3.2 and 2.0.6
○ Cloudera CDH5
○ Spark 0.9.1 and 1.0.0

Features - Cluster Ops

● Hadoop cluster provisioning and operation
○ Templates for Hadoop cluster configuration
○ REST API for cluster startup and operations
○ Manual cluster scaling (add/remove nodes)
○ Data node anti-affinity
○ Swift integration
● UI integrated into Horizon
● Plugin mechanism for integration with different Hadoop distributions: Vanilla Apache, Hortonworks, Cloudera, Spark
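Data node anti-affinity means that instances running the same process (for example HDFS DataNodes) are kept on different compute hosts. The placement rule can be sketched as below. This is only an illustration of the constraint, not Sahara's scheduling code; the function name `place_with_anti_affinity` and all host and instance names are hypothetical.

```python
def place_with_anti_affinity(instances, hosts, occupied=()):
    """Map each instance to a host that holds no other instance of the group.

    instances: names of instances that must not share a host
    hosts:     candidate compute hosts, in preference order
    occupied:  hosts already running an instance of this group
    """
    taken = set(occupied)
    free = [h for h in hosts if h not in taken]
    if len(instances) > len(free):
        raise ValueError(
            f"need {len(instances)} free hosts, only {len(free)} available")
    # zip pairs each instance with its own host, so no host is reused.
    return dict(zip(instances, free))


# Hypothetical names: two new DataNodes, one host already in use.
placement = place_with_anti_affinity(
    ["datanode-1", "datanode-2"],
    ["host-a", "host-b", "host-c"],
    occupied=["host-a"],
)
print(placement)  # {'datanode-1': 'host-b', 'datanode-2': 'host-c'}
```

Failing loudly when hosts run out is a design choice of this sketch; a real scheduler might instead queue the request or relax the constraint.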

Features - Jobs Ops

● EDP: API to execute MapReduce jobs without exposing details of the underlying infrastructure (similar to AWS EMR)
○ Pluggable workflow engine: Oozie, Spark
○ Pluggable data sources: Swift, HDFS, Ceph
○ Supported job types: Jar, Pig, Hive
● User-friendly UI for ad-hoc analytics queries based on Hive or Pig
● Transient cluster creation for a single job
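One way to picture a pluggable workflow engine is a small registry in which each engine declares the job types it can run and a dispatcher picks the first match. The sketch below is an assumption-laden illustration, not Sahara's plugin interface: the `Engine` class, `pick_engine`, and the `"Spark"` job type are hypothetical, and only the Jar, Pig, and Hive job types come from the slide.

```python
class Engine:
    """A workflow engine that declares which job types it can run."""

    def __init__(self, name, job_types):
        self.name = name
        self.job_types = frozenset(job_types)


def pick_engine(engines, job_type):
    """Return the first registered engine that supports job_type."""
    for engine in engines:
        if job_type in engine.job_types:
            return engine
    raise LookupError(f"no engine registered for job type {job_type!r}")


engines = [
    Engine("oozie", ["Jar", "Pig", "Hive"]),  # job types listed on the slide
    Engine("spark", ["Spark"]),               # assumed job type, for illustration
]
print(pick_engine(engines, "Pig").name)  # oozie
```

Adding another engine then only means appending to the list; the dispatcher itself does not change.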

Features - OpenStack Integration

● Neutron and nova-network support
● Keystone trust model for async operations
● Full support for data locality: rack and 4-level awareness for HDFS and Swift
● Python client
● Integration with the OpenStack ecosystem: Heat, Tempest, Devstack, Ceilometer

Kilo Tentative Plans

● Support for more distributions
○ MapR plugin (in review now)
○ Storm plugin (work in progress)
● Native Ceph support
● Ironic integration (bare metal provisioning)
● Complete work on the distributed Sahara engine

Elastic Data Processing

● EDP: API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)
○ Supported data sources: Swift, HDFS, Ceph
○ Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive
○ Pluggable workflow management engine: Oozie, Spark
● Supports both Hadoop 1 & 2
● Job executions on transient clusters
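Running a job on a transient cluster means the cluster exists only for the lifetime of that job. A minimal sketch of that lifecycle, using a context manager so teardown happens even if the job fails, might look like this. The name `transient_cluster` and the event log are hypothetical stand-ins; no real provisioning call is made.

```python
from contextlib import contextmanager


@contextmanager
def transient_cluster(name, log):
    """Provision a cluster for one job and always tear it down afterwards."""
    log.append(f"create {name}")      # stand-in for cluster provisioning
    try:
        yield name
    finally:
        log.append(f"delete {name}")  # runs on success and on failure


events = []
with transient_cluster("edp-tmp", events) as cluster:
    events.append(f"run job on {cluster}")

print(events)
# ['create edp-tmp', 'run job on edp-tmp', 'delete edp-tmp']
```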

EDP Use Cases

● Simplified task executions. You don’t need to know Hadoop!

● Bursty workloads: ad-hoc queries that need significant resources for only a short time

● Utilization of free IaaS capacity for Hadoop tasks

EDP - Data Sources

[Diagram: Sahara EDP running on a Hadoop VM reads its input from Swift and writes its output back to Swift.]

Input: swift://some_container/INPUT
Output: swift://some_container/OUTPUT
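The `swift://` URLs above name a container and an object path. A short sketch of splitting such a URL with the Python standard library follows; `SwiftLocation` and `parse_swift_url` are hypothetical helper names, not part of Sahara.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class SwiftLocation(NamedTuple):
    container: str
    object_path: str


def parse_swift_url(url):
    """Split swift://<container>/<object path> into its two parts."""
    parts = urlparse(url)
    if parts.scheme != "swift" or not parts.netloc:
        raise ValueError(f"not a swift:// URL: {url!r}")
    return SwiftLocation(parts.netloc, parts.path.lstrip("/"))


src = parse_swift_url("swift://some_container/INPUT")
dst = parse_swift_url("swift://some_container/OUTPUT")
print(src.container, src.object_path)  # some_container INPUT
print(dst.container, dst.object_path)  # some_container OUTPUT
```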

EDP - Job Binaries

[Diagram: Sahara EDP takes job binaries from the Sahara DB and from Swift.]

Sahara DB: internal-db://script.pig
Swift: swift://some_container/mapreduce.jar

1. Pig, Hive scripts
2. Executable Jar files
3. Pluggable binaries and libraries
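Because job binaries can live either in the Sahara DB (`internal-db://`) or in Swift (`swift://`), whatever fetches them has to dispatch on the URL scheme. The sketch below shows that dispatch with in-memory dictionaries standing in for both stores; `make_resolver` and the stored byte strings are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlparse


def make_resolver(stores):
    """Build a function that fetches a binary from the store named by its URL scheme."""

    def resolve(url):
        parts = urlparse(url)
        try:
            store = stores[parts.scheme]
        except KeyError:
            raise ValueError(f"unsupported binary location: {url!r}") from None
        # internal-db://script.pig            -> key "script.pig"
        # swift://some_container/mapreduce.jar -> key "some_container/mapreduce.jar"
        return store[parts.netloc + parts.path]

    return resolve


# In-memory stand-ins for the Sahara DB and for Swift (contents are made up).
stores = {
    "internal-db": {"script.pig": b"-- pig script body"},
    "swift": {"some_container/mapreduce.jar": b"jar bytes"},
}
resolve = make_resolver(stores)
print(resolve("internal-db://script.pig"))
print(resolve("swift://some_container/mapreduce.jar"))
```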

EDP - Job Execution

[Diagram, built up over successive slides. Labels: Sahara; Swift: INPUT; DB: Jar, Pig; Jar, Pig; Hadoop VM; JobTracker; Oozie: "Execute a job"; workflow.xm; Data Processing; OUTPUT.]

The workflow.xm file shown on the Hadoop VM contains:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials
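The four items carried by the workflow file (job-specific configurations, URLs to binaries, URLs for data sources, credentials) can be pictured as one small XML document. The sketch below assembles such a document with the standard library. The element names are invented for illustration and do not follow the real Oozie workflow schema; `build_workflow`, the property names, and the credential values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_workflow(configs, binaries, data_sources, credentials):
    """Assemble a simplified workflow document from the four kinds of input."""
    root = ET.Element("workflow")

    conf = ET.SubElement(root, "configuration")      # 1. job-specific configs
    for name, value in sorted(configs.items()):
        ET.SubElement(conf, "property", name=name).text = value

    for url in binaries:                             # 2. URLs to binaries
        ET.SubElement(root, "binary").text = url

    for role, url in sorted(data_sources.items()):   # 3. URLs for data sources
        ET.SubElement(root, "data-source", role=role).text = url

    cred = ET.SubElement(root, "credentials")        # 4. credentials
    for name, value in sorted(credentials.items()):
        ET.SubElement(cred, "property", name=name).text = value

    return ET.tostring(root, encoding="unicode")


workflow_xml = build_workflow(
    configs={"example.reducers": "2"},               # made-up property
    binaries=["internal-db://script.pig"],
    data_sources={
        "input": "swift://some_container/INPUT",
        "output": "swift://some_container/OUTPUT",
    },
    credentials={"user": "demo", "password": "not-a-real-secret"},
)
print(workflow_xml)
```

Keeping credentials in their own element makes it easy to see that they travel with the job; in practice they would need to be protected rather than written in plain text.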
