satisfaction hadoop meetup presentation

Introducing Satisfaction:

The Next-Generation

Hadoop Scheduler

Jerome Banks

Big Data Engineer

08 April 2015

April SF Hadoop User’s Meetup

Satisfaction

Rise of the Data Product

● Industries of tomorrow produce “Data Products”

o A product created by processing Big Data

o Accuracy and timeliness of data are key to its value

● Increasing pace of change

o Need to quickly prototype multiple different ideas

o Need to control the accidental complexity of chaos

Enter the Solution !!!

● Satisfaction runs your Big Data Jobs!

o Geared for the Hadoop/Hive/Scala ecosystem

o Monitors job status

o Tracks job progress

● Development model for Data Scientists and Engineers

● Dashboard for Ops and C-Level Execs

Motivation: Why another Scheduler?

● Existing solutions are hard to use and incomplete

o Mish-mash of Oozie, Jenkins, and shell scripts

● Data-driven Agile orgs produce an increasing number of workflows

● Value of data makes SLA slippage very expensive

● Complexity of interactions between disparate Data sources

● Lack of understanding of org’s data assets

o How was this data generated?

o Who is using this data, and how?

● Operational issues are painful for DevOps group

o Monitoring, Tracking

o Restarts and Alerting

Next Generation Hadoop Scheduler

● Runs Hadoop/Hive Jobs

o Successor to Oozie/Azkaban/Jenkins

o Extensible to new technologies

● DevOps Infrastructure

o Job Monitoring/Notifications

o Job History/Log Capture

● Development Model

o DSL for Defining Workflows

o Packaging/Deployment for Big Data Apps

Three sets of customers

● Data Engineers/Data Scientists

o Crunch “Big Data”

o Create “Data Products”

● DevOps

o Keeps the trains running on time

o Fixes things in bad weather

● C-Level Execs

o Want to know the status of the company

o How much data are we processing?

o Will we make SLA?

o What are our data assets? What are they worth?

Differentiators

● Backward-Chaining

o Define dependencies, not flow graphs

● Data-focused

o Specify DataOutputs, not Actions

o Don’t re-generate already existing data

● Extensible

o Can implement new Satisfiers for new technologies (Spark, Shell, Scalding, etc.)

o Scala DSL, not XML

Simple things are simple, complex things are possible
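A minimal sketch of the backward-chaining and data-focused ideas above, in plain Python rather than the Satisfaction Scala DSL. The `Goal` class, the `satisfy` function, and the output names are all hypothetical and exist only for illustration: you ask for the final output, dependencies are resolved first, and anything that already exists is skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    """Hypothetical goal: one named data output plus the goals it depends on."""
    output: str
    depends_on: list = field(default_factory=list)


def satisfy(goal, existing, run_log):
    """Backward-chain from `goal`: dependencies first, skipping existing outputs."""
    if goal.output in existing:
        return  # data is already there, so it is not regenerated
    for dep in goal.depends_on:
        satisfy(dep, existing, run_log)
    run_log.append(goal.output)  # stand-in for running the real job
    existing.add(goal.output)


raw = Goal("raw_events")
sessions = Goal("sessions", [raw])
report = Goal("daily_report", [sessions, raw])

existing = {"raw_events"}  # pretend this output was already produced
run_log = []
satisfy(report, existing, run_log)
print(run_log)  # ['sessions', 'daily_report']
```

Only dependencies are declared; the execution order falls out of the walk from the requested output, which is the contrast with hand-drawn flow graphs.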

Technology Overview

● Satisfaction is a Scala Application

o DSL for Workflows

o Akka Actor Dependency Engine

o Play! Frontend GUI

o Satisfier Extensions

● Data Products are Scala Projects

o Import needed Satisfiers in SBT

o Implement Flow in DSL

o Deploy to HDFS

Architecture

Key Concepts

● Goals and Satisfiers

● Witnesses and Variables

● Tracks and TrackDescriptors

Goals and Satisfiers

● Developers define Goals to be satisfied

● Goals produce one or more DataOutputs

● Goals depend upon the DataOutputs of other Goals

● A Goal can be satisfied with a specific Satisfier

o HiveSatisfier

o HadoopJobSatisfier

o ShellSatisfier

o SparkSatisfier
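One way to picture the Goal/Satisfier split is a small plug-in interface. The sketch below is plain Python with made-up names (`Satisfier`, `TemplateSatisfier`, `CallableSatisfier`, `Goal.run`); it is not the Satisfaction API, and the toy satisfiers only record or compute things instead of launching Hive or Hadoop jobs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Satisfier(ABC):
    """Hypothetical plug-in point: knows how to produce a goal's data."""

    @abstractmethod
    def satisfy(self, variables: dict) -> bool:
        """Produce the data for these variable values; True on success."""


class TemplateSatisfier(Satisfier):
    """Toy stand-in for a script runner: renders a template and records it."""

    def __init__(self, template):
        self.template = template
        self.executed = []

    def satisfy(self, variables):
        self.executed.append(self.template.format(**variables))
        return True


class CallableSatisfier(Satisfier):
    """Toy satisfier that delegates to an ordinary function."""

    def __init__(self, fn):
        self.fn = fn

    def satisfy(self, variables):
        return bool(self.fn(variables))


@dataclass
class Goal:
    name: str
    satisfier: Satisfier

    def run(self, variables):
        return self.satisfier.satisfy(variables)


script_like = TemplateSatisfier("load page_views for dt={dt}")
goals = [
    Goal("page_views", script_like),
    Goal("sanity_check", CallableSatisfier(lambda v: len(v["dt"]) == 8)),
]
results = {g.name: g.run({"dt": "20150408"}) for g in goals}
print(results)               # {'page_views': True, 'sanity_check': True}
print(script_like.executed)  # ['load page_views for dt=20150408']
```

Supporting a new technology then means adding another `Satisfier` subclass; the goals themselves do not change.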

Witnesses and Variables

● Goals define a set of Variables

o dt, hour, topic, network

● A Witness reifies a DataOutput to get a DataInstance

● To satisfy a Goal, you need to specify an appropriate Witness

● One can define rules to depend on Goals with a mapped Witness

o Depend upon yesterday’s data

o Depend upon all instances of a group
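A rough illustration of witnesses and witness mappings, again in plain Python and not the Satisfaction DSL. It assumes a witness is just a dictionary of variable values and that a data output can be described by a path template; the helper names (`reify`, `yesterday`, `all_in_group`), the `YYYYMMDD` date format, and the path layout are assumptions for the example.

```python
from datetime import datetime, timedelta


def reify(template, witness):
    """Bind a data output's variables to concrete values: one data instance."""
    return template.format(**witness)


def yesterday(witness):
    """Witness mapping: the same assignment with `dt` moved back one day."""
    day = datetime.strptime(witness["dt"], "%Y%m%d") - timedelta(days=1)
    return {**witness, "dt": day.strftime("%Y%m%d")}


def all_in_group(variable, values):
    """Witness mapping factory: fan out to one witness per value of `variable`."""
    def mapping(witness):
        return [{**witness, variable: v} for v in values]
    return mapping


witness = {"dt": "20150408", "network": "web"}

print(reify("page_views/dt={dt}/network={network}", witness))
# page_views/dt=20150408/network=web

print(yesterday(witness))
# {'dt': '20150407', 'network': 'web'}

fan_out = all_in_group("network", ["web", "mobile"])(witness)
print([w["network"] for w in fan_out])
# ['web', 'mobile']
```

A dependency rule is then just a function from the requesting goal's witness to the witness (or witnesses) of the goal it depends on.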

Tracks and TrackDescriptors

● A Track specifies the Goals and the top-level DataOutput

● A TrackDescriptor specifies a specific release of a Track

o Includes Version, User, and Variant

● Track can be “pimped out” with various traits

o Job Scheduling

o Retry-Logic

o Notifications

● Developer creates a repo for a project, and defines the Track

o Uses sbt to build and upload a Track to HDFS
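To make the TrackDescriptor idea concrete, here is a small Python sketch that treats a descriptor as an immutable key for one release of a Track. The slide only says a descriptor includes Version, User, and Variant; the `track_name` field, the field types, and the in-memory `releases` dictionary are assumptions for illustration, not how Satisfaction stores tracks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrackDescriptor:
    """Hypothetical key for one release of a Track (field names are assumed)."""
    track_name: str
    version: str
    user: Optional[str] = None     # set for a personal, non-production release
    variant: Optional[str] = None  # e.g. an experimental flavour of the track


releases = {}  # descriptor -> label for that release (illustrative only)

prod = TrackDescriptor("daily_report", "1.2")
trial = TrackDescriptor("daily_report", "1.3", user="alice", variant="new_model")
releases[prod] = "release A"
releases[trial] = "release B"

# An equal descriptor finds the same release, so lookups need no shared object.
print(releases[TrackDescriptor("daily_report", "1.2")])  # release A
print(len(releases))  # 2
```

Because the descriptor carries user and variant as well as version, a production release and a developer's trial release of the same Track can sit side by side.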

Development Model

● Data Engineers/Scientists define workflows as Tracks

o Scala Git Repo project for each Track

o Specify DataOutputs and dependencies for the Track

o Define top level Goals in Scala DSL

o Define ETL as Hive scripts (or Scalding, Hadoop, etc.)

Add as resources to project

Can define UDFs as project code

o Deploy to HDFS in Track directory

DEMO !!!

Next Steps:

● Currently in production internally

● Source code available

o http://github.com/tagged/satisfaction

● Still a work in progress

o Documentation

o Bug fixes

o UI improvements

● Additional Satisfiers (Spark, MLlib, Scalding)

● Job Progress and SLA Tracking TBD

Thank you!