satisfaction hadoop meetup presentation
Introducing Satisfaction:
The Next-Generation
Hadoop Scheduler
Jerome Banks
Big Data Engineer
08 April 2015
April SF Hadoop User’s Meetup
Satisfaction
Rise of the Data Product
● Industries of tomorrow produce “Data Products”
o Products created by processing Big Data
o Accuracy and timeliness of the data are key to its value
● Increasing pace of change
o Need to quickly prototype multiple different ideas
o Need to control the accidental complexity of chaos
Enter the Solution !!!
● Satisfaction runs your Big Data Jobs!
o Geared for the Hadoop/Hive/Scala ecosystem
o Monitors Job’s Status
o Tracks Job’s Progress
● Development model for Data Scientists and Engineers
● Dashboard for Ops and C-Level Execs
Motivation: Why Another Scheduler?
● Existing solutions are hard to use and incomplete
o Mish-mash of Oozie, Jenkins, and shell scripts
● Data-driven Agile orgs produce an increasing number of workflows
● Value of data makes SLA slippage very expensive
● Complexity of interactions between disparate Data sources
● Lack of understanding of org’s data assets
o How was this data generated?
o Who is using this data, and how?
● Operational issues are painful for DevOps group
o Monitoring, Tracking
o Restarts and Alerting
Next Generation Hadoop Scheduler
● Runs Hadoop/Hive Jobs
o Successor to Oozie/Azkaban/Jenkins
o Extensible to new technologies
● DevOps Infrastructure
o Job Monitoring/Notifications
o Job History/Log Capture
● Development Model
o DSL for Defining Workflows
o Packaging/Deployment for Big Data Apps
Three sets of customers
● Data Engineers/Data Scientists
o Crunches “Big Data”
o Creates “Data Products”
● DevOps
o Keeps the trains running on time
o Fixes things in bad weather
● C-Level Execs
o Wants to know the status of the company
o How much data are we processing?
o Will we make our SLAs?
o What are our data assets? What are they worth?
Differentiators
● Backward-Chaining
o Define dependencies, not flow graphs
● Data-focused
o Specify DataOutputs, not Actions
o Don’t re-generate already existing data
● Extensible
o Can implement new Satisfiers for new technologies (Spark, Shell, Scalding, etc.)
o Scala DSL, not XML
Simple things are simple, complex things are possible
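The backward-chaining and data-focused points above can be illustrated with a minimal sketch. It is not Satisfaction's actual API: `Goal`, `plan`, and the `existing` set (a stand-in for a lookup against HDFS or the Hive metastore) are hypothetical names chosen for illustration. The idea is to start from the output you want, walk back through its dependencies, and skip anything whose data already exists.

```scala
// Hypothetical sketch: none of these names come from Satisfaction itself.
object BackwardChaining {
  // A goal is identified by the data it produces and lists the goals it needs.
  final case class Goal(output: String, dependsOn: List[Goal] = Nil)

  // Work backwards from the requested goal and return the outputs that must
  // be generated, dependencies first. `existing` stands in for a lookup
  // against HDFS or the Hive metastore.
  def plan(goal: Goal, existing: Set[String]): List[String] = {
    def loop(g: Goal, done: Set[String], toRun: List[String]): (Set[String], List[String]) =
      if (done.contains(g.output)) (done, toRun) // data already exists: skip it
      else {
        // Resolve every dependency before scheduling the goal itself.
        val (d, r) = g.dependsOn.foldLeft((done, toRun)) {
          case ((d0, r0), dep) => loop(dep, d0, r0)
        }
        (d + g.output, r :+ g.output)
      }
    loop(goal, existing, Nil)._2
  }

  def main(args: Array[String]): Unit = {
    val raw    = Goal("raw_events")
    val clean  = Goal("clean_events", List(raw))
    val report = Goal("daily_report", List(clean))
    // raw_events is already present, so only the two downstream outputs run.
    println(plan(report, Set("raw_events"))) // List(clean_events, daily_report)
  }
}
```

Note that nothing here describes a flow graph: each goal only declares what it needs, and the execution order falls out of the traversal.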
Technology Overview
● Satisfaction is a Scala Application
o DSL for Workflows
o Akka Actor Dependency Engine
o Play! Frontend GUI
o Satisfier Extensions
● Data Products are Scala Projects
o Import needed Satisfiers in SBT
o Implement Flow in DSL
o Deploy to HDFS
Architecture
Key Concepts
● Goals And Satisfiers
● Witnesses and Variables
● Tracks and TrackDescriptors
Goals and Satisfiers
● Developers define Goals to be satisfied
● Goals produce one or more DataOutputs
● Goals depend upon the DataOutputs of other Goals
● A Goal can be satisfied with a specific Satisfier
o HiveSatisfier
o HadoopJobSatisfier
o ShellSatisfier
o SparkSatisfier
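The relationship between Goals and Satisfiers might look roughly like the sketch below. The signatures are assumptions, not the project's real interfaces: the `Satisfier` trait shape, `EchoSatisfier` (a stand-in that only renders the command it would run, where a real one would call Hive, Hadoop, a shell, or Spark), `run`, and the script names are all hypothetical.

```scala
// Hypothetical sketch: signatures are assumed, not taken from Satisfaction.
object GoalsAndSatisfiers {
  // A Satisfier knows how to produce data with one particular technology.
  trait Satisfier {
    def satisfy(variables: Map[String, String]): Boolean
  }

  // Stand-in for a technology-specific Satisfier: it fills {name}
  // placeholders in a command template and records the result.
  final class EchoSatisfier(template: String) extends Satisfier {
    var lastCommand: Option[String] = None
    def satisfy(variables: Map[String, String]): Boolean = {
      val command = variables.foldLeft(template) {
        case (cmd, (name, value)) => cmd.replace("{" + name + "}", value)
      }
      lastCommand = Some(command)
      true
    }
  }

  // A Goal names its outputs, the Satisfier that produces them, and the
  // Goals whose outputs it depends on.
  final case class Goal(
      name: String,
      outputs: List[String],
      satisfier: Satisfier,
      dependsOn: List[Goal] = Nil)

  // Satisfy the dependencies first, then the goal; stop at the first failure.
  def run(goal: Goal, variables: Map[String, String]): Boolean =
    goal.dependsOn.forall(run(_, variables)) && goal.satisfier.satisfy(variables)

  def main(args: Array[String]): Unit = {
    val load      = Goal("load_events", List("events"), new EchoSatisfier("./load_events.sh {dt}"))
    val aggregate = new EchoSatisfier("./aggregate.sh {dt}")
    val report    = Goal("daily_report", List("report"), aggregate, List(load))
    println(run(report, Map("dt" -> "20150408")))
    println(aggregate.lastCommand)
  }
}
```

Keeping the Satisfier behind a small trait is what makes the scheduler extensible: supporting a new technology means writing one new implementation, not changing how Goals are declared.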
Witnesses and Variables
● Goals define a set of Variables
o dt, hour, topic, network
● A Witness reifies a DataOutput to get a DataInstance
● To satisfy a Goal, you need to specify an appropriate Witness
● One can define rules to depend on Goals with a mapped Witness
o Depend upon yesterday’s data
o Depend upon all instances of a group
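One way to picture Witnesses is as an assignment of values to a Goal's Variables, with mapping rules that rewrite a Witness before it is handed to a dependency. The sketch below is an assumption-laden illustration, not the real DSL: the `Witness` type alias, `DataOutput.instance`, `yesterday`, `allOf`, the `yyyyMMdd` date format for `dt`, and the path template are all hypothetical.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical sketch: types and helper names are assumed for illustration.
object Witnesses {
  // A Witness assigns a value to each variable, e.g. dt -> 20150408.
  type Witness = Map[String, String]

  // A DataOutput is a template; a Witness turns it into a concrete instance.
  final case class DataOutput(pathTemplate: String, variables: Set[String]) {
    def instance(w: Witness): Option[String] =
      if (variables.subsetOf(w.keySet))
        Some(variables.foldLeft(pathTemplate)((p, v) => p.replace("{" + v + "}", w(v))))
      else None // the Witness does not bind every variable
  }

  private val fmt = DateTimeFormatter.BASIC_ISO_DATE // yyyyMMdd (assumed format)

  // Mapping rule: depend upon yesterday's data.
  def yesterday(w: Witness): Witness = w.get("dt") match {
    case Some(dt) => w.updated("dt", LocalDate.parse(dt, fmt).minusDays(1).format(fmt))
    case None     => w
  }

  // Mapping rule: depend upon all instances of a group.
  def allOf(variable: String, values: Seq[String])(w: Witness): Seq[Witness] =
    values.map(v => w.updated(variable, v))

  def main(args: Array[String]): Unit = {
    val events = DataOutput("/data/events/{dt}/{network}", Set("dt", "network"))
    val today  = Map("dt" -> "20150401", "network" -> "a")
    println(events.instance(today))
    println(events.instance(yesterday(today)))
    println(allOf("network", Seq("a", "b"))(today).flatMap(w => events.instance(w)))
  }
}
```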
Tracks and TrackDescriptors
● A Track specifies the Goals, and the TopLevel DataOutput
● A TrackDescriptor specifies a specific release of a Track
o Includes Version, User, and Variant
● Track can be “pimped out” with various traits
o Job Scheduling
o Retry-Logic
o Notifications
● Developer creates a repo for a project, and defines the Track
o Uses sbt to build and upload a Track to HDFS
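Since a Track picks up scheduling, retry, and notification behaviour through traits, Scala mix-ins are a natural fit. The sketch below only illustrates that pattern; `Recurring`, `Retrying`, `Notifying`, their members, and `features` are hypothetical names, and the field order of `TrackDescriptor` is an assumption.

```scala
// Hypothetical sketch: trait names and members are assumed for illustration.
object Tracks {
  // Identifies one specific release of a Track.
  final case class TrackDescriptor(name: String, user: String, variant: String, version: String)

  // A Track ties a descriptor to its top-level outputs.
  class Track(val descriptor: TrackDescriptor, val topLevelOutputs: List[String])

  // Optional behaviour is layered on with mix-in traits.
  trait Recurring { def frequencyMinutes: Int }
  trait Retrying  { def maxRetries: Int = 3 }
  trait Notifying { def notifyEmails: List[String] }

  // Inspect which traits a given Track was built with.
  def features(t: Track): List[String] = {
    val recurring = t match {
      case r: Recurring => List(s"every ${r.frequencyMinutes} min")
      case _            => Nil
    }
    val retrying = t match {
      case r: Retrying => List(s"up to ${r.maxRetries} retries")
      case _           => Nil
    }
    val notifying = t match {
      case n: Notifying =>
        val recipients = n.notifyEmails.mkString(", ")
        List(s"notify $recipients")
      case _ => Nil
    }
    recurring ++ retrying ++ notifying
  }

  def main(args: Array[String]): Unit = {
    val descriptor = TrackDescriptor("daily_report", "alice", "prod", "1.0")
    val daily = new Track(descriptor, List("daily_report")) with Recurring with Retrying {
      val frequencyMinutes = 1440
    }
    println(features(daily))
  }
}
```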
Development Model
● Data Engineers/Scientists define workflows as Tracks
o Scala Git Repo project for each Track
o Specify DataOutputs and dependencies for the Track
o Define top level Goals in Scala DSL
o Define ETL as Hive scripts (or Scalding, Hadoop, etc.); add them as resources to the project, and define UDFs as project code
o Deploy to HDFS in track Directory
DEMO !!!
Next Steps:
● Currently in production internally
● Source code available
o http://github.com/tagged/satisfaction
● Still a work in progress
o Documentation
o Bug fixes
o UI improvements
● Additional Satisfiers (Spark, MLlib, Scalding)
● Job Progress and SLA Tracking TBD
Thank you!