Rise of Intermediary APIs - Beam and Alluxio at Alluxio Meetup 2016

Eric Anderson
Product Manager
@ericmander

Rise of Intermediary APIs (Beam and Alluxio)
https://goo.gl/Fa95XZ


About Me

Product Manager at Google on Cloud Dataflow
Work closely with most of the Apache Beam committers
On the Project Management Committee for Alluxio
Contributed Google Compute Engine support to Alluxio

Originally from Salt Lake City, UT
Father of 3 kids!

Twitter: @ericmander


Intermediary API?

Jesse Anderson (formerly Cloudera) in blog post: Strata+Hadoop Trends

I’m open to a better name if you have ideas


In the beginning...

There was only one approach to data processing

HDFS GFS

Hadoop MapReduce

Open Source Google


In the beginning...

And it required just two APIs, one for job description, one for storage

HDFS API GFS API

Hadoop API MapReduce API

Hadoop MR

HDFS GFS

Open Source Google


Then there was an evolution

But MapReduce was really hard (data processing in assembly language)

MapReduce API

MR


Flume (2010)

Flume was a programming model (API) innovation
(FlumeJava, not Apache Flume)

MapReduce API

Flume

MR

Programming Model

Higher-level abstractions
- PCollections (analogous to Spark RDDs)
- PTransforms

Directed Acyclic Graphs (DAGs)
Pipeline optimization (fusing)
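A minimal sketch of the idea, not FlumeJava's actual API: the `PCollection` class below is a toy stand-in whose `apply` only records steps, and whose `run` pushes each element through every recorded step in one pass. That single pass is what fusing buys: no intermediate collection is materialized between steps. A linear chain is used here as the simplest possible DAG.

```python
from typing import Callable, Iterable, List, Optional


class PCollection:
    """Toy deferred collection: records transforms, computes nothing until run()."""

    def __init__(self, source: Iterable, steps: Optional[List[Callable]] = None):
        self.source = source
        self.steps = steps or []

    def apply(self, fn: Callable) -> "PCollection":
        # Applying a transform only extends the graph; no data is touched yet.
        return PCollection(self.source, self.steps + [fn])

    def run(self) -> list:
        # Fused execution: each element flows through all steps in one pass,
        # so no intermediate collection exists between steps.
        out = []
        for element in self.source:
            for fn in self.steps:
                element = fn(element)
            out.append(element)
        return out


words = PCollection(["beam", "alluxio", "flume"])
lengths = words.apply(str.upper).apply(len)  # graph built, nothing executed
result = lengths.run()
print(result)  # [4, 7, 5]
```

Because the graph is only a description, an optimizer is free to rewrite it before anything runs, which is the point of separating the programming model from the execution model.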


Millwheel (2013)

Millwheel was an execution model innovation
Introduced a new API, as expected
3 APIs, 2 processing systems!

MapReduce API Millwheel API

Flume

MR Millwheel

Execution Model

Low latency, exactly-once, stream processing


Programming model innovation: Batch and streaming unified
Execution model innovation: Managed batch and service

Dataflow (2015)


Flume Dataflow SDK

Cloud Dataflow

MR Millwheel

Programming Model

Execution Model



“We believe that [...] the Beam model is the future reference programming model for writing data applications in both stream and batch”
- Kostas Tzoumas, CEO of data Artisans and Flink co-creator

Apache Beam (2016)

Flink API Dataflow SDK

Cloud Dataflow Flink Spark

Spark API

Local

Apache Beam


Apache Beam

1. The Beam Programming Model (unifies streaming/batch)
   a. Transformations
   b. Windowing
   c. Watermarks + Triggers
   d. Accumulation

2. SDKs for writing Beam pipelines
   a. Java (Scala thanks to Spotify)
   b. Python

3. Runners for existing distributed processing backends
   a. Apache Flink (thanks to data Artisans)
   b. Apache Spark (thanks to Cloudera and PayPal)
   c. Google Cloud Dataflow (fully managed service from Google)
   d. Local runner for testing
   e. Other runners in progress: Gearpump, Apache Apex
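To make the windowing item concrete, here is a rough pure-Python sketch of fixed event-time windows. It does not use the Beam SDK; `fixed_windows` and the sample events are made up for illustration, and watermarks, triggers, and accumulation modes are deliberately left out.

```python
from collections import defaultdict


def fixed_windows(events, size):
    """Group (timestamp, value) pairs into [start, start + size) windows by event time."""
    windows = defaultdict(list)
    for timestamp, value in events:
        start = (timestamp // size) * size  # window the event belongs to
        windows[start].append(value)
    return dict(windows)


# Events arrive out of order; each is assigned by its own timestamp,
# not by arrival order, which is what lets batch and streaming share one model.
events = [(1, 10), (65, 5), (30, 7), (61, 1), (59, 2)]
sums = {start: sum(vals) for start, vals in sorted(fixed_windows(events, 60).items())}
print(sums)  # {0: 19, 60: 6}
```

In a real streaming system the open question is when a window's result may be emitted, which is the job of the watermarks and triggers listed above.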


There is once again only one library we need for data processing, except this time:
- It’s easy/expressive
- And we can still choose from the best execution technology
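A small sketch of what "choose the execution technology" means for an intermediary API: the job is described once and handed to interchangeable backends. The `Runner` protocol and both runner classes below are hypothetical and only illustrate the shape; they are not Beam's runner interface.

```python
from typing import Callable, List, Protocol


class Runner(Protocol):
    """Hypothetical contract every execution backend must satisfy."""

    def run(self, data: list, steps: List[Callable]) -> list: ...


class StagewiseRunner:
    """Hypothetical backend: applies each step to the whole dataset in turn."""

    def run(self, data, steps):
        for fn in steps:
            data = [fn(x) for x in data]
        return data


class ElementwiseRunner:
    """Hypothetical backend: pushes one element at a time through all steps."""

    def run(self, data, steps):
        out = []
        for x in data:
            for fn in steps:
                x = fn(x)
            out.append(x)
        return out


# The job description is written once...
steps = [lambda x: x + 1, lambda x: x * 2]
data = [1, 2, 3]

# ...and the backend is a deployment-time choice.
results = [runner.run(data, steps) for runner in (StagewiseRunner(), ElementwiseRunner())]
print(results)  # [[4, 6, 8], [4, 6, 8]]
```

The two backends differ in how they execute, which affects performance, but not in what they compute.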

Apache Beam (2017?)

Flink API Dataflow SDK

Dataflow Flink Spark

Spark API

Local

Apache Beam

Gearpump Apache Apex


Coming full circle

There is once again only one library we need for data processing, except this time:
- It’s easy/expressive
- And we can still choose from the best execution technology

Yet, we’ve tried this before...


Hadoop MR


Apache Crunch (2012)

Apache Crunch is an open source Flume-like API on Hadoop and now Spark.

MapReduce API

Crunch

Hadoop

Programming Model

MapReduce

Flume


Apache Crunch (2012)

Interest in Apache Crunch vs Apache Beam

Why? Perhaps...
● Limited portability need / value
● Missed the streaming revolution
● Community support


What about storage?

And it required just two APIs, one for job description, one for storage

HDFS API GFS API


Hadoop MR

HDFS GFS

Open Source Google


Need for Intermediary Storage API

Again, an explosion of options

No reason to believe this will ever end. There will always be innovation on storage and the file system

HDFS API Swift API

HDFS Swift GCS / S3

GCS / S3 APIs

GlusterFS

GlusterFS API


Model for expressing storage lifecycle

There are patterns we want to express:
● Caching
● Retention policy
● ACLs
● Down-tiering old or stale data

Across storage systems:
● Unified namespace
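As one illustration of expressing a lifecycle pattern, the sketch below down-tiers files that have sat idle too long. Everything in it is an assumption made for the example: the `FileMeta` record, the tier names, and the `down_tier` policy are hypothetical and do not reflect Alluxio's API or configuration.

```python
from dataclasses import dataclass

# Hypothetical tiers, fastest first.
TIERS = ["MEM", "SSD", "HDD"]


@dataclass
class FileMeta:
    path: str
    tier: str           # one of TIERS
    last_access: float  # seconds since some epoch


def down_tier(files, now, max_idle):
    """Move each file idle for longer than max_idle one tier down; return moved paths."""
    moved = []
    for f in files:
        idx = TIERS.index(f.tier)
        is_stale = now - f.last_access > max_idle
        if is_stale and idx < len(TIERS) - 1:  # lowest tier has nowhere to go
            f.tier = TIERS[idx + 1]
            moved.append(f.path)
    return moved


files = [
    FileMeta("/logs/old", "MEM", 0.0),
    FileMeta("/logs/new", "MEM", 950.0),
    FileMeta("/cold", "HDD", 0.0),
]
moved = down_tier(files, now=1000.0, max_idle=100.0)
print(moved)  # ['/logs/old']
```

The policy is stated logically (idle time, tier order) and says nothing about which physical system backs each tier.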


PRD: Intermediate Storage API

1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks
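Requirements 2 and 3 can be sketched as a mount table over pluggable stores. This is an illustrative toy, not Alluxio's implementation: `InMemoryStore` stands in for a real under-storage plugin (HDFS, S3, and so on), and `UnifiedNamespace` and its methods are hypothetical names.

```python
class InMemoryStore:
    """Hypothetical under-storage plugin; a real one would wrap HDFS, S3, etc."""

    def __init__(self):
        self.blobs = {}

    def write(self, key, data):
        self.blobs[key] = data

    def read(self, key):
        return self.blobs[key]


class UnifiedNamespace:
    """One path space for callers; each mount point delegates to a plugged-in store."""

    def __init__(self):
        self.mounts = {}

    def mount(self, prefix, store):
        self.mounts[prefix.rstrip("/")] = store

    def _resolve(self, path):
        # Longest matching mount prefix wins, so nested mounts override parents.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self.mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)

    def write(self, path, data):
        store, key = self._resolve(path)
        store.write(key, data)

    def read(self, path):
        store, key = self._resolve(path)
        return store.read(key)


hdfs_like, s3_like = InMemoryStore(), InMemoryStore()
ns = UnifiedNamespace()
ns.mount("/warehouse", hdfs_like)
ns.mount("/warehouse/archive", s3_like)

ns.write("/warehouse/t1", b"hot")
ns.write("/warehouse/archive/t0", b"cold")
print(ns.read("/warehouse/archive/t0"))  # b'cold'
```

Supporting a new storage system then means writing one more class with `write` and `read`, without touching any caller.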


Pluggable under storage

Unified namespace + Tiered storage + Lineage

Supports at least a half dozen under storage systems

Alluxio

1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks

Alibaba OSS Swift HDFS GCS / S3

Alluxio

GlusterFS NFS

Hadoop Flink Spark Local HBase Presto


Survival tests

Portability need / value
- Apache Beam: Lots of frameworks with varying performance profiles
- Alluxio: Lots of frameworks and storage systems with varying performance profiles

Catch the technology wave
- Apache Beam: Leading stream processing revolution
- Alluxio: Leading in-memory revolution

Community support
- Apache Beam: Top names in data processing
- Alluxio: Fastest growing contributor base


My particular excitement about Alluxio

It’s a particularly interesting intermediary API because:

● Data has gravity: Alluxio allows enterprises to adopt new tech alongside legacy storage.

● Alluxio’s unification of sources is valuable within a single job. Beam is used with one framework at a time, so its portability shows up only across jobs.

● Alluxio has standalone value from its built-in open source in-memory filesystem. Beam requires an underlying execution engine like Dataflow.


Intermediary APIs = Data processing nirvana

Coder:
● Ability to express my data processing job or storage lifecycle logically, independent of physical constraints.

Deployer:
● Code portability
● Swap in technology at will

System/Technology Creators:
● Easy path to adoption
● Focus on features and performance, not APIs/connectivity


Apache Beam

Alluxio

Stack of the future?

Dataflow Flink Spark Local Gearpump Apache Apex

Alibaba OSS Swift HDFS GCS / S3 GlusterFS NFS


Questions?
https://goo.gl/Fa95XZ