data pipelines in hadoop - sap meetup in tel aviv
TRANSCRIPT
Data Processing in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
Big Data & Data Science Israel Meetup – 21.03.2017
Analytics and Data Pipelines in Practice
About Me
• Partner & Co-Founder at OpenCore
• Before: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer: HBase and Whirr
• O'Reilly author: HBase – The Definitive Guide
  • Also in Japanese, Korean & Chinese
  • 2nd edition out soon!
• Contact: [email protected] • @larsgeorge
The Japanese edition is out, too!
Agenda
• Hadoop History
• Data Pipelines
• Hadoop Components
• Data Processing
• Summary
Hadoop History
A walk through time…
Tectonic Shifting: Prevalent Data Inertia
The Original Inspirations for Hadoop
[Figure: the Google GFS paper (2003) and MapReduce paper (2004)]
A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers to both a kernel containing the aforementioned pieces, as well as a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access frameworks, and other componentry.
Recognize this guy?
Hadoop’s Original Architecture
MapReduce (Data Processing and Resource Management)
HDFS (Filesystem/Storage)
Hadoop's Architecture Today
MapReduce (Data Processing)
YARN (Resource Management)
HDFS (Storage)
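The split into processing (MapReduce), resource management (YARN), and storage (HDFS) is easiest to see in the programming model itself. As a toy illustration (plain single-process Python, no Hadoop involved), the classic word count traces the map → shuffle → reduce flow that the framework distributes across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["the"] == 2
```

In real Hadoop the same three phases run on many machines, with HDFS holding the input splits and YARN handing out the containers they run in.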
Popular by Demand
• More resources are poured into Hadoop than into many other projects
• Vibrant community, with many commercial entities backing the development
• The list on the right shows separate projects, which are combined in Hadoop distributions
• The total would far exceed anything else
• Literally no alternatives!
Data Pipelines
From deluge to insight
Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new data in various ways
• Access to data allows consumers to build products
• All of this needs to be
  • Automated & managed
  • Done in a secure manner
• Finally, pipelines need to be properly onboarded
• Discovery is necessary to find schemas, data sources, etc.
[Diagram: layered pipeline stack – Ingest → Storage & Processing → Access, sitting on Automation + Data & Resource Management; Authentication, Authorization, Audits; Onboarding & Discovery; Physical Systems]
Pipelines Increase Value of Data
Now that we know how data pipelines span many layers in both hardware and software, we can look at what Hadoop has to offer in more detail…
Hadoop Components
Growth and Controversies
Example: Cloudera
Batch, Interactive, and Real-Time. Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

Process:
• Ingest: Sqoop, Flume, NiFi
• Transform: MapReduce, Hive, Pig, Spark
• Discover: Analytic Database (Impala)
• Search: Solr
• Model: Machine Learning (SAS, R, Spark, Mahout)
• Serve: NoSQL Database (HBase)
• Streaming: Spark Streaming
• Unlimited Storage: HDFS, HBase
Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
One Platform, Many Workloads
Hadoop: One Platform
• Unlike the silo'ed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos: everything is just a directory with data inside
But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!
Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to whatever is needed
But…
• Once converted, how do you switch between workloads?
• How do you share the engine with different users?
Hadoop Architecture Today
• Components are selected to match customer demands
• A platform has many advantages, including paid QA time
• Some newer components can be added later on (labs etc.)
• Many buzzwords that need to be carefully vetted…
Evolution of the Hadoop Platform
• 2006: Core Hadoop (HDFS, MapReduce)
• 2007: + Solr, Pig
• 2008: + HBase, ZooKeeper
• 2009: + Hive, Mahout
• 2010: + Sqoop, Avro
• 2011: + Flume, Bigtop, Oozie, HCatalog, Hue
• 2012: + YARN, Spark, Tez, Impala, Kafka, Drill
• 2013: + Parquet, Sentry
• 2014: + Knox, Flink
• 2015: + Kudu, RecordService, Ibis, Falcon
The stack is continually evolving and growing!
And There Is More
Hadoop - The Movie: “Divergent”
[Timeline: starting from Hadoop Core (2006), the CDH (2008) and HDP (2011) distributions diverge, each adding its own components through 2017 – CM, Navigator, Sentry, Impala, Kudu, CDSW on one side; Ambari, Ranger, Knox, Atlas, Zeppelin on the other – alongside shared pieces such as Solr, Spark, Kafka, and YARN]
So, Hadoop is both complicated and divergent? How can we build data pipelines then, using its components? What else is needed?
Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"
Wait! Before we can look at the aspects of building a data pipeline, a bit more context on where users are coming from and what their needs are: The Waves of Adoption.
Waves of Adoption #1
• The "AllSpark" (as in the Transformers movie)
  • First companies to adopt Hadoop as a way to mirror Google's approach
• Early Adopters
  • Inspired by early success stories, these engineering-focused companies extended Hadoop
• Followers
  • Companies that are OK with trying out new things
  • Still engineering driven
• Late Bloomers
  • First Enterprises
• New Wave
  • Everyone else…
[Diagram: adoption waves over time – AllSpark → Early Adopters → Followers → Late Bloomers → Enterprises (TODAY!)]
Waves of Adoption #2
• Batch: simple logic at bulk (batch processing of petabytes)
  • What: Reporting
  • With: SQL (Hive), Pig
  • Who: Analysts, Developers
• Lambda/Kappa?: streaming logic, likely in a Lambda architecture
  • What: Decision support
  • With: OLAP Analytics, Druid, Oryx
  • Who: Data architects, DevOps
• Hybrids (Lambda FTW?): complex analytics
  • What: Machine Learning, AI
  • With: Notebooks, DS Workbench, …
  • Who: Data Scientists
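The core of the Lambda idea can be shown in a few lines: a batch layer periodically recomputes an authoritative view, a speed layer keeps incremental results for events since the last batch run, and the serving layer merges the two. A toy sketch (all names and numbers illustrative, not any real framework's API):

```python
# Batch view: recomputed in full, e.g. nightly, from the master dataset.
batch_view = {"page_a": 1000, "page_b": 250}

# Speed view: incremental counts for events since the last batch run.
speed_view = {"page_a": 7, "page_c": 3}

def serve(key):
    """Serving layer: answer = batch result + real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

When the next batch run completes, the speed view for the covered period is discarded; Kappa-style designs instead drop the batch layer and recompute by replaying the stream.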
Stages: Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems
Storage & Processing
Storage
• Reliable and scalable systems: HDFS, Kafka, HBase
  • What about Kudu, Cassandra, MongoDB, …?
• Data laid out in a structured manner
  • Information Architecture
  • Physical storage (e.g. columnar)
Processing
• Generic framework: YARN
  • What about Mesos? Non-batch jobs?
• Resource management hooks
• Pluggable engines
  • MapReduce, Spark, …
  • MPP systems?
Information Architecture
• There is a need to define how data flows through the system and how it is organized
• This simplifies the onboarding process
• Can be simple, or arbitrarily complex
• Needs to be enforced as it is used
• A living system: it may need to adapt
• Define batch and stream interfaces
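Even a very simple information architecture pays off. As an illustrative sketch (the zone names and path scheme are assumptions, not a standard), a single helper that every pipeline uses to place data makes datasets discoverable and quotas enforceable:

```python
from datetime import date

def dataset_path(zone, source, dataset, day):
    """Illustrative naming convention: /data/<zone>/<source>/<dataset>/
    plus date-based partitions, so consumers can find data predictably
    and retention/quota rules can be applied per subtree."""
    return (f"/data/{zone}/{source}/{dataset}"
            f"/year={day.year}/month={day.month:02d}/day={day.day:02d}")

p = dataset_path("raw", "crm", "orders", date(2017, 3, 21))
# "/data/raw/crm/orders/year=2017/month=03/day=21"
```

Because the convention is code, it can be enforced at ingest time rather than documented and hoped for.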
Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn the service idea into YARN using kludges
  • Examples: Slider, Twill
Ingest
• Purpose
  • Receive data from heterogeneous sources
  • Save as-is, or do first-pass processing
  • Store data in the best format, aggregate small files
  • Comply with stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
  • Streamsets, Tamr, Waterline Data, Trifacta, IBM, …
  • Often a generic task, with Hadoop being only one target
• Open-source frameworks
  • NiFi
  • Flume (with Kafka)?
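The "aggregate small files" point deserves emphasis: HDFS tracks every file in NameNode memory, so millions of tiny files hurt. A minimal sketch of the idea, assuming nothing beyond plain Python (real ingest tools such as Flume or NiFi do this with file rolling policies):

```python
class SmallFileCompactor:
    """Illustrative sketch: buffer many small incoming payloads and
    write them out as one larger file, to avoid small-file pressure."""

    def __init__(self, flush_bytes=128 * 1024 * 1024):
        self.flush_bytes = flush_bytes   # roll roughly at HDFS block size
        self.buffer = []
        self.size = 0
        self.flushed = []                # stands in for files written to HDFS

    def ingest(self, payload: bytes):
        self.buffer.append(payload)
        self.size += len(payload)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            # One large file instead of many small ones.
            self.flushed.append(b"".join(self.buffer))
            self.buffer, self.size = [], 0
```

A time-based flush would be added in practice so slow sources do not buffer forever; that trade-off (latency vs. file size) is exactly what ingest tools expose as configuration.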
Access
• Hadoop traditionally has only a few interfaces
  • Interactive SQL: Shell, Notebooks, Hue; JDBC/ODBC
  • File access: WebHDFS/HttpFs
  • Gateways: REST, Knox
• Needs to be set up based on the use-case
  • Throughput vs. latency
• Must apply security rules
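WebHDFS is worth a concrete look, since it is often the simplest way in for external consumers: plain HTTP against the NameNode's web endpoint. A small sketch that only builds the request URL (host and user name are placeholders; in production the call would go through Knox or HttpFS and carry Kerberos/SPNEGO credentials rather than a `user.name` parameter):

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, path, op, user=None):
    """Build a WebHDFS REST URL (HDFS's HTTP gateway).
    `namenode` is the host:port of the NameNode web endpoint."""
    params = {"op": op}
    if user:
        params["user.name"] = user   # simple-auth user; not for secure clusters
    return f"http://{namenode}/webhdfs/v1{path}?{urlencode(params)}"

url = webhdfs_url("namenode:50070", "/data/raw/orders", "LISTSTATUS", user="etl")
```

Operations like `LISTSTATUS`, `OPEN`, and `GETFILESTATUS` are read-side; writes involve a two-step redirect to a DataNode, which is one reason gateways are preferred at the perimeter.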
Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• A full development lifecycle needs to be established
• Precious resources need to be managed
  • Easier if use-cases all fall into the same category
  • Difficult when they span many systems
• One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration
Automation
• Directed acyclic graphs (DAGs)
  • Define the actions and link them
  • Schedule based on various events (time or data)
  • Handle errors and maintenance
• Examples
  • Apache Oozie [2007; open-sourced 2010; Apache 2012] – Java; XML or Hue
  • Azkaban (LinkedIn) [2010] – Java
  • Luigi (Spotify) [2012] – Python
  • Apache Airflow (Airbnb) [2015] – Python
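All of these tools share the same core: a DAG of named actions executed in dependency order, with a failed action aborting the run. A deliberately minimal sketch of that core (no scheduling, retries, or cycle detection, which is most of what the real tools add):

```python
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Runs each task after its dependencies; returns execution order.
    An exception in any task aborts the whole run, like a failed action."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)          # ensure dependencies ran first
        tasks[name]()              # the action itself
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

For example, with `deps = {"transform": ["extract"], "load": ["transform"]}`, the runner always executes extract → transform → load regardless of the order the tasks were declared in.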
Example: Notebooks
• Data scientists like prototyping, but how to bring the results into production?
• One attempt is to boost notebooks with a framework that can handle their chaining and execution
  • Shared resources used
  • Depends on notebook backends
Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
Security
• Many moving parts
  • Kerberos
  • RPC level
  • ACLs
  • RBAC
  • UIs
  • Data
  • Encryption (at-rest and in-transit)
• Hard to configure properly
• Management software helps to a degree
Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
  • Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding, explain the shared nature of Hadoop
  • Avoid "long faces" due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
• Push updated configuration and notifications
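Turning the onboarding answers into initial limits is a good candidate for the self-service automation mentioned above. A hypothetical sketch (all field names, headroom factor, and caps are placeholder policy, not a recommendation):

```python
def onboarding_defaults(answers):
    """Derive deliberately strict starting limits from onboarding answers.
    These would feed HDFS quota commands and YARN queue configuration."""
    headroom = 1.2   # illustrative: 20% above the stated need
    return {
        "hdfs_space_quota_gb": int(answers["expected_data_gb"] * headroom),
        "hdfs_file_quota": answers.get("expected_files", 100_000),
        # Cap the initial queue share; it can grow once usage is understood.
        "yarn_queue_capacity_pct": min(answers.get("requested_share_pct", 5), 10),
    }

cfg = onboarding_defaults({"expected_data_gb": 500, "requested_share_pct": 25})
# starts the tenant at a 600 GB space quota and a capped 10% queue share
```

Starting strict and loosening later is far easier, politically and technically, than clawing back resources from an established tenant.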
Stack Architecture
• Combine the reliable components into a whole stack
• Organize interfaces to outside systems by users and purpose
• Separate components for ease of maintenance
• Layer the network to fit the data flow
• Tight security control at vital points
Network Architecture
Wrap Up
Data pipeline deconstruction
“Oh… and I thought I just add Hadoop to our technology landscape… you know, like a database or an appliance.”
– Misled Decision Maker
Hype Curve
[Figure: the Gartner hype cycle – Visibility over Time: Technology Trigger → Peak of Inflated Expectations → Trough of Disillusionment → Slope of Enlightenment → Plateau of Productivity]
Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
  • "Shaky foundations"?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?
Hype Curve – The Hadoop Version
[Figure: the same curve, annotated along the way: "Big Data is strategic for us!" → First PoC → "Where are the results?" → "Darn, Hadoop is difficult!" → "Security? Multitenancy? Development? Lifecycle? Environments?" → "Maybe Hadoop is not for us?" → Allocate more resources & budget → First use-case in production → Hadoop team productivity]
Meanwhile…
Summary
• Data pipelines span many levels of architecture
  • Hardware, networking, information, security, data management
• Core Hadoop itself provides only a little in that regard
  • Vendors offer some support (closed or open source)
• Use-cases are often unknown
  • Guess as well as possible, generalize
• Careful planning is vital; mistakes are costly
  • Mixed workloads are a nightmare for resource management
  • Keep things simple (KISS principle)
• Knowledge needs to be built upfront
  • Hire someone in the know!
Thank You!@larsgeorge