data pipelines in hadoop - sap meetup in tel aviv
TRANSCRIPT
Data Processing in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
Big Data & Data Science Israel Meetup – 21.03.2017
Analytics and Data Pipelines in Practice
About Me
• Partner & Co-Founder at OpenCore
• Before: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer: HBase and Whirr
• O'Reilly author: HBase – The Definitive Guide
  • Also in Japanese, Korean & Chinese
  • 2nd edition out soon!
• Contact: [email protected] • @larsgeorge
The Japanese edition is out, too!
Agenda
• Hadoop History
• Data Pipelines
• Hadoop Components
• Data Processing
• Summary
Hadoop History
A walk through time…
Tectonic Shifting: Prevalent Data Inertia
The Original Inspirations for Hadoop
[Figure: the Google GFS paper (2003) and MapReduce paper (2004)]
A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers to both a kernel containing the aforementioned pieces, as well as a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access frameworks, and other componentry.
Recognize this guy?
Hadoop’s Original Architecture
MapReduce (Data Processing and Resource Management)
HDFS (Filesystem/Storage)
Hadoop's Architecture Today
MapReduce (Data Processing)
YARN (Resource Management)
HDFS (Storage)
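The split into processing (MapReduce), resource management (YARN), and storage (HDFS) is easiest to see in the programming model itself. As a toy illustration (plain single-process Python, no Hadoop involved), the classic word count traces the map → shuffle → reduce flow that the framework distributes across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["the"] == 2
```

In real Hadoop the same three phases run on many machines, with HDFS holding the input splits and YARN handing out the containers they run in.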
Popular by Demand
• More resources are poured into Hadoop than into many other projects
• Vibrant community, with many commercial entities backing the development
• The list on the right shows separate projects, which are combined in Hadoop distributions
• The total would far exceed anything else
• Literally no alternatives!
Data Pipelines
From deluge to insight
Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new data in various ways
• Access to data allows consumers to build products
• All of this needs to be
  • Automated & managed
  • Done in a secure manner
• Finally, pipelines need to be properly onboarded
• Discovery is necessary to find schemas, data sources, etc.
[Diagram: layered pipeline stack – Ingest → Storage & Processing → Access, sitting on Automation + Data & Resource Management; Authentication, Authorization, Audits; Onboarding & Discovery; Physical Systems]
Pipelines Increase Value of Data
Now that we know how data pipelines span many layers in both hardware and software, we can look at what Hadoop has to offer in more detail…
Hadoop Components
Growth and Controversies
Example: Cloudera
Batch, Interactive, and Real-Time. Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

Process:
• Ingest: Sqoop, Flume, NiFi
• Transform: MapReduce, Hive, Pig, Spark
• Discover: Analytic Database (Impala)
• Search: Solr
• Model: Machine Learning (SAS, R, Spark, Mahout)
• Serve: NoSQL Database (HBase)
• Streaming: Spark Streaming
• Unlimited Storage: HDFS, HBase
Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
One Platform, Many Workloads
Hadoop: One Platform
• Unlike the silo'ed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos: everything is just a directory with data inside
But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!
Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to whatever is needed
But…
• Once converted, how do you switch between workloads?
• How do you share the engine with different users?
Hadoop Architecture Today
• Components are selected to match customer demands
• A platform has many advantages, including paid QA time
• Some newer components can be added later on (labs etc.)
• Many buzzwords that need to be carefully vetted…
Evolution of the Hadoop Platform
• 2006: Core Hadoop (HDFS, MapReduce)
• 2007: + Solr, Pig
• 2008: + HBase, ZooKeeper
• 2009: + Hive, Mahout
• 2010: + Sqoop, Avro
• 2011: + Flume, Bigtop, Oozie, HCatalog, Hue
• 2012: + YARN, Spark, Tez, Impala, Kafka, Drill
• 2013: + Parquet, Sentry
• 2014: + Knox, Flink
• 2015: + Kudu, RecordService, Ibis, Falcon
The stack is continually evolving and growing!
And There Is More
Hadoop - The Movie: “Divergent”
[Timeline: starting from Hadoop Core (2006), the CDH (2008) and HDP (2011) distributions diverge, each adding its own components through 2017 – CM, Navigator, Sentry, Impala, Kudu, CDSW on one side; Ambari, Ranger, Knox, Atlas, Zeppelin on the other – alongside shared pieces such as Solr, Spark, Kafka, and YARN]
So, Hadoop is both complicated and divergent? How can we build data pipelines then, using its components? What else is needed?
Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"
Wait! Before we can look at the aspects of building a data pipeline, a bit more context on where users are coming from and what their needs are: The Waves of Adoption.
Waves of Adoption #1
• The "AllSpark" (as in the Transformers movie)
  • First companies to adopt Hadoop as a way to mirror Google's approach
• Early Adopters
  • Inspired by early success stories, these engineering-focused companies extended Hadoop
• Followers
  • Companies that are OK with trying out new things
  • Still engineering driven
• Late Bloomers
  • First Enterprises
• New Wave
  • Everyone else…
[Diagram: adoption waves over time – AllSpark → Early Adopters → Followers → Late Bloomers → Enterprises (TODAY!)]
Waves of Adoption #2
• Batch: simple logic at bulk (batch processing of petabytes)
  • What: Reporting
  • With: SQL (Hive), Pig
  • Who: Analysts, Developers
• Lambda/Kappa?: streaming logic, likely in a Lambda architecture
  • What: Decision support
  • With: OLAP Analytics, Druid, Oryx
  • Who: Data architects, DevOps
• Hybrids (Lambda FTW?): complex analytics
  • What: Machine Learning, AI
  • With: Notebooks, DS Workbench, …
  • Who: Data Scientists
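The core of the Lambda idea can be shown in a few lines: a batch layer periodically recomputes an authoritative view, a speed layer keeps incremental results for events since the last batch run, and the serving layer merges the two. A toy sketch (all names and numbers illustrative, not any real framework's API):

```python
# Batch view: recomputed in full, e.g. nightly, from the master dataset.
batch_view = {"page_a": 1000, "page_b": 250}

# Speed view: incremental counts for events since the last batch run.
speed_view = {"page_a": 7, "page_c": 3}

def serve(key):
    """Serving layer: answer = batch result + real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

When the next batch run completes, the speed view for the covered period is discarded; Kappa-style designs instead drop the batch layer and recompute by replaying the stream.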
Stages: Storage & Processing • Ingest • Access • Automation & Management • Security • Onboarding & Discovery • Physical Systems
Storage & Processing
Storage
• Reliable and scalable systems: HDFS, Kafka, HBase
  • What about Kudu, Cassandra, MongoDB, …?
• Data laid out in a structured manner
  • Information Architecture
  • Physical storage (e.g. columnar)
Processing
• Generic framework: YARN
  • What about Mesos? Non-batch jobs?
• Resource management hooks
• Pluggable engines
  • MapReduce, Spark, …
  • MPP systems?
Information Architecture
• There is a need to define how data flows through the system and how it is organized
• This simplifies the onboarding process
• Can be simple, or arbitrarily complex
• Needs to be enforced as it is used
• A living system: it may need to adapt
• Define batch and stream interfaces
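Even a very simple information architecture pays off. As an illustrative sketch (the zone names and path scheme are assumptions, not a standard), a single helper that every pipeline uses to place data makes datasets discoverable and quotas enforceable:

```python
from datetime import date

def dataset_path(zone, source, dataset, day):
    """Illustrative naming convention: /data/<zone>/<source>/<dataset>/
    plus date-based partitions, so consumers can find data predictably
    and retention/quota rules can be applied per subtree."""
    return (f"/data/{zone}/{source}/{dataset}"
            f"/year={day.year}/month={day.month:02d}/day={day.day:02d}")

p = dataset_path("raw", "crm", "orders", date(2017, 3, 21))
# "/data/raw/crm/orders/year=2017/month=03/day=21"
```

Because the convention is code, it can be enforced at ingest time rather than documented and hoped for.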
Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn the service idea into YARN using kludges
  • Examples: Slider, Twill
Ingest
• Purpose
  • Receive data from heterogeneous sources
  • Save as-is, or do first-pass processing
  • Store data in the best format, aggregate small files
  • Comply with stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
  • Streamsets, Tamr, Waterline Data, Trifacta, IBM, …
  • Often a generic task, with Hadoop being only one target
• Open-source frameworks
  • NiFi
  • Flume (with Kafka)?
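The "aggregate small files" point deserves emphasis: HDFS tracks every file in NameNode memory, so millions of tiny files hurt. A minimal sketch of the idea, assuming nothing beyond plain Python (real ingest tools such as Flume or NiFi do this with file rolling policies):

```python
class SmallFileCompactor:
    """Illustrative sketch: buffer many small incoming payloads and
    write them out as one larger file, to avoid small-file pressure."""

    def __init__(self, flush_bytes=128 * 1024 * 1024):
        self.flush_bytes = flush_bytes   # roll roughly at HDFS block size
        self.buffer = []
        self.size = 0
        self.flushed = []                # stands in for files written to HDFS

    def ingest(self, payload: bytes):
        self.buffer.append(payload)
        self.size += len(payload)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            # One large file instead of many small ones.
            self.flushed.append(b"".join(self.buffer))
            self.buffer, self.size = [], 0
```

A time-based flush would be added in practice so slow sources do not buffer forever; that trade-off (latency vs. file size) is exactly what ingest tools expose as configuration.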
Access
• Hadoop traditionally has only a few interfaces
  • Interactive SQL: Shell, Notebooks, Hue; JDBC/ODBC
  • File access: WebHDFS/HttpFs
  • Gateways: REST, Knox
• Needs to be set up based on the use-case
  • Throughput vs. latency
• Must apply security rules
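WebHDFS is worth a concrete look, since it is often the simplest way in for external consumers: plain HTTP against the NameNode's web endpoint. A small sketch that only builds the request URL (host and user name are placeholders; in production the call would go through Knox or HttpFS and carry Kerberos/SPNEGO credentials rather than a `user.name` parameter):

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, path, op, user=None):
    """Build a WebHDFS REST URL (HDFS's HTTP gateway).
    `namenode` is the host:port of the NameNode web endpoint."""
    params = {"op": op}
    if user:
        params["user.name"] = user   # simple-auth user; not for secure clusters
    return f"http://{namenode}/webhdfs/v1{path}?{urlencode(params)}"

url = webhdfs_url("namenode:50070", "/data/raw/orders", "LISTSTATUS", user="etl")
```

Operations like `LISTSTATUS`, `OPEN`, and `GETFILESTATUS` are read-side; writes involve a two-step redirect to a DataNode, which is one reason gateways are preferred at the perimeter.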
Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• A full development lifecycle needs to be established
• Precious resources need to be managed
  • Easier if use-cases all fall into the same category
  • Difficult when they span many systems
• One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration
Automation
• Directed acyclic graphs (DAGs)
  • Define the actions and link them
  • Schedule based on various events (time or data)
  • Handle errors and maintenance
• Examples
  • Apache Oozie [2007; open-sourced 2010; Apache 2012] – Java; XML or Hue
  • Azkaban (LinkedIn) [2010] – Java
  • Luigi (Spotify) [2012] – Python
  • Apache Airflow (Airbnb) [2015] – Python
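All of these tools share the same core: a DAG of named actions executed in dependency order, with a failed action aborting the run. A deliberately minimal sketch of that core (no scheduling, retries, or cycle detection, which is most of what the real tools add):

```python
def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Runs each task after its dependencies; returns execution order.
    An exception in any task aborts the whole run, like a failed action."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)          # ensure dependencies ran first
        tasks[name]()              # the action itself
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

For example, with `deps = {"transform": ["extract"], "load": ["transform"]}`, the runner always executes extract → transform → load regardless of the order the tasks were declared in.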
Example: Notebooks
• Data scientists like prototyping, but how to bring the results into production?
• One attempt is to boost notebooks with a framework that can handle their chaining and execution
  • Shared resources used
  • Depends on notebook backends
Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
Security
• Many moving parts
  • Kerberos
  • RPC level
  • ACLs
  • RBAC
  • UIs
  • Data
  • Encryption (at-rest and in-transit)
• Hard to configure properly
• Management software helps to a degree
Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
  • Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding, explain the shared nature of Hadoop
  • Avoid "long faces" due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
• Push updated configuration and notifications
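Turning the onboarding answers into initial limits is a good candidate for the self-service automation mentioned above. A hypothetical sketch (all field names, headroom factor, and caps are placeholder policy, not a recommendation):

```python
def onboarding_defaults(answers):
    """Derive deliberately strict starting limits from onboarding answers.
    These would feed HDFS quota commands and YARN queue configuration."""
    headroom = 1.2   # illustrative: 20% above the stated need
    return {
        "hdfs_space_quota_gb": int(answers["expected_data_gb"] * headroom),
        "hdfs_file_quota": answers.get("expected_files", 100_000),
        # Cap the initial queue share; it can grow once usage is understood.
        "yarn_queue_capacity_pct": min(answers.get("requested_share_pct", 5), 10),
    }

cfg = onboarding_defaults({"expected_data_gb": 500, "requested_share_pct": 25})
# starts the tenant at a 600 GB space quota and a capped 10% queue share
```

Starting strict and loosening later is far easier, politically and technically, than clawing back resources from an established tenant.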
Stack Architecture
• Combine the reliable components into a whole stack
• Organize interfaces to outside systems by users and purpose
• Separate components for ease of maintenance
• Layer the network to fit the data flow
• Tight security control at vital points
Network Architecture
Wrap Up
Data pipeline deconstruction
“Oh… and I thought I just add Hadoop to our technology landscape… you know, like a database or an appliance.”
– Misled Decision Maker
Hype Curve
[Figure: the Gartner hype cycle – Visibility over Time: Technology Trigger → Peak of Inflated Expectations → Trough of Disillusionment → Slope of Enlightenment → Plateau of Productivity]
Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
  • "Shaky foundations"?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?
Hype Curve – The Hadoop Version
[Figure: the same curve, annotated along the way: "Big Data is strategic for us!" → First PoC → "Where are the results?" → "Darn, Hadoop is difficult!" → "Security? Multitenancy? Development? Lifecycle? Environments?" → "Maybe Hadoop is not for us?" → Allocate more resources & budget → First use-case in production → Hadoop team productivity]
Meanwhile…
Summary
• Data pipelines span many levels of architecture
  • Hardware, networking, information, security, data management
• Core Hadoop itself provides only a little in that regard
  • Vendors offer some support (closed or open source)
• Use-cases are often unknown
  • Guess as well as possible, generalize
• Careful planning is vital; mistakes are costly
  • Mixed workloads are a nightmare for resource management
  • Keep things simple (KISS principle)
• Knowledge needs to be built upfront
  • Hire someone in the know!
Thank You!@larsgeorge