Hadoop - Looking to the Future by Arun Murthy
TRANSCRIPT
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
MapReduce: Largely Batch Processing
2006
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop w/ MapReduce
Traditional Hadoop allowed early adopters to deal with data at scale, however…
• Single purpose clusters, specific data sets
• Primarily a batch system using MapReduce
• Difficult to natively integrate existing applications
• Limited enterprise capabilities:
Operations, Security & Governance
In the beginning…
2006, 2009
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
MapReduce: Largely Batch Processing
Hadoop w/ MapReduce
MAPREDUCE-279
Common data,
multiple applications
• Support multi-tenant cluster
• Batch, interactive & real-time
use cases can leverage the
most appropriate engine
Architectural Center
• Consistent security,
governance & operations
• Ecosystem applications
run natively in Hadoop
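The multi-tenant idea in the bullets above can be illustrated with a toy model. This is only a sketch of several workloads drawing containers from one shared pool; ToyCluster, its methods, and the application names are hypothetical and are not the YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy stand-in for a shared cluster (hypothetical, not the YARN API)."""
    total_containers: int
    allocations: dict = field(default_factory=dict)  # app name -> containers held

    def free(self) -> int:
        """Containers not currently held by any application."""
        return self.total_containers - sum(self.allocations.values())

    def request(self, app: str, containers: int) -> int:
        """Grant up to `containers` from whatever capacity is still free."""
        granted = min(containers, self.free())
        if granted > 0:
            self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app: str) -> None:
        """Return all containers held by `app` to the shared pool."""
        self.allocations.pop(app, None)


# Three different kinds of workload share the same ten containers.
cluster = ToyCluster(total_containers=10)
batch = cluster.request("batch-mapreduce", 6)
interactive = cluster.request("interactive-sql", 3)
realtime = cluster.request("real-time-stream", 4)  # only one container is left
print(batch, interactive, realtime, cluster.free())
```

When the batch job releases its containers, the same capacity becomes available to the other engines, which is the point of running them on one cluster rather than on single-purpose clusters.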
Apache Hadoop 2.0 & YARN
October 23, 2013
YARN: Data Operating System
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
Batch, Interactive, Real-Time
Batch: MapReduce
Apache Tez: Flexible & More Efficient Execution Engine
YARN: Data Operating System
Batch & Interactive: Apache Tez
SQL: Apache Hive
Data Flow: Apache Pig
Java Apps: Cascading
Others
Batch: MapReduce
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
SQL: Apache Hive
Data Flow: Apache Pig
Others
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
Hadoop 1: Batch system w/ MapReduce as base
Hadoop 2: Apache Tez supports both interactive & batch processing
YARN: Data Operating System
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
Legacy: MapReduce
Interactive SQL
Apache Tez
Other Engines & Workloads
Apache Hive: SQL
Business Analytics
Custom Apps
Apache Hive and the Power of YARN
Stinger Initiative: Next-generation SQL-based interactive query in Hadoop
Speed: Performance increased 100x for interactive & batch use cases
Scale: Queries from GBs to TBs to PBs
SQL: Broadest range of SQL semantics
Apache Hive Community
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
13 Months
Dramatically faster queries speed time to insight
[Chart: query times for Hive 10, Hive 12 and Hive 13; axis labels: seconds, thousands of seconds]
YARN: Data Operating System
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
Legacy: MapReduce
Interactive SQL
Apache Tez
Other Engines & Workloads
Apache Hive: SQL
Business Analytics
Custom Apps
Apache Hive – Interactive SQL in Hadoop
Stinger: Next-generation SQL-based interactive query in Hadoop
ORC: I/O improvements, efficient processing via complex pushdown
Tez: Powerful primitives for the SQL planner
VQP: Efficient CPU utilization in the inner loop
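As a rough illustration of how the pieces named above are switched on in a Hive session, the sketch below assembles HiveQL statements as plain strings. The property names are Hive configuration keys as recalled from the Hive documentation and should be checked against the Hive version in use; the helper functions, the table name page_views, and its columns are hypothetical.

```python
from typing import Dict, List


def hive_session_settings(engine: str = "tez") -> List[str]:
    """Return HiveQL SET statements for the features named on the slide."""
    if engine not in {"mr", "tez"}:
        raise ValueError(f"unsupported engine: {engine}")
    return [
        f"SET hive.execution.engine={engine};",         # run queries on Tez instead of MapReduce
        "SET hive.vectorized.execution.enabled=true;",  # vectorized query processing (VQP)
        "SET hive.optimize.ppd=true;",                  # predicate pushdown in the planner
        "SET hive.optimize.index.filter=true;",         # let formats such as ORC use pushed-down filters
    ]


def create_orc_table(table: str, columns: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement that stores the table as ORC."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return f"CREATE TABLE {table} ({cols}) STORED AS ORC;"


# Hypothetical example table, used only to show the generated script.
script = hive_session_settings() + [
    create_orc_table("page_views", {"user_id": "BIGINT", "url": "STRING"})
]
print("\n".join(script))
```

The generated statements could be pasted into a Hive session; whether each setting helps depends on the data and the Hive release.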
Sub-Second SQL with Hive LLAP
Stinger.Next: Sub-second SQL in Hadoop via Hive/LLAP
CBO: The “right” plan executed violently…
LLAP: Long-lived daemon for low-latency startup, caching & CPU efficiency via JIT
Metastore: Extensive stats & scalability
YARN: Data Operating System
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
LLAP
Apache Tez
Other Engines & Workloads
Apache Hive: Sub-second SQL
Business Analytics
Custom Apps
Apache Slider For “Always-on” Services
“Slide” apps on YARN
Democratize access to storage (HDFS) and compute (YARN)
Ease management (Ambari) in addition to deployment
YARN: Data Operating System
Real-Time: Slider
NoSQL: Apache HBase
NoSQL: Apache Accumulo
Stream: Apache Storm
Others: ISV
[Diagram: cluster nodes 1 to N]
HDFS (Hadoop Distributed File System)
© Hortonworks Inc. 2015. All Rights Reserved
Data Governance Initiative
Requirements
1. Hadoop must snap in to the
existing frameworks and
openly exchange metadata
2. Hadoop must address
governance within its own
stack of technologies
Engineers from a group of companies dedicated
to meeting these requirements in the open
New Apache
project proposal
Knowledge Store
Audit Store (Ranger)
Models
Type-System
Policy Rules
Taxonomies
Tag Based Policies
Data Lifecycle Management (Falcon)
Real-time Tag-based Access Control (Ranger)
REST API
Services: Search, Lineage, Exchange
Healthcare: HIPAA, HL7
Financial: SOX, Dodd-Frank
Energy: PPDM
Retail: PCI, PII
Other: CWM
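The tag-based policies mentioned on this slide can be pictured with a small toy check. This is a sketch under stated assumptions and not the Ranger API: TagPolicy, is_allowed, and the group names are hypothetical, and only the tags PII and PCI come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy: which groups may read data carrying a given tag."""
    tag: str
    allowed_groups: frozenset


def is_allowed(user_groups: Set[str], column_tags: Set[str],
               policies: Dict[str, TagPolicy]) -> bool:
    """Allow access only if every tag on the column admits one of the user's groups."""
    for tag in column_tags:
        policy = policies.get(tag)
        if policy is None:
            return False  # toy choice: tagged data without a policy is denied
        if not (user_groups & policy.allowed_groups):
            return False
    return True  # untagged columns fall through to here and are allowed


# Hypothetical groups attached to two tags taken from the slide.
policies = {
    "PII": TagPolicy("PII", frozenset({"compliance"})),
    "PCI": TagPolicy("PCI", frozenset({"compliance", "payments"})),
}

print(is_allowed({"payments"}, {"PCI"}, policies))
print(is_allowed({"payments"}, {"PCI", "PII"}, policies))
```

The design point is that the policy is written once against a tag rather than against each table or column, so newly tagged data is covered without a new rule.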
Hadoop - Redefined
YARN (Cluster Resource Management)
Script: Pig (Tez)
SQL: Hive (Tez)
Java: Cascading (Tez)
Others: Engines (Tez)
NoSQL: HBase (Slider)
Stream: Storm (Slider)
NoSQL: Accumulo (Slider)
Others: Engines (Slider)
In-Memory: Spark
PaaS: Kubernetes
LASR
HPA
Batch: MR
[Diagram: cluster nodes 1 to N]
HDFS (Storage Management)
DGI (Data Governance & Metadata Management)