Hadoop - Looking to the Future, by Arun Murthy

Upload: huguk

Post on 18-Jul-2015

TRANSCRIPT

Hadoop – Looking to the Future

Arun C. Murthy

Hortonworks Co-Founder

@acmurthy

[Diagram, 2006: cluster nodes 1 … N; MapReduce (largely batch processing) on HDFS (Hadoop Distributed File System)]

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop w/ MapReduce

Traditional Hadoop allowed early adopters to deal with data at scale, however…

• Single-purpose clusters, specific data sets

• Primarily a batch system using MapReduce

• Difficult to natively integrate existing applications

• Limited enterprise capabilities: Operations, Security & Governance

In the beginning…

[Timeline: 2006 and 2009]

[Diagram: cluster nodes 1 … N; MapReduce (largely batch processing) on HDFS (Hadoop Distributed File System)]

Hadoop w/ MapReduce

MAPREDUCE-279

Common data, multiple applications

• Support multi-tenant cluster

• Batch, interactive & real-time use cases can leverage the most appropriate engine

Architectural Center

• Consistent security, governance & operations

• Ecosystem applications run natively in Hadoop

Apache Hadoop 2.0 & YARN

October 23, 2013

[Diagram: YARN: Data Operating System on cluster nodes 1 … N over HDFS (Hadoop Distributed File System). Labels: Batch; Interactive; Real-Time; Batch: MapReduce]

Apache Tez: Flexible & More Efficient Execution Engine

[Diagram comparing the Hadoop 1 and Hadoop 2 stacks. Labels: YARN: Data Operating System; Batch & Interactive: Apache Tez; Batch: MapReduce; SQL: Apache Hive; Data Flow: Apache Pig; Java Apps: Cascading; Others; cluster nodes 1 … N; HDFS (Hadoop Distributed File System)]

Hadoop 1: Batch System w/ MapReduce as base

Hadoop 2: Apache Tez supports both interactive & batch processing

[Diagram: YARN: Data Operating System on cluster nodes 1 … N over HDFS (Hadoop Distributed File System). Labels: Legacy: MapReduce; Interactive SQL: Apache Tez; Other Engines & Workloads; Apache Hive: SQL; Business Analytics; Custom Apps]

Apache Hive and the Power of YARN

Stinger Initiative: Next-generation SQL-based interactive query in Hadoop

Speed: Performance increased 100x for interactive & batch use cases

Scale: Queries from GBs, to TBs to PBs

SQL: Broadest range of SQL semantics

Apache Hive Community

1,672 Jira Tickets Closed

145 Developers

44 Companies

~390,000 Lines Of Code Added… (2x)

13 Months

Dramatically faster queries speed time to insight

[Chart: query times for Hive 10, Hive 12 and Hive 13, on a scale from seconds to thousands of seconds]

[Diagram: YARN: Data Operating System on cluster nodes 1 … N over HDFS (Hadoop Distributed File System). Labels: Legacy: MapReduce; Interactive SQL: Apache Tez; Other Engines & Workloads; Apache Hive: SQL; Business Analytics; Custom Apps]

Apache Hive – Interactive SQL in Hadoop

Stinger: Next-generation SQL-based interactive query in Hadoop

ORC: IO improvements; efficient processing via complex pushdown

Tez: Powerful primitives for the SQL Planner

VQP: Efficient CPU utilization in Inner Loop
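
The "pushdown" idea on this slide can be illustrated with a minimal sketch. It assumes each block of column data carries precomputed min/max statistics, so a filter can skip whole blocks without reading their rows. The `Stripe` class, `make_stripe` and `scan_greater_than` are hypothetical names for illustration only; this is not the ORC reader API or Hive's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical block of column values with precomputed statistics."""
    min_value: int
    max_value: int
    values: List[int]


def make_stripe(values: List[int]) -> Stripe:
    # Statistics are computed once, when the block is written.
    return Stripe(min(values), max(values), values)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the values above threshold and how many stripes were read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Pushdown step: if even the largest value fails the predicate,
        # skip the stripe without touching its rows.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 5, 9]), make_stripe([12, 18, 30]), make_stripe([2, 3, 4])]
matches, stripes_read = scan_greater_than(stripes, 10)
print(matches, stripes_read)
```

In this toy data set only one of the three blocks has to be read for the predicate `value > 10`; the other two are eliminated from their statistics alone.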

Sub-Second SQL with Hive LLAP

Stinger.Next: Sub-second SQL in Hadoop via Hive/LLAP

CBO: The “right” plan executed violently…

LLAP

Metastore: Extensive stats & scalability

[Diagram: YARN: Data Operating System on cluster nodes 1 … N over HDFS (Hadoop Distributed File System). Labels: LLAP; Apache Tez; Other Engines & Workloads; Apache Hive: Sub-second SQL; Business Analytics; Custom Apps]

Long-lived daemon for low-latency startup, caching & CPU efficiency via JIT

Apache Slider For “Always-on” Services

“Slide” apps on YARN

Democratize access to storage (HDFS) and compute (YARN)

Ease management (Ambari) in addition to deployment

[Diagram: YARN: Data Operating System on cluster nodes 1 … N over HDFS (Hadoop Distributed File System). Labels: Real-Time: Slider; NoSQL: Apache HBase; NoSQL: Apache Accumulo; Stream: Apache Storm; Others: ISV]

© Hortonworks Inc. 2015. All Rights Reserved

Data Governance Initiative

Requirements

1. Hadoop must snap in to the existing frameworks and openly exchange metadata

2. Hadoop must address governance within its own stack of technologies

Engineers from a group of companies dedicated to meeting these requirements in the open

New Apache project proposal

[Diagram: Data Governance Initiative components]

Knowledge Store

Audit Store (Ranger)

Models, Type-System

Policy Rules, Taxonomies

Tag Based Policies

Data Lifecycle Management (Falcon)

Real-time Tag-based Access Control (Ranger)

REST API

Services: Search, Lineage, Exchange

Healthcare: HIPAA, HL7

Financial: SOX, Dodd-Frank

Energy: PPDM

Retail: PCI, PII

Other: CWM

Hadoop - Redefined

[Diagram: the redefined Hadoop stack on cluster nodes 1 … N. YARN (Cluster Resource Management) over HDFS (Storage Management), alongside DGI (Data Governance & Metadata Management). Labels: Batch: MR; Tez: Script (Pig), SQL (Hive), Java (Cascading), Others (Engines); Slider: NoSQL (HBase), Stream (Storm), NoSQL (Accumulo), Others (Engines); In-Memory: Spark; PaaS: Kubernetes; LASR; HPA]

HDFS - Futures

Sanjay Radia

HDFS – Tiered Storage

HDFS Ozone – Object Store

Thank You

@acmurthy