Cloud Austin Meetup - Hadoop Like a Champion


Upload: ameet-paranjape

Post on 14-Jul-2015


Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Data Access with Hadoop


@ameetp512

Ameet Paranjape


Interactive and real-time data analysis in Hadoop!


2013 digital universe: 2.3 Zettabytes

2020 digital universe: 40 Zettabytes

• 85% of growth from new types of data, with machine-generated data increasing 15x
• Analysts' consensus estimates enterprise data growth of 50x year over year through 2020

1 Zettabyte (ZB) = 1 million Petabytes (PB); Sources: IDC and IDG Enterprise
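For a rough sense of scale, the two totals on this slide can be converted and compared with a few lines of arithmetic. This is only a sketch: the 2013 and 2020 figures and the ZB-to-PB conversion come from the slide, while the variable names and the smooth compound-growth calculation are illustrative assumptions.

```python
# Figures taken from the slide; everything else is illustrative.
PB_PER_ZB = 1_000_000  # 1 ZB = 1 million PB

universe_2013_zb = 2.3
universe_2020_zb = 40.0

# Express both totals in petabytes.
universe_2013_pb = universe_2013_zb * PB_PER_ZB
universe_2020_pb = universe_2020_zb * PB_PER_ZB

# Overall multiple between the two years.
growth_multiple = universe_2020_zb / universe_2013_zb

# Implied compound annual growth rate over the 7 years from 2013 to 2020.
# Assumption: growth compounds smoothly between the two data points.
years = 2020 - 2013
implied_annual_growth = growth_multiple ** (1 / years) - 1

print(f"2013: {universe_2013_pb:,.0f} PB, 2020: {universe_2020_pb:,.0f} PB")
print(f"Overall growth: {growth_multiple:.1f}x")
print(f"Implied annual growth: {implied_annual_growth:.0%}")
```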


Traditional systems under pressure

• Silos of Data
• Costly to Scale
• Constrained Schemas

…and difficult to manage new data

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM: RDBMS, EDW, MPP

SOURCES:
• Existing Sources (CRM, ERP,…)
• New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs


Virtualization

Slicing your servers into pieces so you can parcel out computing resources


Hadoop

Tying your servers together to make them act like one big computer


Cost of storage is going down

According to StatisticBrain, the average cost per gigabyte of storage was

$437,500 in 1980, $11 in 2000, and just five cents in 2013.
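To make the price drop concrete, the quoted per-gigabyte figures can be scaled up to a petabyte. This is a minimal sketch: the three prices are the ones on the slide, while the function name and the decimal convention of 1 PB = 1,000,000 GB are assumptions made for illustration.

```python
# Average cost per gigabyte, in cents, as quoted on the slide.
COST_PER_GB_CENTS = {1980: 43_750_000, 2000: 1_100, 2013: 5}

GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def cost_per_petabyte_dollars(year: int) -> int:
    """Dollar cost of one petabyte at that year's quoted per-GB price."""
    # Work in integer cents to avoid floating-point rounding.
    return COST_PER_GB_CENTS[year] * GB_PER_PB // 100


for year in sorted(COST_PER_GB_CENTS):
    print(f"{year}: ${cost_per_petabyte_dollars(year):,} per PB")
```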


Hadoop 101

The basics

1. Hadoop ties your servers together, and makes them act like one big computer

• So you can use inexpensive servers to do your big data processing

2. Hadoop works well with structured, semi-structured, and unstructured information


Hadoop and the Modern Data Architecture (MDA)

SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured

DATA SYSTEM:
• RDBMS, EDW, MPP
• YARN: Data Operating System (Batch, Interactive, Real-Time)
• HDFS (Hadoop Distributed File System), nodes 1 … N

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications


It’s crowded out there!


Recommended Reading

The Forrester Wave Report – Big Data Hadoop Solutions, Q1 2014


Hadoop Comparison Tips

1. Is the solution open or closed source?

2. If code is open, who owns the IP?

3. What’s available for free and what do you pay for?

4. Is the solution substrate agnostic?

5. OS support options?

6. Partnerships

7. What’s the pricing model?

8. Local resources to help?


A Blueprint for Enterprise Hadoop

• PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
• ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
• GOVERNANCE & INTEGRATION: Load data and manage according to policy
• DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
• DATA MANAGEMENT: Store and process all of your Corporate Data Assets (YARN Data Operating System)
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• OPERATIONS: Deploy and effectively manage the platform
• DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud


Apache Hadoop & A Hadoop “Distribution”

Apache Hadoop is a project

Governed by Apache Software Foundation (ASF)

Comprises YARN and HDFS

Hadoop distribution is a package of projects (e.g. HDP)

Packages Apache Hadoop and related Apache projects

It extends Hadoop with:

–Data access services to manipulate the data

–Data governance and integration services

–Security services

–Operational services to manage the cluster

Tested for consistency across the entire package

Hardened for the enterprise


YARN has transformed Hadoop

DATA ACCESS: Batch, Interactive & Real-Time

DATA MANAGEMENT:
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 … N


Apache Projects for Data Access

Apache Pig

Apache Hive

Apache HBase

Apache Storm

Apache Solr

Apache Spark

Traditional Tools

DATA ACCESS (Batch, Interactive & Real-Time):
• Script: Pig
• Search: Solr
• SQL: Hive, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• In-Memory: Spark
• Others: ISV Engines
• Tez

DATA MANAGEMENT:
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 … N


Apache Projects for Governance

GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

[Diagram: the governance column sits beside the same DATA ACCESS and DATA MANAGEMENT (YARN, HDFS) layers shown on the previous slide.]

Apache Falcon

Apache Sqoop

Apache Flume

Hadoop NFS & WebHDFS


Apache Projects for Security

Apache Knox

Apache Argus

Entire Stack

(HDFS, Hive, YARN)

SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS
• Resources: YARN
• Access: Hive, …
• Pipeline: Falcon
• Cluster: Knox

[Diagram: the security column is added to the DATA ACCESS, GOVERNANCE & INTEGRATION and DATA MANAGEMENT (YARN, HDFS) layers from the previous slides.]


Apache Projects for Operations

Apache Ambari

Apache Zookeeper

Apache Oozie

OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie

[Diagram: the operations column is added to the DATA ACCESS, GOVERNANCE & INTEGRATION, SECURITY and DATA MANAGEMENT (YARN, HDFS) layers from the previous slides.]


Remember the MDA

SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured

DATA SYSTEM:
• RDBMS, EDW, MPP
• YARN: Data Operating System (Batch, Interactive, Real-Time)
• HDFS (Hadoop Distributed File System), nodes 1 … N

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications


What is Data Access?

Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive and Real-time. Multiple engines provide optimized access to your mission-critical data.


Access patterns enabled by YARN

• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time

YARN: Data Operating System (Batch, Interactive, Real-Time) on HDFS (Hadoop Distributed File System), nodes 1 … N
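One way to read the three patterns is as a mapping from how quickly an answer is needed to the kind of engine that should serve it. The sketch below is only an illustration: the slide gives no numeric limits, so the two cut-off values and the `classify` function are hypothetical.

```python
from enum import Enum


class AccessPattern(Enum):
    BATCH = "batch"
    INTERACTIVE = "interactive"
    REAL_TIME = "real-time"


# Hypothetical cut-offs, chosen only for illustration; not from the slide.
REAL_TIME_MAX_SECONDS = 1.0      # stands in for "machine execution time"
INTERACTIVE_MAX_SECONDS = 60.0   # stands in for "human time"


def classify(required_response_seconds):
    """Map a latency requirement to one of the three access patterns.

    None means the job has no deadline, which the slide calls batch.
    """
    if required_response_seconds is None:
        return AccessPattern.BATCH
    if required_response_seconds <= REAL_TIME_MAX_SECONDS:
        return AccessPattern.REAL_TIME
    if required_response_seconds <= INTERACTIVE_MAX_SECONDS:
        return AccessPattern.INTERACTIVE
    return AccessPattern.BATCH


print(classify(None).value)   # a nightly report with no deadline
print(classify(0.05).value)   # a sensor alert
print(classify(5).value)      # an analyst waiting on a query
```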


HBase

• Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS).
• It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data.
• It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
• HBase was created for hosting very large tables with billions of rows and millions of columns.

Developers use it to:
• Provide low-latency access to massive amounts of data (e.g. recommendation engine results)
• Serve as a document store
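The idea of a sparse table that supports inserts, updates and deletes can be pictured with a toy in-memory structure. This is plain Python, not the HBase client API: the `SparseTable` class, the row keys and the `family:qualifier` column names are all hypothetical, and real HBase adds persistence, versioning and distribution that are not modelled here.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory table: row key -> {"family:qualifier": value}.

    Only cells that were written take up space, so a row may use a
    handful of columns out of millions of possible ones.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Insert and update are the same operation here: latest write wins.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # .get() on the outer dict avoids creating empty rows on reads.
        return self._rows.get(row_key, {}).get(column, default)

    def delete(self, row_key, column):
        row = self._rows.get(row_key)
        if row is not None:
            row.pop(column, None)
            if not row:
                del self._rows[row_key]

    def cell_count(self):
        return sum(len(cells) for cells in self._rows.values())


table = SparseTable()
table.put("user42", "recs:top_item", "sku-1001")  # insert
table.put("user42", "recs:top_item", "sku-2002")  # update
table.put("user99", "profile:city", "Austin")
table.delete("user99", "profile:city")            # delete
print(table.get("user42", "recs:top_item"))
```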


Spark

• Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets.
• Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query.
• Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications.
• Spark is YARN-ready – another engine on YARN!

Developers use it for:
• Data science: machine learning and iterative analytics
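The benefit of loading data into memory once and querying it repeatedly can be shown without Spark itself. The following is a plain-Python sketch of that idea, not Spark's API: `CachedDataset`, `load_measurements` and the sample values are hypothetical, and the loop is a simple iterative estimate standing in for a machine-learning job.

```python
class CachedDataset:
    """Loads rows once and keeps them in memory for repeated queries."""

    def __init__(self, loader):
        self._loader = loader   # callable that reads from slow storage
        self._rows = None
        self.load_count = 0

    def rows(self):
        if self._rows is None:  # first access: pay the load cost once
            self._rows = list(self._loader())
            self.load_count += 1
        return self._rows


def load_measurements():
    # Stand-in for an expensive read from disk; the values are made up.
    return [2.0, 4.0, 6.0, 8.0]


data = CachedDataset(load_measurements)

# Iterative refinement: every pass re-reads the same in-memory rows.
# Each step moves the estimate halfway toward the mean of the data.
estimate = 0.0
for _ in range(50):
    rows = data.rows()
    gradient = sum(estimate - x for x in rows) / len(rows)
    estimate -= 0.5 * gradient

print(f"estimate={estimate:.3f}, loads={data.load_count}")
```

Fifty passes touch the data fifty times, but the loader runs only once; that reuse is what makes iterative workloads a natural fit for in-memory processing.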


Stream Processing in Hadoop

How do I deal with this continuous stream of data coming in from sensors…etc?

Apache Storm: Real-time event processing for sensor and business activity monitoring

• Unlocks new business cases for Hadoop
• Key component of a data lake architecture
• Scale: Ingest millions of events per second. Fast query on petabytes of data
• Integrated with Ambari to manage
• Predictive Analytics

Sentiment, Clickstream, Machine/Sensor, Server Logs, Geo-location

Monitor real-time data to…

• Finance: prevent securities fraud and compliance violations; optimize order routing and pricing
• Telco: prevent security breaches and network outages; optimize bandwidth allocation and customer service
• Retail: optimize offers and pricing
• Manufacturing: prevent machine failures; optimize supply chain
• Transportation: prevent driver & fleet issues; optimize routes and pricing
• Web: prevent application failures and operational issues; optimize site content

YARN: Data Operating System (Batch, Interactive, Real-Time)


Trucking company w/ large fleet of trucks in Midwest

A truck generates millions of events for a given route; an event could be:

• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance

Company uses an application that monitors truck locations and violations from the truck/driver in real-time

Route? Truck? Driver?

Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
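The real-time half of this scenario amounts to inspecting each event as it arrives and raising an alert for violations. Here is a minimal plain-Python sketch of that filter, assuming a hypothetical event schema; the field names, event type labels and sample events are invented for illustration, and a production system would run this logic inside a stream processor rather than a Python generator.

```python
from dataclasses import dataclass

# Hypothetical labels for the violation categories named on the slide.
# Any other event type (for example start or stop) is treated as normal.
VIOLATION_EVENTS = {
    "speeding",
    "excessive_acceleration",
    "excessive_braking",
    "unsafe_tail_distance",
}


@dataclass(frozen=True)
class TruckEvent:
    truck_id: str
    driver_id: str
    route_id: str
    event_type: str


def is_violation(event):
    return event.event_type in VIOLATION_EVENTS


def monitor(events):
    """Yield an alert string for each violation as events stream in."""
    for event in events:
        if is_violation(event):
            yield (f"ALERT truck={event.truck_id} driver={event.driver_id} "
                   f"route={event.route_id} type={event.event_type}")


# Made-up sample stream.
stream = [
    TruckEvent("T1", "D7", "R-East", "start"),
    TruckEvent("T1", "D7", "R-East", "speeding"),
    TruckEvent("T2", "D3", "R-West", "stop"),
    TruckEvent("T2", "D3", "R-West", "unsafe_tail_distance"),
]
alerts = list(monitor(stream))
for alert in alerts:
    print(alert)
```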


Solutions on Hadoop Require All!

• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm)
• Alerts & Events (ActiveMQ)
• Real-Time User Interface
• Real-time Serving (HBase)
• Interactive Query (Hive on Tez)
• Microsoft Excel
• Many Workloads: YARN
• Distributed Storage: HDFS


Query Executes Blazingly Fast with Hive 13 on Tez

Do Specific Routes Cause More Issues?

Do Specific Trucks Cause More Issues?

Do Specific Drivers in Trucks Cause More Issues?
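Each of these three questions is a group-and-count over the violation history. The deck runs them as Hive queries on Tez; the sketch below only mirrors the shape of that aggregation in plain Python, using a made-up list of records and a hypothetical `violations_by` helper.

```python
from collections import Counter

# Made-up violation history: one (route, truck, driver) tuple per violation.
violations = [
    ("R-East", "T1", "D7"),
    ("R-East", "T1", "D7"),
    ("R-East", "T2", "D3"),
    ("R-West", "T2", "D3"),
    ("R-East", "T3", "D7"),
]


def violations_by(field_index, records):
    """Count violations grouped by one field (0=route, 1=truck, 2=driver)."""
    return Counter(record[field_index] for record in records)


by_route = violations_by(0, violations)
by_truck = violations_by(1, violations)
by_driver = violations_by(2, violations)

# Third question on the slide: specific drivers in specific trucks.
by_driver_in_truck = Counter(
    (driver, truck) for _, truck, driver in violations
)

print("Routes:", by_route.most_common())
print("Trucks:", by_truck.most_common())
print("Driver/truck pairs:", by_driver_in_truck.most_common())
```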


Try it out...


YARN Has Fundamentally Changed Hadoop

Enterprise Hadoop Enables…

• More Workloads: From batch to interactive & real-time
• More Data: Multiple data sets of varying types and structures
• More Value: Hosting multiple business cases in a single Hadoop cluster