Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3



DESCRIPTION

Red Hat JBoss Data Virtualization and HDP: Enabling the Data Lake (demo and deep dive)

TRANSCRIPT

Page 1: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Q&A box is available for your questions

Webinar will be recorded for future viewing

Thank you for joining!

We’ll get started soon…

Page 2: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Deliver the Data Lake (demo/deep dive) …using HDP and Red Hat JBoss Data Virtualization

We do Hadoop.

Page 3: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Your speakers…

Raghu Thiagarajan, Director, Partner Product Management, Hortonworks

Kimberly Palko, Principal Product Manager, Red Hat

Kenny Peeples, Principal Technical Marketing Manager, Red Hat

Page 4: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


An architectural shift towards an HDP Data Lake

(Slide diagram: traditional RDBMS, MPP, and EDW silos compared on scale and scope with a Data Lake enabled by YARN, built on the HDP 2.1 stack of Governance & Integration, Security, Operations, Data Access, and Data Management, and feeding new analytic apps or IT optimization.)

Unlocking the Data Lake

•  Single data repository, shared infrastructure
•  Multiple business apps accessing all the data
•  Enable a shift from reactive to proactive interactions
•  Gain new insight across the entire enterprise

Page 5: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


What is a Data Lake?

Architectural pattern in the data center: uses Hadoop to deliver deeper insight across a large, broad, diverse set of data efficiently

§  Multipurpose, open PLATFORM for data (NOT a database)
§  Land all data in a single place and interact with it in many ways
§  Allows for the ecosystem to provide higher-level services (SAS, SAP, Microsoft for streaming, MPP, in-memory, etc.)
§  First-class data management capabilities (metadata management, security, transformation pipelines, replication, retention, etc.)

Page 6: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


HDP Data Lake Solution Architecture

(Slide diagram: the Data Lake HDP Grid of compute and storage nodes, managed by YARN and AMBARI, with Apache Argus providing unified access controls and audit.

Step 1: Extract & Load. Ingestion of source data (clickstream, sales transactions, product data, marketing/inventory, social data, EDW, NFS) via SQOOP, FLUME, WebHDFS, NFS, STORM, REST, HTTP, streaming, and JMS.

Step 2: Model/Apply Metadata. HCATALOG holds table and user-defined metadata.

Step 3: Transform, Aggregate & Materialize. Data processing with HIVE, PIG, and Cascading on TEZ and MR2.

Step 4: Schedule and Orchestrate. FALCON provides data pipeline and flow management.

Manage Steps 1-4: data management with Falcon, security with HDP Advanced Security.

Use Case Type 1: Materialize & Exchange. Sqoop/Hive and the HBase client exchange data with downstream data sources such as OLTP HBase and the EDW (Teradata).

Use Case Type 2: Explore/Visualize. The interactive Hive Server (Tez/Stinger) serves query, analytics, and reporting tools such as Tableau, Excel, Microstrategy, Datameer, Platfora, and Business Objects.

Opens up many new use cases: YARN apps for stream processing (Storm), real-time search (Solr), MPI, graph, Mahout, and SAS.)

Page 7: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


HDP Data Lake Solution Architecture + Virtual Data Mart

(Slide diagram: the same HDP data lake solution architecture as on page 6, with a virtual data mart layer added on top: a departmental base virtual database (VDB) exposing a Team 1 VDB and a Team 2 VDB, each with its own views, View 1 and View 2.)

Page 8: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


YARN allows for new processing engines

(Slide diagram: the same HDP data lake solution architecture as on page 6, highlighting the processing engines that run as YARN apps: MR2, TEZ, Storm for stream processing, Solr for real-time search, MPI, graph, Mahout, and SAS, plus the interactive Hive Server on Tez/Stinger.)

Page 9: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Falcon enables Governance of Data Pipelines

(Slide diagram: the same HDP data lake solution architecture as on page 6, highlighting FALCON, which provides data pipeline and flow management across ingestion, metadata, processing, and orchestration.)

Page 10: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Apache Falcon: Data Governance in the Lake

(Slide diagram: a data pipeline of Raw, Clean, and Prep stages is defined in Falcon, which auto-generates and orchestrates multiple complex Oozie workflows (Job1 ... JobN) and other Hadoop ecosystem tools, e.g. DistCp.)

Falcon adds the required data governance features:

•  DEFINITION: replication | retention | eviction | late data
•  MONITORING
•  TRACING: audit | lineage | tagging

Page 11: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 12: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 13: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 14: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 15: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 16: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Mashing up diverse data types in the Data Lake

Page 17: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Virtual Data Marts with Red Hat JBoss Data Virtualization and Hortonworks HDP

Kimberly Palko

Page 18: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Data Supply and Integration Solution

Data Virtualization sits in front of multiple data sources and

✓  allows them to be treated as a single source
✓  delivering the desired data
✓  in the required form
✓  at the right time
✓  to any application and/or user.

THINK VIRTUAL MACHINE FOR DATA

Page 19: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Easy Access to Big Data

•  The reporting tool accesses the data virtualization server via a rich SQL dialect
•  The data virtualization server translates the rich SQL dialect to HiveQL
•  Hive translates HiveQL to MapReduce
•  The MapReduce job runs against the big data in HDFS

(Slide diagram: Analytical Reporting Tool, Data Virtualization Server, Hive, MapReduce, and HDFS, with Hive, MapReduce, and HDFS inside Hadoop over the big data.)
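To make that flow concrete, here is a minimal JDBC sketch of the reporting tier's access, assuming a Teiid-style JDBC driver for the data virtualization server, a VDB named DataLakeVDB, a Hive-backed view named clickstream, and placeholder host, port, and credentials; none of these names are taken from the webinar demo.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VdbQueryExample {
    public static void main(String[] args) throws Exception {
        // Illustrative connection URL: JBoss Data Virtualization (Teiid) typically exposes
        // a VDB over its JDBC transport; the VDB name, host, and port are assumptions.
        String url = "jdbc:teiid:DataLakeVDB@mm://dv-server.example.com:31000";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // Plain SQL against a virtual view; the DV server rewrites this into HiveQL
             // and pushes it down to Hive, which runs it as MapReduce on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits " +
                     "FROM clickstream GROUP BY page ORDER BY hits DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```

The point is that the client issues ordinary SQL; the translation to HiveQL and MapReduce happens entirely behind the virtualization server.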

Page 20: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Use Case 1: Combine data from Hadoop with traditional data sources

Problem: Data from new data sources like social media, clickstream, and sensors needs to be combined with data from traditional sources to get the full value.

Solution: Leverage JBoss Data Virtualization to mash up new data in Hadoop with data in traditional data sources without moving or copying any data, and access it through a variety of BI tools and SOA technologies. A sketch of such a federated query follows below.

(Slide diagram: Connect, Compose, Consume. Data can be accessed by multiple tools and methods already in-house through JBoss Data Virtualization. SOURCE 1: Hive/Hadoop contains data from new data sources like social media, clickstream, and sensor data. SOURCE 2: traditional relational databases in the enterprise.)
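As a hedged illustration of the mashup described above, the sketch below joins a Hive-backed clickstream view with a relational customer table through a single VDB; the VDB, schema, table, and column names are invented for the example and are not from the demo.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FederatedJoinExample {
    public static void main(String[] args) throws Exception {
        // Illustrative VDB that federates a Hive source model and a relational source model;
        // all schema and column names here are assumptions for the sketch.
        String url = "jdbc:teiid:MashupVDB@mm://dv-server.example.com:31000";

        String sql =
            "SELECT c.customer_name, SUM(w.page_views) AS total_views " +
            "FROM HiveSource.web_clickstream w " +   // new data landed in the Hadoop data lake
            "JOIN OracleSource.customers c " +       // traditional relational data, left in place
            "  ON w.customer_id = c.customer_id " +
            "GROUP BY c.customer_name " +
            "ORDER BY total_views DESC";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```

Neither source's data is copied; the virtualization server plans the join and pushes work down to Hive and the relational database as appropriate.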

Page 21: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Use Case 2: Federating across Geographically Distributed Hadoop Clusters

Problem: Geographically distributed Hadoop clusters contain sensitive data, like patient records or customer identification, that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.

Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns while federating across Hadoop clusters.

(Slide diagram: Connect, Compose, Consume. Data can be accessed by multiple tools and methods already in-house through JBoss Data Virtualization, which federates Hive in a Hadoop cluster in one geographic region with Hive in a Hadoop cluster in a second geographic region.)

Page 22: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Data for entire organization in Hadoop Data Lake

Problem: How does IT control access and give business users just the data they need?
- Does every line of business have access to everyone’s data?
- How do business users get access to the data they need in a simple (even self-service) way?

(Slide diagram: a Hadoop data lake holding marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data.)

Page 23: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Secure, Self-Service Virtual Data Marts for Hadoop

Solution: Use JBoss Data Virtualization to create virtual data marts on top of a Hadoop cluster
-  Lines of business get access to the data they need in a simple manner
-  IT maintains the process and control it needs
-  All data remains in the data lake, nothing is copied or moved

(Slide diagram: the same Hadoop data lake, now exposed through Marketing, IT, Finance, and Sales virtual data marts over the marketing clickstream data, customer accounts, Twitter sentiment data, server logs, sales transactions, HR employee files, and finance expense reports.)

Page 24: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Optional hierarchical data architectures with virtual data mart

Can be combined with security features like user role access and row and column masking

(Slide diagram: a departmental base virtual database (VDB) layered beneath a Team 1 VDB and a Team 2 VDB, each exposing its own views, View 1 and View 2.)

Page 25: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Want most recent data in an operational data store

Problem: All the legacy and archived data is in the Hadoop data lake. We want to access the most recent, up-to-the-minute operational data often and quickly.

(Slide diagram: the Hadoop data lake holds historical data: marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data.)

Page 26: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Caching For Faster Performance – Materialized View

(Slide diagram: inside a virtual database (VDB), Query 1 and Query 2 both hit a cached or materialized copy of View 1.)

•  Same cached view for multiple queries
•  Refreshed automatically or manually
•  Cache repository can be any supported data source
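As a rough sketch of how such a cached view could be declared, the snippet below shows a Teiid-style view definition with materialization options (the kind of DDL that would live in the VDB's virtual model rather than be run by a client) together with a query that reads the view; the view name, cache table, schema names, and connection details are illustrative assumptions, not taken from the demo.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MaterializedViewExample {

    // Illustrative Teiid-style view DDL carried in the VDB's virtual model: the
    // MATERIALIZED options ask the DV server to keep a refreshable copy of the view
    // in a designated cache table. All names here are assumptions.
    static final String VIEW_DDL =
        "CREATE VIEW SalesSummary (region string, total_sales bigdecimal) " +
        "OPTIONS (MATERIALIZED 'TRUE', MATERIALIZED_TABLE 'CacheDB.sales_summary_mat') AS " +
        "SELECT region, SUM(amount) FROM HiveSource.sales_transactions GROUP BY region;";

    public static void main(String[] args) throws Exception {
        String url = "jdbc:teiid:OperationalVDB@mm://dv-server.example.com:31000";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // Repeated queries against the view are served from the materialized copy
             // until it is refreshed, automatically on a schedule or manually.
             ResultSet rs = stmt.executeQuery("SELECT region, total_sales FROM SalesSummary")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " -> " + rs.getBigDecimal("total_sales"));
            }
        }
    }
}
```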

Page 27: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Want most recent data in an operational data store

Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from multiple diverse data sources so it can be quickly queried.
- Use HDP for all data older than today.
- Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB, as sketched below.

(Slide diagram: an operational VDB with up-to-the-minute data sits alongside the Hadoop data lake of historical data (marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, Twitter sentiment data); a nightly transfer moves data from the sources into the lake, and a materialized view exposes the lake's data to the VDB.)
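Below is a minimal sketch of that combination, assuming the operational VDB federates a relational source holding today's transactions with a materialized, Hive-backed historical view over the data lake; the VDB, schema, view, and column names are illustrative assumptions, not from the demo.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OperationalPlusHistoricalExample {
    public static void main(String[] args) throws Exception {
        // Illustrative VDB: OperationalSource holds today's transactions (relational),
        // SalesHistory is a materialized view backed by Hive data in the lake.
        String url = "jdbc:teiid:OperationalVDB@mm://dv-server.example.com:31000";

        String sql =
            "SELECT customer_id, SUM(amount) AS lifetime_spend " +
            "FROM ( " +
            "  SELECT customer_id, amount FROM OperationalSource.todays_transactions " +
            "  UNION ALL " +
            "  SELECT customer_id, amount FROM SalesHistory " +  // served from the materialized copy
            ") t " +
            "GROUP BY customer_id";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getLong("customer_id") + " -> " + rs.getBigDecimal("lifetime_spend"));
            }
        }
    }
}
```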

Page 28: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Demonstration: Virtual Data Marts with Hadoop Data Lake

Kenny Peeples

Page 29: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Use Case 3 - Overview

Objective: Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data.

Problem: Data lakes have large volumes of consolidated clickstream, product, and customer data that need to be constrained for multi-departmental use.

Solution: Leverage HDP to mash up clickstream analysis data with product and customer data on HDP to answer ... Leverage JBoss Data Virtualization to provide virtual data marts for each of the Marketing and Product teams to ...

Page 30: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Use Case 3 - Architecture

(Slide diagram: APPLICATIONS (business analytics, custom applications, packaged applications) consume a VIRTUAL DATA MART layered over the DATA SYSTEM, HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management), which draws on SOURCES: emerging sources (sensor, sentiment, geo, unstructured) and existing sources (CRM, ERP, clickstream, logs).)

Page 31: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Use Case 3 - Resources

•  GUIDE: How-to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3 (Tutorial: available soon)
•  VIDEOS: http://vimeo.com/user16928011/hwxuc3configuration | http://vimeo.com/user16928011/hwxuc3run | http://vimeo.com/user16928011/hwxuc3overview
•  SOURCE: https://github.com/DataVirtualizationByExample/HortonworksUseCase3

Page 32: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Benefits of JBoss Data Virtualization with Hortonworks HDP 2.1

•  Creates virtual databases for controlling access to data in a data lake while giving lines of business the autonomy they seek
•  Combines new data in Hadoop with data in traditional data sources without moving or copying data
•  Gives access to a variety of BI and analytics tools
•  Provides caching for faster access to data
•  Provides a consistent security policy across multiple data sources

Page 33: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Thank you! Hortonworks and Red Hat JBoss Data Virtualization

Page 34: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Next Steps...

Download the Hortonworks Sandbox

Learn Hadoop

Build Your Analytic App

Try Hadoop 2

More about Red Hat & Hortonworks http://hortonworks.com/partner/redhat

Contact us: [email protected]

Page 35: Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3


Don’t Forget to Register for our Next Webinar!

September 17th, 10 AM PST: Red Hat JBoss Data Virtualization and Hortonworks Data Platform

http://info.hortonworks.com/RedHatSeries_Hortonworks.html