Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3
DESCRIPTION
Red Hat JBoss Data Virtualization and HDP: Enabling the Data Lake (demo and deep dive)
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Q&A box is available for your questions
Webinar will be recorded for future viewing
Thank you for joining!
We’ll get started soon…
Page 2
Deliver the Data Lake (demo/deep dive) …using HDP and Red Hat JBoss Data Virtualization
We do Hadoop.
Page 3
Your speakers…
Raghu Thiagarajan, Dir, Partner Product Management, Hortonworks
Kimberly Palko, Principal Product Manager, Red Hat
Kenny Peeples, Principal Technical Marketing Manager, Red Hat
Page 4
An architectural shift towards an HDP Data Lake
[Diagram: traditional systems (RDBMS, MPP, EDW) plotted on axes of SCALE and SCOPE, alongside a Data Lake enabled by YARN]
Unlocking the Data Lake:
• Single data repository, shared infrastructure
• Multiple business apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
New analytic apps or IT optimization
[HDP 2.1 stack: Governance & Integration, Security, Operations, Data Access, Data Management — all on YARN]
Page 5
What is a Data Lake?
An architectural pattern in the data center: uses Hadoop to deliver deeper insight across a large, broad, diverse set of data efficiently.
• A multipurpose, open PLATFORM for data (NOT a database)
• Land all data in a single place and interact with it in many ways
• Allows the ecosystem to provide higher-level services (SAS, SAP, Microsoft for streaming, MPP, in-memory, etc.)
• First-class data management capabilities (metadata management, security, transformation pipelines, replication, retention, etc.)
Page 6
HDP Data Lake Solution Architecture
[Architecture diagram: an HDP grid of shared compute-and-storage nodes, coordinated by YARN and managed with Ambari.
• Step 1: Extract & Load — ingestion via Sqoop, Flume, WebHDFS, NFS, and Storm (REST, HTTP, streaming, JMS)
• Step 2: Model/Apply Metadata — HCatalog (table and user-defined metadata)
• Step 3: Transform, Aggregate & Materialize — data processing with Hive, Pig, Cascading, MR2, Tez, Storm, and Mahout
• Step 4: Schedule and Orchestrate — Falcon (data pipeline and flow management)
• Manage Steps 1-4: data management with Falcon, security with HDP Advanced Security; unified access controls and audit via Apache Argus
• Use Case Type 1: Materialize & Exchange — exchange to downstream data sources (OLTP HBase, EDW such as Teradata) via the HBase client and Sqoop/Hive
• Use Case Type 2: Explore/Visualize — interactive Hive Server (Tez/Stinger) serving query/analytics/reporting tools (Tableau, Excel, MicroStrategy, Datameer, Platfora, Business Objects); YARN apps open up many new use cases (stream processing, real-time search, MPI, SAS, Solr, graph, etc.)
• Source data: clickstream, sales transactions, product data, marketing/inventory, social data, EDW, NFS]
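One concrete entry point for Step 1 (Extract & Load) is WebHDFS, which exposes HDFS over plain HTTP. As a minimal Python sketch — the hostname and paths here are made up, and the operation names (OPEN, CREATE, LISTSTATUS, ...) come from the WebHDFS REST API — a client builds an ingestion URL like this:

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL (Step 1 ingestion over HTTP).

    host/port are placeholders for an HDP 2.x NameNode; /webhdfs/v1 is
    the WebHDFS path prefix, and op is a WebHDFS operation name.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Hypothetical example: stage a clickstream file into the lake.
print(webhdfs_url("namenode.example.com", "/data/clickstream/part-0000",
                  "CREATE", overwrite="true"))
```

The actual PUT/GET of file contents would follow against the returned URL; this only illustrates how the REST surface maps onto the ingestion step.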
Page 7
HDP Data Lake Solution Architecture + Virtual Data Mart
[Same architecture diagram as the previous slide, with a virtual data mart layered on top: a department-level base virtual database (VDB) exposes Team 1 and Team 2 VDBs, each with its own views (View 1, View 2)]
Page 8
YARN allows for new processing engines
[Same architecture diagram as page 6, repeated to highlight YARN as the host for new processing engines — stream processing, real-time search, MPI, and other YARN apps]
Page 9
Falcon enables Governance of Data Pipelines
[Same architecture diagram as page 6, repeated to highlight Falcon (data pipeline and flow management) as the governance point for data pipelines]
Page 10
Apache Falcon: Data Governance in the Lake
[Diagram: a data pipeline (Raw → Clean → Prep) is defined in Falcon, which auto-generates and orchestrates multiple complex Oozie workflows (Job 1 … Job N) and drives other Hadoop ecosystem tools, e.g. DistCp]
Falcon adds the required data governance features:
• DEFINITION: replication | retention | eviction | late data
• MONITORING
• TRACING: audit | lineage | tagging
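Retention is a good example of what "defined in Falcon" means: a feed declares how long its data lives, and Falcon evicts anything older. As a rough Python illustration of that eviction policy — the partition layout is made up, and Falcon itself works from declarative feed definitions, not an API like this:

```python
from datetime import date, timedelta

def evict(partitions, today, retention_days):
    """Return (kept, evicted) partitions for a feed.

    partitions: hypothetical mapping of partition date -> HDFS path.
    Anything older than the retention window is scheduled for eviction.
    """
    cutoff = today - timedelta(days=retention_days)
    kept = {d: p for d, p in partitions.items() if d >= cutoff}
    evicted = {d: p for d, p in partitions.items() if d < cutoff}
    return kept, evicted

parts = {
    date(2014, 9, 1): "/data/raw/2014-09-01",
    date(2014, 9, 10): "/data/raw/2014-09-10",
    date(2014, 9, 16): "/data/raw/2014-09-16",
}
kept, evicted = evict(parts, today=date(2014, 9, 17), retention_days=7)
```

The value of pushing this into Falcon rather than hand-written jobs is that the same declaration also feeds lineage and audit.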
Page 11
Mashing up diverse data types in the Data Lake
Page 17
Virtual Data Marts with Red Hat JBoss Data Virtualization and Hortonworks HDP
Kimberly Palko
Page 18
Data Supply and Integration Solution
Data Virtualization sits in front of multiple data sources and allows them to be treated as a single source,
✓ delivering the desired data
✓ in the required form
✓ at the right time
✓ to any application and/or user.
THINK: VIRTUAL MACHINE FOR DATA
Page 19
Easy Access to Big Data
• The reporting tool accesses the data virtualization server via a rich SQL dialect
• The data virtualization server translates the rich SQL dialect to HiveQL
• Hive translates HiveQL to MapReduce
• MapReduce runs the job against the big data in HDFS
[Diagram: Analytical Reporting Tool → Data Virtualization Server → Hive → MapReduce → HDFS, inside the Hadoop big data cluster]
Page 20
Use Case 1: Combine data from Hadoop with traditional data sources
Problem: Data from new data sources like social media, clickstream, and sensors needs to be combined with data from traditional sources to get the full value.
Solution: Leverage JBoss Data Virtualization to mash up new data in Hadoop with data in traditional data sources, without moving or copying any data, and access it through a variety of BI tools and SOA technologies.
[Diagram: Connect → Compose → Consume. JBoss Data Virtualization connects via Hive to SOURCE 1 (Hive/Hadoop holding data from new sources like social media, clickstream, and sensor data) and to SOURCE 2 (traditional relational databases in the enterprise); the composed data can be accessed by multiple tools and methods already in-house.]
Page 21
Use Case 2: Federating across Geographically Distributed Hadoop Clusters
Problem: Geographically distributed Hadoop clusters contain sensitive data, like patient records or customer identification, that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.
Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns while federating across the Hadoop clusters.
[Diagram: Connect → Compose → Consume. JBoss Data Virtualization connects via Hive to a Hadoop cluster in one geographic region and to a Hadoop cluster in a second geographic region; the composed data can be accessed by multiple tools and methods already in-house.]
Page 22
Data for the entire organization in a Hadoop Data Lake
Problem: How does IT control access and give business users just the data they need?
- Does every line of business have access to everyone's data?
- How do business users get access to the data they need in a simple (even self-service) way?
[Diagram: a Hadoop Data Lake holding marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data]
Page 23
Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts on top of a Hadoop cluster
- Lines of business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake; nothing is copied or moved
[Diagram: per-department virtual data marts (Marketing, Sales, Finance, HR, IT) over the same Hadoop Data Lake of clickstream data, customer accounts, Twitter sentiment data, server logs, sales transactions, employee files, and expense reports]
Page 24
Optional hierarchical data architectures with virtual data marts
Can be combined with security features like user-role access and row and column masking
[Diagram: a department-level base virtual database (VDB) exposes Team 1 and Team 2 VDBs, each with its own views (View 1, View 2)]
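The hierarchy in the diagram layers narrower views over a shared base. A small Python sketch of that layering — the data and names are invented, and in JBoss Data Virtualization each layer would be a declared VDB or view, not functions:

```python
BASE_VDB = [  # department-level base virtual database (hypothetical rows)
    {"team": "marketing", "campaign": "fall14", "spend": 120},
    {"team": "product", "campaign": "beta", "spend": 45},
]

def team_vdb(team):
    """A team VDB is a constrained slice of the base VDB."""
    return [row for row in BASE_VDB if row["team"] == team]

def spend_view(team):
    """A narrower view layered on the team VDB: campaign -> spend."""
    return {row["campaign"]: row["spend"] for row in team_vdb(team)}

print(spend_view("marketing"))
```

Because every layer resolves back to the same base, IT governs one place while each team sees only its own slice.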
Page 25
Want the most recent data in an operational data store
Problem: All the legacy and archived data is in the Hadoop data lake. We want to access the most recent, up-to-the-minute operational data often and quickly.
[Diagram: the Hadoop Data Lake holds the historical data — marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data]
Page 26
Caching for Faster Performance – Materialized Views
[Diagram: inside a virtual database (VDB), Query 1 and Query 2 both hit the same cached/materialized copy of View 1]
• The same cached view serves multiple queries
• Refreshed automatically or manually
• The cache repository can be any supported data source
Page 27
Want the most recent data in an operational data store
Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from multiple diverse data sources so it can be quickly queried.
- Use HDP for all data older than today
- Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB
[Diagram: the Hadoop Data Lake (historical data — marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, Twitter sentiment data) receives a nightly transfer from the data sources; a materialized view combines it with an operational VDB holding up-to-the-minute data]
Page 28
Demonstration: Virtual Data Marts with a Hadoop Data Lake
Kenny Peeples
Page 29
Use Case 3 - Overview
Objective: Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data.
Problem: Data lakes hold large volumes of consolidated clickstream, product, and customer data that need to be constrained for multi-departmental use.
Solution:
- Leverage HDP to mash up clickstream analysis data with product and customer data on HDP to answer …
- Leverage JBoss Data Virtualization to provide virtual data marts for each of the Marketing and Product teams to …
Page 30
Use Case 3 - Architecture
[Diagram:
• APPLICATIONS: business analytics, custom applications, packaged applications
• DATA SYSTEM: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management) with a VIRTUAL DATA MART layered on top
• SOURCES: emerging sources (sensor, sentiment, geo, unstructured) and existing sources (CRM, ERP, clickstream, logs)]
Page 31
Use Case 3 - Resources
• GUIDE
  How-to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
  Tutorial: Available soon
• VIDEOS
  http://vimeo.com/user16928011/hwxuc3configuration
  http://vimeo.com/user16928011/hwxuc3run
  http://vimeo.com/user16928011/hwxuc3overview
• SOURCE
  https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Page 32
Benefits of JBoss Data Virtualization with Hortonworks HDP 2.1
• Creates virtual databases for controlling access to data in a data lake while giving lines of business the autonomy they seek
• Combines new data in Hadoop with data in traditional data sources without moving or copying data
• Gives access to a variety of BI and analytics tools
• Provides caching for faster access to data
• Provides a consistent security policy across multiple data sources
Page 33
Thank you! Hortonworks and Red Hat JBoss Data Virtualization
Page 34
Next Steps...
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Red Hat & Hortonworks http://hortonworks.com/partner/redhat
Contact us: [email protected]
Page 35
Don’t Forget to Register for our Next Webinar!
September 17th, 10 AM PST
Red Hat JBoss Data Virtualization and Hortonworks Data Platform
http://info.hortonworks.com/RedHatSeries_Hortonworks.html