Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3
DESCRIPTION
Red Hat JBoss Data Virtualization and HDP: Enabling the Data Lake (demo and deep dive)
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Q&A box is available for your questions
Webinar will be recorded for future viewing
Thank you for joining!
We’ll get started soon…
Page 2
Deliver the Data Lake (demo/deep dive) …using HDP and Red Hat JBoss Data Virtualization
We do Hadoop.
Page 3
Your speakers…
Raghu Thiagarajan, Dir, Partner Product Management, Hortonworks
Kimberly Palko, Principal Product Manager, Red Hat
Kenny Peeples, Principal Technical Marketing Manager, Red Hat
Page 4
An architectural shift towards an HDP Data Lake
[Diagram: traditional systems (RDBMS, MPP, EDW) plotted on axes of SCALE and SCOPE, alongside a Data Lake enabled by YARN]
Unlocking the Data Lake:
• Single data repository, shared infrastructure
• Multiple business apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
New analytic apps or IT optimization
[HDP 2.1 stack: Governance & Integration, Security, Operations, Data Access, Data Management — all on YARN]
Page 5
What is a Data Lake?
An architectural pattern in the data center: uses Hadoop to deliver deeper insight across a large, broad, diverse set of data efficiently.
• A multipurpose, open PLATFORM for data (NOT a database)
• Land all data in a single place and interact with it in many ways
• Allows the ecosystem to provide higher-level services (SAS, SAP, Microsoft for streaming, MPP, in-memory, etc.)
• First-class data management capabilities (metadata management, security, transformation pipelines, replication, retention, etc.)
Page 6
HDP Data Lake Solution Architecture
[Architecture diagram: an HDP grid of shared compute-and-storage nodes, coordinated by YARN and managed with Ambari.
• Step 1: Extract & Load — ingestion via Sqoop, Flume, WebHDFS, NFS, and Storm (REST, HTTP, streaming, JMS)
• Step 2: Model/Apply Metadata — HCatalog (table and user-defined metadata)
• Step 3: Transform, Aggregate & Materialize — data processing with Hive, Pig, Cascading, MR2, Tez, Storm, and Mahout
• Step 4: Schedule and Orchestrate — Falcon (data pipeline and flow management)
• Manage Steps 1-4: data management with Falcon, security with HDP Advanced Security; unified access controls and audit via Apache Argus
• Use Case Type 1: Materialize & Exchange — exchange to downstream data sources (OLTP HBase, EDW such as Teradata) via the HBase client and Sqoop/Hive
• Use Case Type 2: Explore/Visualize — interactive Hive Server (Tez/Stinger) serving query/analytics/reporting tools (Tableau, Excel, MicroStrategy, Datameer, Platfora, Business Objects); YARN apps open up many new use cases (stream processing, real-time search, MPI, SAS, Solr, graph, etc.)
• Source data: clickstream, sales transactions, product data, marketing/inventory, social data, EDW, NFS]
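One concrete entry point for Step 1 (Extract & Load) is WebHDFS, which exposes HDFS over plain HTTP. As a minimal Python sketch — the hostname and paths here are made up, and the operation names (OPEN, CREATE, LISTSTATUS, ...) come from the WebHDFS REST API — a client builds an ingestion URL like this:

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL (Step 1 ingestion over HTTP).

    host/port are placeholders for an HDP 2.x NameNode; /webhdfs/v1 is
    the WebHDFS path prefix, and op is a WebHDFS operation name.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Hypothetical example: stage a clickstream file into the lake.
print(webhdfs_url("namenode.example.com", "/data/clickstream/part-0000",
                  "CREATE", overwrite="true"))
```

The actual PUT/GET of file contents would follow against the returned URL; this only illustrates how the REST surface maps onto the ingestion step.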
Page 7
HDP Data Lake Solution Architecture + Virtual Data Mart
[Same architecture diagram as the previous slide, with a virtual data mart layered on top: a department-level base virtual database (VDB) exposes Team 1 and Team 2 VDBs, each with its own views (View 1, View 2)]
Page 8
YARN allows for new processing engines
[Same architecture diagram as page 6, repeated to highlight YARN as the host for new processing engines — stream processing, real-time search, MPI, and other YARN apps]
Page 9
Falcon enables Governance of Data Pipelines
[Same architecture diagram as page 6, repeated to highlight Falcon (data pipeline and flow management) as the governance point for data pipelines]
Page 10
Apache Falcon: Data Governance in the Lake
[Diagram: a data pipeline (Raw → Clean → Prep) is defined in Falcon, which auto-generates and orchestrates multiple complex Oozie workflows (Job 1 … Job N) and drives other Hadoop ecosystem tools, e.g. DistCp]
Falcon adds the required data governance features:
• DEFINITION: replication | retention | eviction | late data
• MONITORING
• TRACING: audit | lineage | tagging
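Retention is a good example of what "defined in Falcon" means: a feed declares how long its data lives, and Falcon evicts anything older. As a rough Python illustration of that eviction policy — the partition layout is made up, and Falcon itself works from declarative feed definitions, not an API like this:

```python
from datetime import date, timedelta

def evict(partitions, today, retention_days):
    """Return (kept, evicted) partitions for a feed.

    partitions: hypothetical mapping of partition date -> HDFS path.
    Anything older than the retention window is scheduled for eviction.
    """
    cutoff = today - timedelta(days=retention_days)
    kept = {d: p for d, p in partitions.items() if d >= cutoff}
    evicted = {d: p for d, p in partitions.items() if d < cutoff}
    return kept, evicted

parts = {
    date(2014, 9, 1): "/data/raw/2014-09-01",
    date(2014, 9, 10): "/data/raw/2014-09-10",
    date(2014, 9, 16): "/data/raw/2014-09-16",
}
kept, evicted = evict(parts, today=date(2014, 9, 17), retention_days=7)
```

The value of pushing this into Falcon rather than hand-written jobs is that the same declaration also feeds lineage and audit.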
Page 11
Mashing up diverse data types in the Data Lake
Page 17
Virtual Data Marts with Red Hat JBoss Data Virtualization and Hortonworks HDP
Kimberly Palko
Page 18
Data Supply and Integration Solution
Data Virtualization sits in front of multiple data sources and allows them to be treated as a single source,
✓ delivering the desired data
✓ in the required form
✓ at the right time
✓ to any application and/or user.
THINK: VIRTUAL MACHINE FOR DATA
Page 19
Easy Access to Big Data
• The reporting tool accesses the data virtualization server via a rich SQL dialect
• The data virtualization server translates the rich SQL dialect to HiveQL
• Hive translates HiveQL to MapReduce
• MapReduce runs the job against the big data in HDFS
[Diagram: Analytical Reporting Tool → Data Virtualization Server → Hive → MapReduce → HDFS, inside the Hadoop big data cluster]
Page 20
Use Case 1: Combine data from Hadoop with traditional data sources
Problem: Data from new data sources like social media, clickstream, and sensors needs to be combined with data from traditional sources to get the full value.
Solution: Leverage JBoss Data Virtualization to mash up new data in Hadoop with data in traditional data sources, without moving or copying any data, and access it through a variety of BI tools and SOA technologies.
[Diagram: Connect → Compose → Consume. JBoss Data Virtualization connects via Hive to SOURCE 1 (Hive/Hadoop holding data from new sources like social media, clickstream, and sensor data) and to SOURCE 2 (traditional relational databases in the enterprise); the composed data can be accessed by multiple tools and methods already in-house.]
Page 21
Use Case 2: Federating across Geographically Distributed Hadoop Clusters
Problem: Geographically distributed Hadoop clusters contain sensitive data, like patient records or customer identification, that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.
Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns while federating across the Hadoop clusters.
[Diagram: Connect → Compose → Consume. JBoss Data Virtualization connects via Hive to a Hadoop cluster in one geographic region and to a Hadoop cluster in a second geographic region; the composed data can be accessed by multiple tools and methods already in-house.]
Page 22
Data for the entire organization in a Hadoop Data Lake
Problem: How does IT control access and give business users just the data they need?
- Does every line of business have access to everyone's data?
- How do business users get access to the data they need in a simple (even self-service) way?
[Diagram: a Hadoop Data Lake holding marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data]
Page 23
Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts on top of a Hadoop cluster
- Lines of business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake; nothing is copied or moved
[Diagram: per-department virtual data marts (Marketing, Sales, Finance, HR, IT) over the same Hadoop Data Lake of clickstream data, customer accounts, Twitter sentiment data, server logs, sales transactions, employee files, and expense reports]
Page 24
Optional hierarchical data architectures with virtual data marts
Can be combined with security features like user-role access and row and column masking
[Diagram: a department-level base virtual database (VDB) exposes Team 1 and Team 2 VDBs, each with its own views (View 1, View 2)]
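The hierarchy in the diagram layers narrower views over a shared base. A small Python sketch of that layering — the data and names are invented, and in JBoss Data Virtualization each layer would be a declared VDB or view, not functions:

```python
BASE_VDB = [  # department-level base virtual database (hypothetical rows)
    {"team": "marketing", "campaign": "fall14", "spend": 120},
    {"team": "product", "campaign": "beta", "spend": 45},
]

def team_vdb(team):
    """A team VDB is a constrained slice of the base VDB."""
    return [row for row in BASE_VDB if row["team"] == team]

def spend_view(team):
    """A narrower view layered on the team VDB: campaign -> spend."""
    return {row["campaign"]: row["spend"] for row in team_vdb(team)}

print(spend_view("marketing"))
```

Because every layer resolves back to the same base, IT governs one place while each team sees only its own slice.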
Page 25
Want the most recent data in an operational data store
Problem: All the legacy and archived data is in the Hadoop data lake. We want to access the most recent, up-to-the-minute operational data often and quickly.
[Diagram: the Hadoop Data Lake holds the historical data — marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, and Twitter sentiment data]
Page 26
Caching for Faster Performance – Materialized Views
[Diagram: inside a virtual database (VDB), Query 1 and Query 2 both hit the same cached/materialized copy of View 1]
• The same cached view serves multiple queries
• Refreshed automatically or manually
• The cache repository can be any supported data source
Page 27
Want the most recent data in an operational data store
Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from multiple diverse data sources so it can be quickly queried.
- Use HDP for all data older than today
- Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB
[Diagram: the Hadoop Data Lake (historical data — marketing clickstream data, finance expense reports, HR employee files, server logs, sales transactions, customer accounts, Twitter sentiment data) receives a nightly transfer from the data sources; a materialized view combines it with an operational VDB holding up-to-the-minute data]
Page 28
Demonstration: Virtual Data Marts with a Hadoop Data Lake
Kenny Peeples
Page 29
Use Case 3 - Overview
Objective: Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data.
Problem: Data lakes hold large volumes of consolidated clickstream, product, and customer data that need to be constrained for multi-departmental use.
Solution:
- Leverage HDP to mash up clickstream analysis data with product and customer data on HDP to answer …
- Leverage JBoss Data Virtualization to provide virtual data marts for each of the Marketing and Product teams to …
Page 30
Use Case 3 - Architecture
[Diagram:
• APPLICATIONS: business analytics, custom applications, packaged applications
• DATA SYSTEM: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management) with a VIRTUAL DATA MART layered on top
• SOURCES: emerging sources (sensor, sentiment, geo, unstructured) and existing sources (CRM, ERP, clickstream, logs)]
Page 31
Use Case 3 - Resources
• GUIDE
  How-to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
  Tutorial: Available soon
• VIDEOS
  http://vimeo.com/user16928011/hwxuc3configuration
  http://vimeo.com/user16928011/hwxuc3run
  http://vimeo.com/user16928011/hwxuc3overview
• SOURCE
  https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Page 32
Benefits of JBoss Data Virtualization with Hortonworks HDP 2.1
• Creates virtual databases for controlling access to data in a data lake while giving lines of business the autonomy they seek
• Combines new data in Hadoop with data in traditional data sources without moving or copying data
• Gives access to a variety of BI and analytics tools
• Provides caching for faster access to data
• Provides a consistent security policy across multiple data sources
Page 33
Thank you! Hortonworks and Red Hat JBoss Data Virtualization
Page 34
Next Steps...
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Red Hat & Hortonworks http://hortonworks.com/partner/redhat
Contact us: [email protected]
Page 35
Don’t Forget to Register for our Next Webinar!
September 17th, 10 AM PST
Red Hat JBoss Data Virtualization and Hortonworks Data Platform
http://info.hortonworks.com/RedHatSeries_Hortonworks.html