denodo datafest 2016: logical data lakes: performance considerations
TRANSCRIPT
O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A
#DenodoDataFest
RAPID, AGILE DATA STRATEGIESFor Accelerating Analytics, Cloud, and Big Data Initiatives.
Logical Data Lakes: Performance Considerations
Dr. Alberto Pan
Denodo, CTO
Agenda1. Logical Architectures for Analytics: The Role of Data Virtualization
2. Example Scenario: The Numbers
3. Performance: The “Move Processing to the Data” Paradigm
4. Performance: Cost-based Optimization in Logical Architectures
Logical Architectures for Analytics
The Role of Data Virtualization
5
The Logical Data Warehouse
6
Logical Data LakeReal-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data Mining
Data Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
Metadata Management, Data Governance, Data Security
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Real-Time Data Access (On-Demand / Streaming)
Batch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
7
The Logical Data Warehouse
8
The Logical Data Warehouse: Data Virtualization
9
Logical Data LakeReal-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data Mining
Data Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
Metadata Management, Data Governance, Data Security
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Real-Time Data Access (On-Demand / Streaming)
Batch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
10
Logical Data Lake: Data Virtualization
Real-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data MiningData Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Data Virtualization
Real-Time Data Access (On-Demand / Streaming)
Data Caching
Data
Serv
ices
Data Search & Discovery
Governance
Security
Optimization
Data
Abstr
action
Data
Tra
nsfo
rmation
Data
Federa
tionBatch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
11
The Role of Data Virtualization
Combines data from several systems and publish it to the desired format with a few clicks
Denodo Data Virtualization is the only option verifying:
12
Example: Distributed Report
Demo Scenario of LDW
Sales Data (TPC-DS)(280M)
Customer (TPC-DS)(2M)
MDM
12
SQL on Hadoop
Total Sales by Customer
13
The Role of Data Virtualization
Combines data from several systems and publish it to the desired format with a few clicks
Abstracts applications from changes in the underlying infrastructure
Expose different logical views over the same data
Single entry point to apply Security and Governance
Denodo Data Virtualization is the only option verifying:
14
DW + MDM Dimensions
Common LDW Patterns
TimeDimension
Fact table(sales) Product
Dimension
Retailer Dimension
EDW MDM
14
DW Historical Offloading (Cold Data Storage)
Common LDW Patterns
Time Dimension Fact table(sales) Product Dimension
Retailer Dimension
EDW Hadoop
Current Sales Historical Sales
15
DW + Cloud Dimensional Data
Common LDW Patterns
Time Dimension Fact table(sales) Product Dimension
Customer Dimension
EDW CRM
SFDC Customer
16
Data Warehouse Federation
Common LDW Patterns
Time Dimensi
on
Fact table(sales)
CustomerDimension
Region
EDW
City
EDW
Fact table(Fidelity)
Customer Dimension
17
Example Scenario: The Numbers
19
Denodo has done extensive testing using queries from the standard benchmarking test
TPC-DS* and the following scenario
Compares the performance of a federated approach in Denodo with an MPP system where
all the data has been replicated via ETL
Customer Dim.2 M rows
Sales Facts290 M rows
Items Dim.400 K rows
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
vs.Sales Facts290 M rows
Items Dim.400 K rows
Customer Dim.2 M rows
Denodo 6.0 ArchitecturePerformance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
20
Denodo 6.0 Architecture
Query DescriptionReturned
RowsTime Netezza
Time Denodo (Federated Oracle,
Netezza & SQL Server)
Optimization Technique (automatically selected)
Total sales by customer 1,99 M 20.9 sec. 21.4 sec. Full aggregation push-down
Total sales by customer and year between 2000 and 2004
5,51 M 52.3 sec. 59.0 sec Full aggregation push-down
Total sales by item brand 31,35 K 4.7 sec. 5.0 sec. Partial aggregation push-down
Total sales by item where sale price less than current
list price17,05 K 3.5 sec. 5.2 sec On the fly data movement
Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
Performance
The “Move Processing to the Data” Paradigm
22
Move Processing to the Data
Process the data where it resides
Process the data locally where it resides
DV System combines partial results
Minimizes network traffic
Leverages specialized data sources
23
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Naive Strategy)
Naive Strategy: 350M rows moved through the network
24
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Move Processing to the Data)
Denodo Strategy: 30k rows moved through the network
25
Move Processing to the Data: Example 1 (Alternative)
Obtain Total Sales By Product (Dimension Table Replicated)
Denodo Strategy: Dimension Table Replicated
Denodo Strategy: 20k rows moved through the network
26
Move Processing to the Data: Example 2
Execution Strategy: Full aggregation pushdown not possible Two possible techniques: - On-the-fly data movement- Partial aggregation pushdown
Maximum Sales Discount By Product in The last year
Sales Discount: list_price (Product) – sale_price (Sales)
27
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: On-the-fly Data Movement
Move Products Data to a Temp table in the DW : 20K rows moved through the network + 10K
rows inserted in the DW
Execute full query on the DW: 10k rows through the network
28
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown
Products DB: 10K rows through the network
Data Warehouse: #rows through the network = 10K * average
#sale_prices_per_product
Performance
Cost-based Optimization in Logical Architectures
30
How to Choose the Best Execution Plan?Cost-Based Optimization in Data Virtualization
Data statistics to estimate size of intermediate result sets
Data Source Indexes (and other physical structures)
Execution Model of data sources: e.g. Parallel Databases VS Hadoop clusters VS Relational Databases
Features of data sources (e.g. number of processing cores in parallel database or Hadoop Cluster)
Data Transfer rate
Must take into account:
Q&AFind more details at: datavirtualization.bloghttp://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
Thank you!
© Copyright Denodo Technologies. All rights reservedUnless otherwise specified, no part of this PDF file may be reproduced or utilized in any for or by any means, electronic or mechanical, including photocopying and microfilm, without prior the written authorization from Denodo Technologies.
O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A
#DenodoDataFest