denodo datafest 2016: logical data lakes: performance considerations

32
OCTOBER 18,2016 SAN FRANCISCO BAY AREA, CA #DenodoDataFest RAPID, AGILE DATA STRATEGIES For Accelerating Analytics, Cloud, and Big Data Initiatives.

Upload: denodo

Post on 22-Jan-2017

17 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A

#DenodoDataFest

RAPID, AGILE DATA STRATEGIESFor Accelerating Analytics, Cloud, and Big Data Initiatives.

Page 2: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Logical Data Lakes: Performance Considerations

Dr. Alberto Pan

Denodo, CTO

Page 3: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Agenda1. Logical Architectures for Analytics: The Role of Data Virtualization

2. Example Scenario: The Numbers

3. Performance: The “Move Processing to the Data” Paradigm

4. Performance: Cost-based Optimization in Logical Architectures

Page 4: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Logical Architectures for Analytics

The Role of Data Virtualization

Page 5: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

5

The Logical Data Warehouse

Page 6: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

6

Logical Data LakeReal-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data Mining

Data Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

Metadata Management, Data Governance, Data Security

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Real-Time Data Access (On-Demand / Streaming)

Batch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

Page 7: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

7

The Logical Data Warehouse

Page 8: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

8

The Logical Data Warehouse: Data Virtualization

Page 9: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

9

Logical Data LakeReal-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data Mining

Data Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

Metadata Management, Data Governance, Data Security

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Real-Time Data Access (On-Demand / Streaming)

Batch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

Page 10: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

10

Logical Data Lake: Data Virtualization

Real-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data MiningData Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Data Virtualization

Real-Time Data Access (On-Demand / Streaming)

Data Caching

Data

Serv

ices

Data Search & Discovery

Governance

Security

Optimization

Data

Abstr

action

Data

Tra

nsfo

rmation

Data

Federa

tionBatch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

Page 11: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

11

The Role of Data Virtualization

Combines data from several systems and publish it to the desired format with a few clicks

Denodo Data Virtualization is the only option verifying:

Page 12: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

12

Example: Distributed Report

Demo Scenario of LDW

Sales Data (TPC-DS)(280M)

Customer (TPC-DS)(2M)

MDM

12

SQL on Hadoop

Total Sales by Customer

Page 13: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

13

The Role of Data Virtualization

Combines data from several systems and publish it to the desired format with a few clicks

Abstracts applications from changes in the underlying infrastructure

Expose different logical views over the same data

Single entry point to apply Security and Governance

Denodo Data Virtualization is the only option verifying:

Page 14: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

14

DW + MDM Dimensions

Common LDW Patterns

TimeDimension

Fact table(sales) Product

Dimension

Retailer Dimension

EDW MDM

14

Page 15: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

DW Historical Offloading (Cold Data Storage)

Common LDW Patterns

Time Dimension Fact table(sales) Product Dimension

Retailer Dimension

EDW Hadoop

Current Sales Historical Sales

15

Page 16: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

DW + Cloud Dimensional Data

Common LDW Patterns

Time Dimension Fact table(sales) Product Dimension

Customer Dimension

EDW CRM

SFDC Customer

16

Page 17: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Data Warehouse Federation

Common LDW Patterns

Time Dimensi

on

Fact table(sales)

CustomerDimension

Region

EDW

City

EDW

Fact table(Fidelity)

Customer Dimension

17

Page 18: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Example Scenario: The Numbers

Page 19: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

19

Denodo has done extensive testing using queries from the standard benchmarking test

TPC-DS* and the following scenario

Compares the performance of a federated approach in Denodo with an MPP system where

all the data has been replicated via ETL

Customer Dim.2 M rows

Sales Facts290 M rows

Items Dim.400 K rows

* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.

vs.Sales Facts290 M rows

Items Dim.400 K rows

Customer Dim.2 M rows

Denodo 6.0 ArchitecturePerformance Comparison – Logical Data Warehouse vs. Physical Data Warehouse

Page 20: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

20

Denodo 6.0 Architecture

Query DescriptionReturned

RowsTime Netezza

Time Denodo (Federated Oracle,

Netezza & SQL Server)

Optimization Technique (automatically selected)

Total sales by customer 1,99 M 20.9 sec. 21.4 sec. Full aggregation push-down

Total sales by customer and year between 2000 and 2004

5,51 M 52.3 sec. 59.0 sec Full aggregation push-down

Total sales by item brand 31,35 K 4.7 sec. 5.0 sec. Partial aggregation push-down

Total sales by item where sale price less than current

list price17,05 K 3.5 sec. 5.2 sec On the fly data movement

Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse

Page 21: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Performance

The “Move Processing to the Data” Paradigm

Page 22: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

22

Move Processing to the Data

Process the data where it resides

Process the data locally where it resides

DV System combines partial results

Minimizes network traffic

Leverages specialized data sources

Page 23: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

23

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Naive Strategy)

Naive Strategy: 350M rows moved through the network

Page 24: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

24

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Move Processing to the Data)

Denodo Strategy: 30k rows moved through the network

Page 25: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

25

Move Processing to the Data: Example 1 (Alternative)

Obtain Total Sales By Product (Dimension Table Replicated)

Denodo Strategy: Dimension Table Replicated

Denodo Strategy: 20k rows moved through the network

Page 26: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

26

Move Processing to the Data: Example 2

Execution Strategy: Full aggregation pushdown not possible Two possible techniques: - On-the-fly data movement- Partial aggregation pushdown

Maximum Sales Discount By Product in The last year

Sales Discount: list_price (Product) – sale_price (Sales)

Page 27: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

27

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: On-the-fly Data Movement

Move Products Data to a Temp table in the DW : 20K rows moved through the network + 10K

rows inserted in the DW

Execute full query on the DW: 10k rows through the network

Page 28: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

28

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown

Products DB: 10K rows through the network

Data Warehouse: #rows through the network = 10K * average

#sale_prices_per_product

Page 29: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Performance

Cost-based Optimization in Logical Architectures

Page 30: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

30

How to Choose the Best Execution Plan?Cost-Based Optimization in Data Virtualization

Data statistics to estimate size of intermediate result sets

Data Source Indexes (and other physical structures)

Execution Model of data sources: e.g. Parallel Databases VS Hadoop clusters VS Relational Databases

Features of data sources (e.g. number of processing cores in parallel database or Hadoop Cluster)

Data Transfer rate

Must take into account:

Page 31: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Q&AFind more details at: datavirtualization.bloghttp://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/

Page 32: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations

Thank you!

© Copyright Denodo Technologies. All rights reservedUnless otherwise specified, no part of this PDF file may be reproduced or utilized in any for or by any means, electronic or mechanical, including photocopying and microfilm, without prior the written authorization from Denodo Technologies.

O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A

#DenodoDataFest