Using Data Virtualization to Accelerate Time-to-Value From Your Data
Integrating Distributed Data in Real Time
Speaker
Paul Moxon
VP Data Architectures
and Chief Evangelist
@ Denodo Technologies
Data, Data Everywhere, And Not a Thought to Think
Agile Analytics Architecture
Data Pipeline Problem
Data Discovery → Data Extraction → Data Pre-processing → Data Analysis → Decision Making

Data Discovery & Preparation: 70-80% of the effort | Analysis & Actions: 20-30%
Data Pipeline Problem
Data Preparation → Data Analysis → Decision Making

Data Preparation: 50-60% of the effort | Analysis & Actions: 40-50%
Agile Analytics Architecture - Revisited
DATA VIRTUALIZATION
What is Data Virtualization?
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications

DISPARATE DATA SOURCES: Databases & Warehouses, Cloud/SaaS Applications, Big Data, NoSQL, Web, XML, Excel, PDF, Word... (ranging from more structured to less structured)

DATA CONSUMERS: Enterprise Applications, Reporting, BI, Portals, ESB, Mobile, Web, Users (analytical and operational)

CONNECT: multiple protocols and formats, including SQL, MDX, Web Services, Big Data APIs, and web automation and indexing
COMBINE: discover, transform, prepare, improve quality, integrate, producing normalized views of disparate data
CONSUME (PUBLISH): query, search, browse; request/reply and event-driven delivery; secure delivery; share, deliver, publish, govern, collaborate
“Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
How Does It Work?
Sources → Base Views (source abstraction) → Combine, Transform & Integrate → Publish

  Sources: RDBMS/EDW, S3 bucket, Salesforce, multidimensional sources, Hadoop, REST web services, web sites
  Base views (source abstraction): Client, Address, ClientType, Company, Invoicing, ServiceUsage, Product, Logs, WebIncidents
  Integrated views: Customer, Invoice, Product, Service Usage, Incident, combined into a Customer 360° view
  Publish: SQL, SOAP, REST, OData, etc., plus Denodo's information self-service

Data Virtualization Platform
Data Virtualization Connects the Users to the Data That They Need
1. Data Virtualization allows you to connect to any data source
2. You can combine and transform that data into the format needed by the consumer
3. The data can be exposed to consumers through a format and interface that is usable by them
• Consumers typically keep using the tools they already know – they don't have to learn new tools and skills to access the data
4. All of this can be done without copying or moving the data
• The data stays in the original sources (databases, applications, files, etc.) and is retrieved on demand, in real time
Cliff Notes version (TL;DR)
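The connect / combine / consume idea above can be made concrete with a toy Python sketch: one source is a relational database (in-memory SQLite), the other a SaaS application (a parsed JSON payload), and a "virtual view" combines them at query time without copying anything into a warehouse. All names here (`customer_360`, `crm_incidents`) are illustrative, not Denodo APIs.

```python
import sqlite3

# Hedged sketch of connect / combine / consume: one source is a relational
# database (in-memory SQLite), the other a SaaS incident feed (a parsed JSON
# payload). A 'virtual view' combines them at query time; nothing is copied
# into a warehouse. All names here are illustrative, not Denodo APIs.

# Source 1: connect to a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)",
               [(1, "Acme Corp"), (2, "Globex")])

# Source 2: connect to a SaaS incident feed (simulated).
crm_incidents = [{"customer_id": 1, "severity": "high"},
                 {"customer_id": 1, "severity": "low"},
                 {"customer_id": 2, "severity": "high"}]

def customer_360(customer_id):
    """Combine both sources on demand: the 'virtual view'."""
    name = db.execute("SELECT name FROM customer WHERE id = ?",
                      (customer_id,)).fetchone()[0]
    open_incidents = sum(1 for i in crm_incidents
                         if i["customer_id"] == customer_id)
    return {"name": name, "open_incidents": open_incidents}

# Consume: any client just calls the view.
print(customer_360(1))  # {'name': 'Acme Corp', 'open_incidents': 2}
```

The point of the sketch is the fourth bullet: the data never leaves its sources; the view is computed on demand.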
Example using Microsoft Power BI
Accessing data for Reports and Dashboards
OK… What About Performance? (The first question that everyone asks)
1. Query Delegation
Moving the processing to the data
2. Advanced query rewriting for analytical queries
Partial aggregation pushdown, JOIN-UNION reordering, branch pruning, etc.
3. Offloading of processing to MPP cluster
Take advantage of your Hadoop or Spark cluster
4. Caching
Cache data from slow data sources (‘Temporary materialization’)
The cache can be your Hadoop or Spark cluster
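Option 4 can be sketched in a few lines of Python: the first query against a slow source materializes the result into a local cache table ("temporary materialization"), and later queries are served locally. SQLite stands in for the cache store, which per the slide could be a Hadoop or Spark cluster; all names are illustrative.

```python
import sqlite3
import time

# Hedged sketch of 'temporary materialization' (caching). SQLite stands in
# for the cache store; names (sales_by_region, sales_cache) are illustrative.

cache = sqlite3.connect(":memory:")

def slow_source_query():
    """Simulate a slow data source, e.g. a rate-limited SaaS API."""
    time.sleep(0.2)
    return [(1, "EMEA", 1200.0), (2, "APAC", 950.0)]

def sales_by_region():
    # The first call materializes the source result into a local table;
    # every later call is served from the cache without touching the source.
    populated = cache.execute(
        "SELECT name FROM sqlite_master WHERE name = 'sales_cache'").fetchone()
    if not populated:
        cache.execute(
            "CREATE TABLE sales_cache (id INTEGER, region TEXT, amount REAL)")
        cache.executemany(
            "INSERT INTO sales_cache VALUES (?, ?, ?)", slow_source_query())
    return cache.execute(
        "SELECT region, amount FROM sales_cache ORDER BY id").fetchall()
```

A real platform would also handle cache expiry and refresh; the sketch only shows the materialize-once, read-many pattern.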
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME
Star schema: the fact table (sales) plus the time, product, and retailer dimensions, spread across the EDW and MDM sources.

  SELECT retailer.name,
         product.name,
         SUM(sales.amount)
  FROM sales
    JOIN retailer ON sales.retailer_fk = retailer.id
    JOIN product ON sales.product_fk = product.id
    JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
    AND product.brand = 'ACME'
  GROUP BY product.name, retailer.name
Data Virtualization Platform
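The query on this slide can be exercised end to end against a tiny in-memory star schema. In the sketch below, SQLite stands in for the federated EDW/MDM sources (so add_months becomes SQLite date arithmetic); table and column names follow the slide, while the sample rows are invented for illustration.

```python
import sqlite3

# Runnable sketch of the slide's logical-data-warehouse query. SQLite stands
# in for the federated sources; the sample rows are made up for illustration.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE retailer (id INTEGER, name TEXT);
CREATE TABLE product  (id INTEGER, name TEXT, brand TEXT);
CREATE TABLE time     (id INTEGER, date TEXT);
CREATE TABLE sales    (retailer_fk INTEGER, product_fk INTEGER,
                       time_fk INTEGER, amount REAL);
INSERT INTO retailer VALUES (1, 'MegaMart'), (2, 'ShopRite');
INSERT INTO product  VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'Other');
INSERT INTO time     VALUES (1, date('now', '-10 days')),
                            (2, date('now', '-60 days'));
INSERT INTO sales VALUES (1, 1, 1, 100.0), (1, 1, 1, 50.0),
                         (2, 1, 1, 75.0),  (1, 2, 1, 999.0),
                         (1, 1, 2, 500.0);
""")

rows = con.execute("""
    SELECT retailer.name, product.name, SUM(sales.amount)
    FROM sales
      JOIN retailer ON sales.retailer_fk = retailer.id
      JOIN product  ON sales.product_fk  = product.id
      JOIN time     ON sales.time_fk     = time.id
    WHERE time.date > date('now', '-1 month')   -- 'during the last month'
      AND product.brand = 'ACME'
    GROUP BY product.name, retailer.name
""").fetchall()
print(rows)  # only last-month sales of ACME products survive the filters
```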
Query Before Optimization
Without optimization, the platform fetches each table separately and performs all of the JOINs and the final GROUP BY itself:

  SELECT sales.retailer_fk, sales.product_fk,
         sales.time_fk, sales.amount
  FROM sales
  (300,000,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

  SELECT time.date, time.id
  FROM time
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  (30 rows)

JOIN → JOIN → JOIN (10,000,000 rows) → GROUP BY product.name, retailer.name
Step 1 – Apply JOIN Re-ordering to Maximize Delegation
Re-ordering the JOINs lets the join between sales and time, along with its date filter, be delegated to the source:

  SELECT sales.retailer_fk, sales.product_fk, sales.amount
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  (30,000,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

JOIN → JOIN (10,000,000 rows) → GROUP BY product.name, retailer.name
Step 2 – Partial Aggregation Pushdown
The JOINs are on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, so the partial aggregation push-down optimization can be applied:

  SELECT sales.retailer_fk, sales.product_fk,
         SUM(sales.amount)
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  GROUP BY sales.retailer_fk, sales.product_fk
  (10,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

JOIN → JOIN (1,000 rows) → GROUP BY product.name, retailer.name
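The equivalence behind this step is easy to verify: summing per-key partial sums gives the same answer as summing after the join, because each fact row matches exactly one dimension row on its foreign key. A minimal SQLite sketch (illustrative two-table schema, not the slide's full star schema):

```python
import sqlite3

# Sketch of partial aggregation push-down: pre-aggregating the fact table by
# its foreign key at the source, then joining the much smaller partial result
# to the dimension, gives the same answer as aggregating after the join.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (id INTEGER, name TEXT);
CREATE TABLE sales   (product_fk INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Anvil'), (2, 'Rocket');
INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 25.0), (2, 25.0);
""")

# Naive plan: ship every fact row, aggregate after the join.
naive = con.execute("""
    SELECT product.name, SUM(sales.amount)
    FROM sales JOIN product ON sales.product_fk = product.id
    GROUP BY product.name
""").fetchall()

# Pushed-down plan: the source pre-aggregates to one row per key, so only
# as many rows as there are distinct keys cross the network.
pushed = con.execute("""
    SELECT product.name, SUM(partial.total)
    FROM (SELECT product_fk, SUM(amount) AS total
          FROM sales GROUP BY product_fk) AS partial
    JOIN product ON partial.product_fk = product.id
    GROUP BY product.name
""").fetchall()

assert naive == pushed  # same result, far fewer rows moved
```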
Step 3 – Choose Best JOIN Methods
The optimizer selects the right JOIN strategy for each branch based on costs from the data volume estimations: a HASH JOIN for the retailer dimension and a NESTED JOIN for the product dimension, which pushes the filtered product keys into the delegated query as an IN list:

  SELECT sales.retailer_fk, sales.product_fk,
         SUM(sales.amount)
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
    AND sales.product_fk IN (1, 2, …)
  GROUP BY sales.retailer_fk, sales.product_fk
  (1,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows, HASH JOIN)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows, drives the NESTED JOIN's IN list)

NESTED JOIN → HASH JOIN (1,000 rows) → GROUP BY product.name, retailer.name
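The nested-join strategy amounts to two round trips: evaluate the small, selective dimension branch first, then push its keys into the fact-table query as an IN list, mirroring the WHERE ... IN (1, 2, …) step on the slide. A hedged Python sketch, with SQLite standing in for the sources and illustrative names throughout:

```python
import sqlite3

# Sketch of a NESTED JOIN with IN-list push-down: fetch the filtered
# dimension keys first, then restrict the large fact-table scan at the
# source so it only returns matching rows. Names are illustrative.

source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE product (id INTEGER, name TEXT, brand TEXT);
CREATE TABLE sales   (product_fk INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'Other');
INSERT INTO sales VALUES (1, 100.0), (2, 999.0), (1, 50.0);
""")

# Step 1: evaluate the small, selective branch first.
keys = [r[0] for r in source.execute(
    "SELECT id FROM product WHERE brand = 'ACME'")]

# Step 2: push its keys into the large branch as an IN list, so the
# source filters before anything crosses the network.
placeholders = ",".join("?" for _ in keys)
rows = source.execute(
    f"SELECT product_fk, amount FROM sales "
    f"WHERE product_fk IN ({placeholders})", keys).fetchall()
```

This strategy pays off when the driving branch is small; with a large key set, the HASH JOIN alternative avoids an unwieldy IN list.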
Leveraging the Power of a Hadoop Cluster
Sources: Current Sales (68 M rows), Hist. Sales (220 M rows), Customer (2 M rows). The sales data is joined with customers and aggregated by state; the partial aggregation (group by customer ID) at the sources yields 2 M rows of sales by customer.

1. Partial aggregation push-down: maximizes source processing and dramatically reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP cluster.
3. On-demand data transfer: the DV platform automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a table already in the MPP cluster.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based clusters.

System   | Execution Time | Optimization Techniques
Others   | ~19 min        | Simple federation
No MPP   | 43 sec         | Aggregation push-down
With MPP | 26 sec         | Aggregation push-down + MPP integration (Impala, 4 nodes)
Example using Zeppelin Analytics Notebook
Accessing data for analytics and ML
Three Key Takeaways
FIRST Takeaway
Data users have access to a vast array of data and the means to process that data to gain insights… the bottleneck is finding, gathering, and preparing the data.

SECOND Takeaway
Up to 80% of a user's time is spent preparing the data rather than analyzing it. Reducing this time increases the valuable analysis and insights that they deliver.

THIRD Takeaway
Data Virtualization is a technology that lets a wide variety of users quickly and easily find, prepare, and access data from a vast array of data sources for their analytical and ML models.
Thanks!
www.denodo.com [email protected]
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.