Using Data Virtualization to Accelerate Time-to-Value From Your Data
Integrating Distributed Data in Real Time
Speaker
Paul Moxon
VP Data Architectures
and Chief Evangelist
@ Denodo Technologies
Data, Data Everywhere, And Not a Thought to Think
Agile Analytics Architecture
Data Pipeline Problem
Data Discovery → Data Extraction → Data Pre-processing → Data Analysis → Decision Making

Data Discovery & Preparation: 70-80% of the effort | Analysis & Actions: 20-30%
Data Pipeline Problem
Data Preparation → Data Analysis → Decision Making

Data Preparation: 50-60% of the effort | Analysis & Actions: 40-50%
Agile Analytics Architecture - Revisited
DATA VIRTUALIZATION
What is Data Virtualization?
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications

DISPARATE DATA SOURCES: Databases & Warehouses, Cloud/SaaS Applications, Big Data, NoSQL, Web, XML, Excel, PDF, Word... (ranging from more structured to less structured)

DATA CONSUMERS: Enterprise Applications, Reporting, BI, Portals, ESB, Mobile, Web, Users (analytical and operational)

CONNECT: multiple protocols and formats, including SQL, MDX, Web Services, Big Data APIs, and web automation and indexing
COMBINE: discover, transform, prepare, improve quality, integrate, producing normalized views of disparate data
CONSUME (PUBLISH): query, search, browse; request/reply and event-driven delivery; secure delivery; share, deliver, publish, govern, collaborate
“Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data.”
– Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
How Does It Work?
Sources → Base Views (source abstraction) → Combine, Transform & Integrate → Publish

  Sources: RDBMS/EDW, S3 bucket, Salesforce, multidimensional sources, Hadoop, REST web services, web sites
  Base views (source abstraction): Client, Address, ClientType, Company, Invoicing, ServiceUsage, Product, Logs, WebIncidents
  Integrated views: Customer, Invoice, Product, Service Usage, Incident, combined into a Customer 360° view
  Publish: SQL, SOAP, REST, OData, etc., plus Denodo's information self-service

Data Virtualization Platform
Data Virtualization Connects the Users to the Data That They Need
1. Data Virtualization allows you to connect to any data source
2. You can combine and transform that data into the format needed by the consumer
3. The data can be exposed to consumers through a format and interface that is usable by them
• Consumers typically keep using the tools they already know – they don't have to learn new tools and skills to access the data
4. All of this can be done without copying or moving the data
• The data stays in the original sources (databases, applications, files, etc.) and is retrieved on demand, in real time
Cliff Notes version (TL;DR)
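The connect / combine / consume idea above can be made concrete with a toy Python sketch: one source is a relational database (in-memory SQLite), the other a SaaS application (a parsed JSON payload), and a "virtual view" combines them at query time without copying anything into a warehouse. All names here (`customer_360`, `crm_incidents`) are illustrative, not Denodo APIs.

```python
import sqlite3

# Hedged sketch of connect / combine / consume: one source is a relational
# database (in-memory SQLite), the other a SaaS incident feed (a parsed JSON
# payload). A 'virtual view' combines them at query time; nothing is copied
# into a warehouse. All names here are illustrative, not Denodo APIs.

# Source 1: connect to a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)",
               [(1, "Acme Corp"), (2, "Globex")])

# Source 2: connect to a SaaS incident feed (simulated).
crm_incidents = [{"customer_id": 1, "severity": "high"},
                 {"customer_id": 1, "severity": "low"},
                 {"customer_id": 2, "severity": "high"}]

def customer_360(customer_id):
    """Combine both sources on demand: the 'virtual view'."""
    name = db.execute("SELECT name FROM customer WHERE id = ?",
                      (customer_id,)).fetchone()[0]
    open_incidents = sum(1 for i in crm_incidents
                         if i["customer_id"] == customer_id)
    return {"name": name, "open_incidents": open_incidents}

# Consume: any client just calls the view.
print(customer_360(1))  # {'name': 'Acme Corp', 'open_incidents': 2}
```

The point of the sketch is the fourth bullet: the data never leaves its sources; the view is computed on demand.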
Example using Microsoft Power BI
Accessing data for Reports and Dashboards
OK… What About Performance? (The first question that everyone asks)
1. Query Delegation
Moving the processing to the data
2. Advanced query rewriting for analytical queries
Partial aggregation pushdown, JOIN-UNION reordering, branch pruning, etc.
3. Offloading of processing to MPP cluster
Take advantage of your Hadoop or Spark cluster
4. Caching
Cache data from slow data sources (‘Temporary materialization’)
The cache can be your Hadoop or Spark cluster
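Option 4 can be sketched in a few lines of Python: the first query against a slow source materializes the result into a local cache table ("temporary materialization"), and later queries are served locally. SQLite stands in for the cache store, which per the slide could be a Hadoop or Spark cluster; all names are illustrative.

```python
import sqlite3
import time

# Hedged sketch of 'temporary materialization' (caching). SQLite stands in
# for the cache store; names (sales_by_region, sales_cache) are illustrative.

cache = sqlite3.connect(":memory:")

def slow_source_query():
    """Simulate a slow data source, e.g. a rate-limited SaaS API."""
    time.sleep(0.2)
    return [(1, "EMEA", 1200.0), (2, "APAC", 950.0)]

def sales_by_region():
    # The first call materializes the source result into a local table;
    # every later call is served from the cache without touching the source.
    populated = cache.execute(
        "SELECT name FROM sqlite_master WHERE name = 'sales_cache'").fetchone()
    if not populated:
        cache.execute(
            "CREATE TABLE sales_cache (id INTEGER, region TEXT, amount REAL)")
        cache.executemany(
            "INSERT INTO sales_cache VALUES (?, ?, ?)", slow_source_query())
    return cache.execute(
        "SELECT region, amount FROM sales_cache ORDER BY id").fetchall()
```

A real platform would also handle cache expiry and refresh; the sketch only shows the materialize-once, read-many pattern.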
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME
Star schema: the fact table (sales) plus the time, product, and retailer dimensions, spread across the EDW and MDM sources.

  SELECT retailer.name,
         product.name,
         SUM(sales.amount)
  FROM sales
    JOIN retailer ON sales.retailer_fk = retailer.id
    JOIN product ON sales.product_fk = product.id
    JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
    AND product.brand = 'ACME'
  GROUP BY product.name, retailer.name
Data Virtualization Platform
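The query on this slide can be exercised end to end against a tiny in-memory star schema. In the sketch below, SQLite stands in for the federated EDW/MDM sources (so add_months becomes SQLite date arithmetic); table and column names follow the slide, while the sample rows are invented for illustration.

```python
import sqlite3

# Runnable sketch of the slide's logical-data-warehouse query. SQLite stands
# in for the federated sources; the sample rows are made up for illustration.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE retailer (id INTEGER, name TEXT);
CREATE TABLE product  (id INTEGER, name TEXT, brand TEXT);
CREATE TABLE time     (id INTEGER, date TEXT);
CREATE TABLE sales    (retailer_fk INTEGER, product_fk INTEGER,
                       time_fk INTEGER, amount REAL);
INSERT INTO retailer VALUES (1, 'MegaMart'), (2, 'ShopRite');
INSERT INTO product  VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'Other');
INSERT INTO time     VALUES (1, date('now', '-10 days')),
                            (2, date('now', '-60 days'));
INSERT INTO sales VALUES (1, 1, 1, 100.0), (1, 1, 1, 50.0),
                         (2, 1, 1, 75.0),  (1, 2, 1, 999.0),
                         (1, 1, 2, 500.0);
""")

rows = con.execute("""
    SELECT retailer.name, product.name, SUM(sales.amount)
    FROM sales
      JOIN retailer ON sales.retailer_fk = retailer.id
      JOIN product  ON sales.product_fk  = product.id
      JOIN time     ON sales.time_fk     = time.id
    WHERE time.date > date('now', '-1 month')   -- 'during the last month'
      AND product.brand = 'ACME'
    GROUP BY product.name, retailer.name
""").fetchall()
print(rows)  # only last-month sales of ACME products survive the filters
```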
Query Before Optimization
Without optimization, the platform fetches each table separately and performs all of the JOINs and the final GROUP BY itself:

  SELECT sales.retailer_fk, sales.product_fk,
         sales.time_fk, sales.amount
  FROM sales
  (300,000,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

  SELECT time.date, time.id
  FROM time
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  (30 rows)

JOIN → JOIN → JOIN (10,000,000 rows) → GROUP BY product.name, retailer.name
Step 1 – Apply JOIN Re-ordering to Maximize Delegation
Re-ordering the JOINs lets the join between sales and time, along with its date filter, be delegated to the source:

  SELECT sales.retailer_fk, sales.product_fk, sales.amount
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  (30,000,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

JOIN → JOIN (10,000,000 rows) → GROUP BY product.name, retailer.name
Step 2 – Partial Aggregation Pushdown
The JOINs are on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, so the partial aggregation push-down optimization can be applied:

  SELECT sales.retailer_fk, sales.product_fk,
         SUM(sales.amount)
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
  GROUP BY sales.retailer_fk, sales.product_fk
  (10,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows)

JOIN → JOIN (1,000 rows) → GROUP BY product.name, retailer.name
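The equivalence behind this step is easy to verify: summing per-key partial sums gives the same answer as summing after the join, because each fact row matches exactly one dimension row on its foreign key. A minimal SQLite sketch (illustrative two-table schema, not the slide's full star schema):

```python
import sqlite3

# Sketch of partial aggregation push-down: pre-aggregating the fact table by
# its foreign key at the source, then joining the much smaller partial result
# to the dimension, gives the same answer as aggregating after the join.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (id INTEGER, name TEXT);
CREATE TABLE sales   (product_fk INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Anvil'), (2, 'Rocket');
INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 25.0), (2, 25.0);
""")

# Naive plan: ship every fact row, aggregate after the join.
naive = con.execute("""
    SELECT product.name, SUM(sales.amount)
    FROM sales JOIN product ON sales.product_fk = product.id
    GROUP BY product.name
""").fetchall()

# Pushed-down plan: the source pre-aggregates to one row per key, so only
# as many rows as there are distinct keys cross the network.
pushed = con.execute("""
    SELECT product.name, SUM(partial.total)
    FROM (SELECT product_fk, SUM(amount) AS total
          FROM sales GROUP BY product_fk) AS partial
    JOIN product ON partial.product_fk = product.id
    GROUP BY product.name
""").fetchall()

assert naive == pushed  # same result, far fewer rows moved
```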
Step 3 – Choose Best JOIN Methods
The optimizer selects the right JOIN strategy for each branch based on costs from the data volume estimations: a HASH JOIN for the retailer dimension and a NESTED JOIN for the product dimension, which pushes the filtered product keys into the delegated query as an IN list:

  SELECT sales.retailer_fk, sales.product_fk,
         SUM(sales.amount)
  FROM sales JOIN time ON sales.time_fk = time.id
  WHERE time.date > add_months(CURRENT_TIMESTAMP, -1)
    AND sales.product_fk IN (1, 2, …)
  GROUP BY sales.retailer_fk, sales.product_fk
  (1,000 rows)

  SELECT retailer.name, retailer.id
  FROM retailer
  (100 rows, HASH JOIN)

  SELECT product.name, product.id
  FROM product
  WHERE product.brand = 'ACME'
  (10 rows, drives the NESTED JOIN's IN list)

NESTED JOIN → HASH JOIN (1,000 rows) → GROUP BY product.name, retailer.name
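The nested-join strategy amounts to two round trips: evaluate the small, selective dimension branch first, then push its keys into the fact-table query as an IN list, mirroring the WHERE ... IN (1, 2, …) step on the slide. A hedged Python sketch, with SQLite standing in for the sources and illustrative names throughout:

```python
import sqlite3

# Sketch of a NESTED JOIN with IN-list push-down: fetch the filtered
# dimension keys first, then restrict the large fact-table scan at the
# source so it only returns matching rows. Names are illustrative.

source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE product (id INTEGER, name TEXT, brand TEXT);
CREATE TABLE sales   (product_fk INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'Other');
INSERT INTO sales VALUES (1, 100.0), (2, 999.0), (1, 50.0);
""")

# Step 1: evaluate the small, selective branch first.
keys = [r[0] for r in source.execute(
    "SELECT id FROM product WHERE brand = 'ACME'")]

# Step 2: push its keys into the large branch as an IN list, so the
# source filters before anything crosses the network.
placeholders = ",".join("?" for _ in keys)
rows = source.execute(
    f"SELECT product_fk, amount FROM sales "
    f"WHERE product_fk IN ({placeholders})", keys).fetchall()
```

This strategy pays off when the driving branch is small; with a large key set, the HASH JOIN alternative avoids an unwieldy IN list.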
Leveraging the Power of a Hadoop Cluster
Sources: Current Sales (68 M rows), Hist. Sales (220 M rows), Customer (2 M rows). The sales data is joined with customers and aggregated by state; the partial aggregation (group by customer ID) at the sources yields 2 M rows of sales by customer.

1. Partial aggregation push-down: maximizes source processing and dramatically reduces network traffic.
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP cluster.
3. On-demand data transfer: the DV platform automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a table already in the MPP cluster.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based clusters.

System   | Execution Time | Optimization Techniques
Others   | ~19 min        | Simple federation
No MPP   | 43 sec         | Aggregation push-down
With MPP | 26 sec         | Aggregation push-down + MPP integration (Impala, 4 nodes)
Example using Zeppelin Analytics Notebook
Accessing data for analytics and ML
Three Key Takeaways
FIRST Takeaway
Data users have access to a vast array of data and the means to process that data to gain insights… the bottleneck is finding, gathering, and preparing the data.

SECOND Takeaway
Up to 80% of a user's time is spent preparing the data rather than analyzing it. Reducing this time increases the valuable analysis and insights that they deliver.

THIRD Takeaway
Data Virtualization is a technology that lets a wide variety of users quickly and easily find, prepare, and access data from a vast array of data sources for their analytical and ML models.
Thanks!
www.denodo.com [email protected]
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.