
Upload: nguyennhu

Post on 11-Mar-2018


Denodo Data Virtualization (DV)

Single View of Truth

“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”

Source: Gartner Strategic Planning Assumption (SPA), from its published 2017 predictions research.


What is Data Virtualization?

• A method of data integration that does not physically move data or create new copies of it

• A method of data integration that isolates users from the format, location, technologies, and protocols used to store and access data

• Real-time access to any data type: structured, semi-structured, or unstructured

• A many-to-one approach: virtually join any number of data sources and source types into a single view


Problem: IT Architecture is Unmanageable

Data sources:

• Log files (.txt / .log files)
• CRM (MySQL)
• Billing System (Web Service - REST)
• Big Data, Cloud (Hadoop, Web)
• Inventory System (MS SQL Server)
• Product Catalog (Web Service - SOAP)
• Customer Voice (Internet, Unstructured)
• Product Data (CSV)

ETL

Traditional Issues: High Data Growth, IT Complexity, Data Silos, High Latency

New Trends: Real Time, Big Data, Unstructured Data, External Data, Move to Cloud


Solution: Virtual Data Layer

[Diagram: the data sources, ETL, traditional issues, and new trends listed on the previous slide, shown together with a Data Virtualization layer.]


Denodo DV: Connectivity to Any Data Type

Relational DBs: Oracle, DB2, Sybase, MS SQL Server, MySQL, PostgreSQL, Informix, MS Access…

Parallel DBs & Appliances: Teradata, Netezza, Oracle Exadata, Sybase IQ, Greenplum, ParAccel…

Multidimensional OLAP Engines: SAP BW, MS SQL Server Analysis Services, Mondrian, Essbase…

SOAP / REST Web Services and Data Feeds: XML, RSS, ATOM, JSON, OData, Delimited Files – CSV, log files, device feeds, ...

Enterprise Applications: SAP R/3 / ECC, Oracle E-Business Suite, Siebel, PeopleSoft, SAS...

Content Management Sys (CMS): MS SharePoint, IBM FileNet, Documentum…

Modeling Tools: Erwin, Rochade, ER Studio…

MDM & Mapping: IBM Initiate, ontologies, taxonomies…

Mainframe / Legacy Connectivity: Adabas, IMS, DB2, TN5250 / TN3270.

Plug-in architecture: third party Mainframe / Legacy Adapters...

Semantic repositories in Triple Stores / RDF accessed via SPARQL endpoints

LDAP and Active Directory: as source data & security access

Big Data / NoSQL: Hadoop, Hive, HCatalog, Impala, Sqoop, HBase, Pig, HDFS, MapReduce, Avro, MongoDB, CouchDB, Neo4j, Cassandra, MarkLogic…

Cloud, SaaS: Salesforce, Google, Amazon, LinkedIn, Facebook, Twitter via APIs; Any Website, Form, any Web based Apps…

Enterprise Service Bus: JMS message queues, WebSphere MQ, Sonic, ActiveMQ…

Custom Connector SDK: access any application via API and procedural interfaces.

Semi-Structured Data: Web sites, Forms, applications, PDF, MS Word, MS Excel

Unstructured Data: websites, file systems, email servers, databases, knowledge bases, indexes (Lucene, MS FAST, HP Autonomy…), RSS feeds …


Denodo Platform Architecture

[Architecture diagram; recoverable labels follow.]

Business Solutions: Access Information-as-a-Service

Denodo Platform: Right Information at the Right Time

Disparate Data: Any Source, Any Format

Data Virtualization layers:

• Publish: Real-time (Right-time) Data Services
• Combine: Transform, Improve Quality, Integrate
• Connect: Normalized Views of Disparate Data

Platform components: Design Tools, Optimizer, Cache, Scheduler, Monitoring, Governance, Metadata, Security

Library of Wrappers, Web Automation, Any Data or Content, Read and Write


Common Data Virtualization Use Cases

BIG DATA, CLOUD INTEGRATION

Advanced Analytics

Data Warehouse Offloading

Big Data for Enterprise

Cloud / SaaS Integration

AGILE BUSINESS INTELLIGENCE

Logical Data Warehouse

Virtual Data Marts

Self-Service BI

Operational BI / Analytics

SINGLE VIEW APPLICATIONS

Single Customer View - Call Centers, Portals

Single Product View - Catalogs

Single Inventory View - Inventory Reconciliation

Vertical Specific - Single View of Wells

DATA SERVICES

Unified Data Services Layer

Logical Data Abstraction

Agile Application Development

Linked Data Services

[Quadrant axes: Analytical vs. Operational; Business Use Cases vs. IT Use Cases]


What are analysts saying?

“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”

Source: Gartner Research, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure Modernization.

“Through 2020, 35% of enterprises will implement some form of data virtualization as one enterprise production option for data integration.”

Source: Gartner Research, Market Guide for Data Virtualization, 2016.


Performance & Security


Performance

Core Concepts

Architecture designed for both informational and operational scenarios

Focused on 3 core concepts:

1. Dynamic Multi-Source Query Execution Plans
• Leverages the processing power and architecture of the data sources
• Dynamic, to support ad hoc queries

2. Selective Materialization
• Intelligent caching of only the most relevant and most often used information

3. Optimized Resource Management
• Smart allocation of resources to handle high concurrency
• Throttling to control and mitigate source impact
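The throttling idea in the third core concept can be illustrated with a small sketch. This is not Denodo's implementation: `ThrottledSource`, `demo`, and the fake source function are hypothetical names invented here, and a counting semaphore is simply one common way to cap how many queries reach a source at the same time.

```python
import threading
import time


class ThrottledSource:
    """Hypothetical wrapper: at most `max_concurrent` queries reach the source at once."""

    def __init__(self, run_query, max_concurrent):
        self._run_query = run_query
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def query(self, sql):
        # Block until a slot is free, so excess requests wait in the
        # virtual layer instead of piling up on the source system.
        with self._slots:
            return self._run_query(sql)


def demo(max_concurrent=2, requests=8):
    """Fire `requests` concurrent queries and report the peak load seen by the source."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_source(sql):
        # Stand-in for a real data source; it only tracks concurrency.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # simulated source latency
        with lock:
            state["active"] -= 1
        return f"result of {sql}"

    source = ThrottledSource(fake_source, max_concurrent)
    results = [None] * requests

    def worker(i):
        results[i] = source.query(f"SELECT {i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], results
```

Calling `demo(2, 8)` issues eight requests from eight threads, yet the fake source never sees more than two of them in flight at once.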


Performance

Ad hoc querying requires an architecture that generates efficient plans at execution time.

Denodo borrows many techniques from traditional RDBMSs, such as:

Cost based execution plans

Based on statistics, indexes, transfer rates, etc.

Multiple JOIN strategies

Merge, Hash, Nested, Parallel Nested, Sorted-Merge

Query rewriting

Redundant filter detection, unnecessary JOIN pruning, etc.

But since data is stored in multiple heterogeneous sources, DV has to apply other techniques to minimize network traffic and minimize processing in the virtual layer:

Maximize query push down – Process at the source

Query rewriting to maximize delegation to sources

Data transformations push-up to maximize delegation

On-the-fly data movement (shipping)

Abstract source capabilities

Emulate in the virtual layer the operations that cannot be pushed down (e.g., a GROUP BY on a flat file)

Optimization techniques
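As an illustration of the last technique, the sketch below emulates a GROUP BY with SUM over a flat file, an operation a CSV source cannot execute itself. It is a minimal sketch rather than Denodo's engine: the function names, the column names `customer_id` and `amount`, and the sample data are assumptions made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat-file content; a real source would be a file on disk.
SAMPLE_CSV = """customer_id,amount
1,10.0
2,5.5
1,2.5
"""


def group_by_sum(rows, key, value):
    """Emulate `SELECT key, SUM(value) ... GROUP BY key` over an iterable of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


def total_sales_by_customer(csv_text=SAMPLE_CSV):
    # The flat file can only be scanned, so every row is read and the
    # aggregation happens here, in the integration layer.
    reader = csv.DictReader(io.StringIO(csv_text))
    return group_by_sum(reader, "customer_id", "amount")
```

Every row of the file has to be streamed into the layer that runs `group_by_sum`, which is why push-down is preferred whenever the source is able to aggregate on its own.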


Proven Performance in IBM Labs

Queries to single source

■ Denodo only adds 3-5% overhead

Source: Denodo testing in IBM labs – TPC-DS Benchmark and DataShip Performance Tests

Join across multiple sources

■ Denodo optimization engine faster than in-house solution


When “Data Lakes” Become “Data Swamps”

Uncontrolled dumping of data in Hadoop leads to poor performance.

Two approaches compared:

• Denodo DV query across Impala and Exadata: MDM data in Exadata (Oracle), large data sets in Hadoop - Impala
• ETL all data into Impala and run the full query there: MDM and large data sets in Hadoop - Impala

Big Data queries run faster using DV because:

• DV automatically collects statistics and source capabilities, then
• Rewrites optimized queries and pushes processing down to the sources
• Thus, heavy processing is performed in the systems designed to do so:
  • Impala (Hadoop) performs heavy aggregations on top of very large data sets
  • Oracle Exadata is faster than Impala at processing dimensional queries


Big Data Queries Faster using DV

Performance comparison of 5 different queries:

Query | Impala Hadoop-only Runtime (s) | Denodo Runtime (s) | Denodo Runtime w/ Cache (s)
Query 1 | 199 | 120 | 68
Query 2 | 187 | 96 | 88
Query 3 | 120 | 212 | 115
Query 4 | timeout | 328 | 69
Query 5 | 46 | 91 | 56

Data Volumes

Queries 1, 2, 3, 5
• Exadata Row Count: ~5M
• Impala Row Count: ~500k

Query 4
• Exadata Row Count: ~5M
• Impala Row Count: ~2M

• DV delivers better performance and saves replicating data into Hadoop

• DV leverages data source architectures for what they are good at.


Performance

Benchmarks: Logical Data Warehouse

Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the following scenario.

Compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.

Data set (shown for both setups in the diagram):

• Customer Dim.: 2 M rows
• Sales Facts: 290 M rows
• Items Dim.: 400 K rows

* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.


Performance

Benchmarks: Logical Data Warehouse - Results

Query Description | Returned Rows | Netezza Time | Denodo Time (Federated Oracle, Netezza & SQL Server) | Denodo Optimization Technique (automatically selected)
Total sales by customer | 2 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.5 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17 K | 3.5 sec. | 5.2 sec. | On the fly data movement


Example: Total sales by Customer

Typical Reporting Tool Process (No Query Rewriting)

SELECT c.id, SUM(s.amount) AS total
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.id

Problem

• Join cannot be pushed down
• Group By is not pushed down
• All sales rows sent to Integration Layer

Un-optimized Result

• All rows transferred: 300M + 2M
• Slow execution and Netezza is underutilized

[Diagram labels: Sales 300 M, Customer 2 M, Join, Group By, Reporting Tool, 1K]


Example: Total sales by Customer

After Denodo’s Rewriting – Full Aggregation Pushdown

SELECT c.id, amount
FROM
(SELECT s.customer_id, SUM(amount) amount
FROM sales s
GROUP BY s.customer_id) s_agg
JOIN customer c
ON (c.id = s_agg.customer_id)

Denodo Benefit

• Group By automatically moved below JOIN without affecting the results (PK-FK join)
• Group By pushed down to Netezza

Optimized Result

• Rows transferred: 2M + 2M

Leverage star-schema features:

• Size of Group By output determined by cardinality of dimensions (small)
• Star-schema joins allow Group By push-down

[Diagram labels: Sales 2M, Customer 2M, Group By, Join, Denodo, Reporting Tool, 1K]
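The claim that the Group By can move below the JOIN without changing the result can be checked on a toy data set. The sketch below is only an equivalence check, not a federated setup: both tables live in one in-memory SQLite database, the three customers and three sales rows are invented, and the `ORDER BY` clauses and the `total` alias on the rewritten query are additions made here so the two result sets compare directly.

```python
import sqlite3

# The query as a reporting tool would send it.
ORIGINAL = """
SELECT c.id, SUM(s.amount) AS total
FROM customer c JOIN sales s ON c.id = s.customer_id
GROUP BY c.id
ORDER BY c.id
"""

# The rewritten form: aggregate sales first, then join on the PK-FK key.
REWRITTEN = """
SELECT c.id, s_agg.amount AS total
FROM (SELECT s.customer_id, SUM(s.amount) AS amount
      FROM sales s
      GROUP BY s.customer_id) s_agg
JOIN customer c ON c.id = s_agg.customer_id
ORDER BY c.id
"""


def build_db():
    """Create a tiny, made-up star schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY);
        CREATE TABLE sales (customer_id INTEGER REFERENCES customer(id),
                            amount INTEGER);
        INSERT INTO customer VALUES (1), (2), (3);
        INSERT INTO sales VALUES (1, 10), (1, 5), (2, 7);
    """)
    return conn


def compare():
    conn = build_db()
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    # Sales rows that must be shipped when the join runs outside the source...
    raw_sales_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    # ...versus rows shipped when the source aggregates first.
    aggregated_rows = conn.execute(
        "SELECT COUNT(*) FROM (SELECT customer_id FROM sales GROUP BY customer_id)"
    ).fetchone()[0]
    conn.close()
    return original, rewritten, raw_sales_rows, aggregated_rows
```

`compare()` returns identical result lists for both queries, plus the two row counts (3 raw sales rows versus 2 aggregated rows here). The same effect at scale is what reduces the transfer from 300M + 2M rows to 2M + 2M on the slides.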


Caching

Real time vs. caching

Sometimes, real-time access and federation are not a good fit:

• Sources are slow (e.g., text files, cloud apps like Salesforce.com)
• A lot of data processing is needed (e.g., complex combinations, transformations, matching, cleansing, etc.)
• Access is limited, or the impact on the sources has to be mitigated

For these scenarios, Denodo can replicate just the relevant data in the cache
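The trade-off between real-time access and caching can be sketched with a simple time-to-live cache placed in front of a slow source. This is a generic illustration, not how the Denodo cache is implemented or configured: `CachedView`, its parameters, and the injected `clock` are hypothetical names chosen for the example.

```python
import time


class CachedView:
    """Hypothetical read-through cache: serve stored rows until they expire."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # function that queries the slow source
        self._ttl = ttl_seconds  # how long a cached result stays valid
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, rows)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: answer from the cache, the source is not touched.
            return entry[1]
        # Missing or expired: go to the source and store only this result.
        rows = self._fetch(key)
        self._entries[key] = (now + self._ttl, rows)
        return rows
```

Only keys that are actually requested are stored, which mirrors the idea of replicating just the relevant data. The time-to-live bounds how stale an answer can be, and a shorter value moves the behaviour back toward real-time access.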


Security in Denodo: Overview

Authentication
• Pass-through authentication
• Kerberos and Windows SSO
• OAuth, SPNEGO

Authentication
• Standard JDBC/ODBC security
• Kerberos and Windows SSO
• Web Service security

LDAP / Active Directory

Role-based Authentication: Guest, employee, corporate

Schema-wide Permissions

Data-Specific Permissions (Row, Column level, Masking)

Policy-Based Security

Data in motion
• SSL/TLS

Data in motion
• SSL/TLS

Encrypted data at rest
• Cache
• Swap


Enterprise Governance

Data Lineage

Find source of ‘truth’ – top down – shows where data comes from and/or how it is derived

Source Refresh

Detect changes in underlying data sources and propagate to the affected data services

Impact Analysis

Analyze impact of metadata changes in workflows where the modified view is used

Catalog Search

Have a complete understanding of each of the views and data services created in Denodo


Enterprise Governance

Data lineage example

Data lineage is available from the Admin Tool and from the web-based Information Self-Service


Management - Monitoring and Diagnosing Tool

Graphical Monitoring of Servers and Clusters; Graphical Problem Diagnosing

■ See current sessions, queries, connections, cache load processes…

■ See resource usage in each server (CPU, memory, connections,…)

■ Inspect data sources and cache statistics (connection pools, response times, active requests…)

■ Go “back in time” to the moment when a problem happened

■ Graphically inspect and browse all the information provided by the Denodo Monitor and server logs:

  ■ Graphical analysis of incidents
  ■ Active requests and sessions
  ■ Resource usage
  ■ Data source statistics


Denodo Platform Architecture



Thanks!

www.denodo.com [email protected]

© Copyright Denodo Technologies. All rights reserved.

Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.