buyers guide 2020 the complete buyer’s guide for ......sql-on-anything a query federation engine...

The Complete Buyer’s Guide For Intelligent Data VirtualizationFuture proof your technology choices and avoid vendor lock-in

Buyers Guide2020

By 2022, 60% of all

organizations will

implement data

virtualization as one

key delivery style in

their data integration

architecture.

GARTNER Gartner Market Guide for Data Virtualization

ABOUT THIS GUIDE

Data virtualization technology has been around for quite some time, primarily supporting operational

use cases. However, the explosion of data in size and variety, coupled with the increased focus on

analytical use cases, has created new challenges for legacy data virtualization technologies. The

need for ad-hoc access to both live and historical data steadily increases among business users and

that demand stretches the limits of how the most robust analytics tools can process these massive

datasets. Moreover, the volume and speed with which data is generated is beyond the capacity and

economic bounds of today’s typical enterprise infrastructures.

In this guide, we will lay out some key considerations for organizations looking to apply data

virtualization to their analytics use cases. Along with key features and capabilities, we’ll discuss the

differences between data virtualization and query federation, drill down on caching techniques and

include a detailed ranking for evaluating vendors in the space.

TABLE OF CONTENTS

What Is Data Virtualization? ___________________________________________1

Is Your Organization Ready For Data Virtualization? ________________________1

Getting Started On Your Search _________________________________________2

What Is “Intelligent” Data Virtualization? ________________________________3

Key Considerations ___________________________________________________4

Analytics Versus Operational Use Cases _______________________________4

Data Sources & Connections ________________________________________5

Development Environment _________________________________________6

Calculations & Analytical Functions (OLAP) ____________________________6

Query Performance & Caching Approaches ____________________________7

Client-side Requirements __________________________________________8

Security & Governance _____________________________________________8

Feature Ranking ______________________________________________________9

Conclusion _________________________________________________________11

Resources & Further Reading __________________________________________12

© 2020 AtScale Inc. All rights reserved. 1

WHAT IS DATA VIRTUALIZATION?

IS YOUR ORGANIZATION READY FOR DATA VIRTUALIZATION?

It wasn’t that long ago when most believed that a centralized data warehouse could be the data panacea

for the enterprise. It didn’t take long to realize that the demands of the business were too fluid to wait

on IT to collect, normalize and store data in a single physical location. Since then, our challenges of

providing an enterprise data “single source of truth” have become even more elusive. The explosion

of data sizes, varieties and formats have made it all but impossible to standardize on a single storage

format. Even worse, with the emerging popularity of the public cloud, data stewards now need to deal

with data locale as well. While it’s tough enough for those who need to engineer and secure the data, it’s

even harder for the consumers of data. We are now asking our business users to engineer, wrangle and

access data on-premise, in the cloud, in databases and in data lakes.

If these types of challenges sound familiar, you may be ready to explore data virtualization. By

abstracting away how and where data is stored, data virtualization frees the enterprise from the


inflexibility and costs that data silos beget. By presenting users with logical data views, data

virtualization creates agility for the enterprise while future-proofing the technology choices for

years to come.

Even though data virtualization technology has been around for several years, the emergence of the

public cloud and the explosion in the number of data platforms and analytics tools has made virtualization

an essential component for enterprises. According to Gartner, by 2022, “60% of all organizations will

implement data virtualization as one key delivery style in their data integration architecture.”

GETTING STARTED ON YOUR SEARCH

There are a variety of data virtualization approaches in the market and it’s important to understand

the different styles before launching your search. According to Andrew Brust in his GigaOM report

titled “Data Virtualization: A Spectrum of Approaches”, there are four main approaches or styles for

data virtualization.

APPROACH DESCRIPTION PROS CONS EXAMPLE VENDORS

Core Data Virtualization Platforms

A stand alone virtual data (semantic) layer that abstracts the physical data platform and location and allows queries across disparate data platforms.

Platform independent

True semantic layer

Caching for performance

Separate service to manage

Data needs to be modeled

AtScale

Denodo

Dremio

TIBCO DV

SQL-On-Anything A query federation engine that allows a single SQL query to combine data from more than one data platform.

Good for file-based access

Scale out with clustering

Users need to understand remote schemas

Unpredictable performance

Amazon Athena

Apache Drill

Presto

Remote Data Source Bridges A database extension that is embedded in a RDBMS that allows SQL access to remote databases using “external” tables.

Good when one primary ware-house is used

No separate service to manage

Users need to understand remote schemas

Unpredictable performance

Amazon Redshift Spectrum

IBM Db2 Big SQL

Microsoft SQL Server Polybase

Oracle Big Data SQL

Teradata QueryGrid

Autonomous Data Warehouses A platform that automates the modeling, integration, and con-nectivity of source data which is then loaded into a target data platform.

Data stays in one place

Performance is easier to predict

Requires data movement (data copies)

Data latency or staleness

Data Virtuality

Infoworks.io

Incorta

Illustration 1: Approaches to data virtualization.


To further summarize, you can think of these approaches as pure data virtualization, database

federation and next generation ETL tools. If you are looking for the most flexible approach, the pure

data virtualization platforms are your best choice. These platforms provide the best degree of “future-

proofing” since they allow users to plug and play with new data platforms as they emerge. Also, these

tools usually allow users to define virtual calculations which promotes the consolidation of business

terms, definitions and calculations in a central location. In contrast, the database federation tools force

you into a single database platform while the autonomous data warehouses require data movement and

lock customers into a single vendor’s ecosystem.

WHAT IS “INTELLIGENT” DATA VIRTUALIZATION?

Traditional data virtualization platforms typically rely on the operator to manually tune query

performance. This becomes a difficult task especially when dealing with multiple data platforms. A

federated query that joins data from more than one data platform can create an unreasonable amount

of data movement in real time. Simple memory and table caching are not enough to handle the variation

in query patterns, and transferring atomic data for heterogeneous joins may create an unacceptable

volume of network traffic which results in a poor and frustrating user experience.

“Intelligent” data virtualization addresses these scale and performance issues. Intelligent data

virtualization platforms avoid high volumes of real time traffic due to federated joins by creating and

managing a distributed and data platform optimized cache. By inspecting query patterns and data

platform and network performance characteristics, this approach ensures that the heavy query lifting

is pushed down to the remote data platforms. By avoiding unnecessary data transfers, intelligent data

virtualization delivers more consistent query performance with far less resource consumption.

Recommendation: The Core Data Virtualization style of data virtualization delivers

the most flexibility for future-proofing your information architecture and minimizing

vendor lock-in.


KEY CONSIDERATIONS

When choosing a vendor, there are a few core capabilities to keep in mind. Depending on your needs,

you can weigh the options accordingly. The following categories are further broken down in our

checklist later in this document.

Analytics Versus Operational Use Cases

Until recently, traditional data virtualization has focused primarily on operational application

integration use cases. In the last few years, enterprises have expanded their use of data virtualization

beyond just development or test deployments. According to Gartner, in 2011 only 11% of surveyed

organizations reported that they were utilizing data virtualization and primarily in operational

application integration—as a semantic tier to multiple datasets that were not permanently stored in

operational data stores, warehouses or marts. By 2018, however, as many as 40% of organizations were

utilizing data virtualization with the majority targeting analytics use cases.

If you are targeting operational application integration as a use case, traditional data virtualization

platforms may be good enough. In these use cases, the data size and query profiles tend to be more

predictable and thus more suitable to a one time optimization effort. However, analytics workloads

are much more varied and tend to scan and aggregate large amounts of data. For these use cases, it is

essential that the virtualization solution support autonomous performance management. One time,

manual performance tuning is not suitable for analytical workloads and will likely lead to low end user

adoption due to unpredictable and poor query performance.

Recommendation: When choosing a vendor, make sure that query performance

management is automated based on user query patterns, network performance and

data platform capabilities.

Recommendation: For analytical use cases, choose an intelligent data virtualization

platform that automatically manages query performance.


Data Sources & Connections

There are two questions to consider when choosing a data virtualization solution with regards to data

platform connections.

The first is rather obvious: how many different data platforms or data sources does the solution support?

To maximize flexibility, your vendor should support data sources that are (1) relational (i.e Oracle,

Teradata, Snowflake), (2) file-based (i.e. CSV, JSON, XML, HDFS, S3), (3) API based (i.e. REST, HTML)

and (4) application based (i.e. Salesforce, Workday, Service Now). With this type of coverage, your

virtual view designers can incorporate just about any data, on-premise, in the cloud, structured and

unstructured without involving ETL or manual data movement.

The second consideration may be less obvious but is even more critical than the first. The question to

ask is “Where does the query processing happen?”. There are two different answers to this question.

The first answer is to deploy a “least common denominator” approach. In this style, the virtualization

engine exposes its own proprietary dialect (usually based on Postgres) that is responsible for all the data

processing and aggregation. This approach becomes problematic when your data platforms have varied

data types or require platform specific functions. With a lowest common denominator approach, the

virtualization engine is required to stream atomic level data to the vendor’s proprietary query engine for

applying calculations to aggregation functions. This approach tends to break down when data is large

since data platform query push down is limited.

The second answer and approach is a pure query push down. In this approach, queries are passed through

to the data platform and aggregations and platform specific functions are performed remotely. In this

case, for table joins, only the pre-aggregated data is streamed to the virtualization engine which minimizes

the amount of query time and data movement.

For analytical use cases, it’s imperative to minimize query time data transfer for heterogeneous table joins.

Recommendation: For analytical use cases, choose a data virtualization platform

that maximizes query pushdown to minimize query time data movement.


Development Environment

When data virtualization first arrived on the scene in the early 2000s, IT managed data pipelines and the

centralized data warehouse was king. Later, business intelligence (BI) innovators like Tableau and Qlik

democratized reporting and analytics with their user friendly, self service oriented tools. As these tools

took off in the enterprise, IT’s role as the centralized arbiter of data access became less the norm and

more of the exception.

When evaluating data virtualization vendors, it’s important to keep this democratization and self service

trend in mind. Therefore, a citizen data scientist or business analyst should be able to design virtual

data views and semantic models in the data virtualization platform. In fact, the process of defining

virtual data views should be as easy as building a Tableau worksheet. With this in mind, it’s imperative

that you choose a data virtualization vendor that supports a web-based (as opposed to desktop based)

design environment that promotes multiple, simultaneous designers and leverages a library to promote

re-use and standardization.

Calculations & Analytical Functions (OLAP)

If you are looking to address analytical use cases, it’s imperative that your chosen platform can handle

the complex analytical calculations required by business applications. For example, just about every

analytical use case requires time calculations and time intelligence. At a minimum, this means that your

data virtualization platform must support period-over-period (i.e., this year/last year), period-to-date

(i.e., year to date, quarter to date) and moving averages (i.e., monthly average sales). In addition, many

analytical use cases require semi-addition metrics: distinct counts for counting unique customers, first

and last functions for reporting on beginning and ending inventory levels. These types of calculations

are often referred to as OLAP calculations. In addition to OLAP functionality, support of the OLAP

protocol (MDX) means that the virtual platform can also serve queries from tools like Excel, Cognos,

BusinessObjects and Microstrategy that speak the MDX protocol.

Recommendation: Choose a data virtualization platform with a web-based,

multi-user design environment with libraries to promote re-use and enforce

standardization.


Without support for analytical functions and MDX (OLAP), analytical workloads must be pushed down

into the tooling layer which subverts the power of the universal semantic layer and often results in

inconsistent business definitions and conflicting reports.

Query Performance & Caching Approaches

When evaluating vendors, this is arguably the area where you should spend most of your time. Without

consistent and performant query serving, a virtualization platform has little value.

In analytical use cases, business users are accustomed to interactive query performance since they

typical query proprietary analytical databases or cubes that are designed for fast queries. As a result, a

virtualization platform needs to deliver even better performance than the native platforms they interact

with since the virtualization layer needs to match or beat the existing solutions they are replacing.

To make matters worse, data virtualization queries often include heterogeneous database joins that

further tax query performance.

Data virtualization solutions that simply cache query results or create cached tables are not sufficient

for analytical use cases. As mentioned above, analytical queries are too variable and often scan and

aggregate massive amounts of data on the fly.

Recommendation: Choose a data virtualization platform with a web-based,

multi-user design environment with libraries to promote re-use and enforce

standardization.

Recommendation: Choose a data virtualization vendor that includes a

comprehensive performance management system that goes beyond simple caching

techniques.


Client-side Requirements

In order to maximize the ROI for a data virtualization platform investment, it’s extremely important

to drive broad adoption for your data consumers in the enterprise. There’s little value to a universal

semantic layer if it’s not used by business analysts and data scientists as their primary means to

access data. Therefore, it’s imperative that your chosen data virtualization vendor platform supports

a lightweight client-side footprint. For example, does your data virtualization vendor require the

installation of custom drivers on client desktops? Does the virtualization platform require a desktop

application for managing or discovering new virtual views? If either question is “yes”, IT must now push

client-side software and manage versioning on every desktop that needs access to data. In a large

enterprise, this may be a huge inhibitor to end user adoption.

Security & Governance

Since data virtualization platforms serve as middleware for analytical queries, it’s imperative that the

platform integrates with the enterprise’s security infrastructure. There are two main forms of security to

consider: authentication & authorization.

First, a data virtualization platform must integrate with the enterprise’s single sign-on infrastructure

in order to authenticate users, whether that be Active Directory (AD), LDAP, OAuth or other third party

authentication platforms. The authorization capabilities must flow through the client applications and

the data virtualization platform must synchronize users automatically.

Second, the data virtualization platform must include the ability to hide or mask sensitive columns,

limit data rows based on user access rules and impersonate users when querying the underlying data

sources. Impersonation is especially crucial since using a proxy user (instead of the query user) to query

underlying data sources may circumvent security policies for those data platforms and force users to

duplicate security policies in the virtualization layer.

Recommendation: Choose a data virtualization vendor with a zero client-side

footprint, leveraging existing tooling connectors for accessing the virtualization layer.


FEATURE RANKING

The following feature table is a tool for evaluating different vendors along the capability categories

described above. Use a number between 1 and 5 (5 being best) to score the vendor’s capabilities for

each feature. You may also use the weighting column to personalize the scoring results based on your

most important priorities.

Recommendation: Choose a data virtualization vendor that integrates with your

single sign-on standards and supports column level security, row level security and

impersonation.

GET STARTED


FEATURE CATEGORY FEATUREScore

(1-5, 5 = BEST)Weigh

(1-5, 5- BEST)Weighted Score

(CALC)

Use Cases

Supports analytical workloads

Supports data science workloads

Supports operational workloads

Data Sources & Connections

Supports a large number of data sources

Supports relational/MPP data sources

Supports SaaS data sources (Salesforce, Workday)

Supports file-based data sources (CSV, Excel)

Supports web application based data sources (HTTP, REST)

Supports Big Data data sources (Hadoop, Spark)

Supports non-scalar data sources (JSON, Arrays, Structs)

Development Environment

Supports web-based development (versus client application)

Supports multiple, simultaneous editors for virtual view development

Supports object/calc re-usable across virtual views/projects

Supports development lifecycle (dev/test/prod)

Calculations and Analytical Functions (OLAP)

Supports Time Intelligence (period over period, period to date)

Supports post query calculations

Supports aggregation functions (SUM, AVG, MAX, MIN)

Supports non-additive metrics (Distinct Count, First, Last)

Supports MDX expressions

Supports live Excel pivot tables

Query Performance & Caching

Supports automated query performance management

Supports in-memory caching

Supports persistent (table) caching

Support query & predicate pushdown (no least common denominator)

Support data source/network capability aware optimization

Client-side Requirements

Supports zero client install for data consumers

Supports zero client install for virtual view developers

Supports wide range of tooling natively (i.e. built-in)

Total


CONCLUSION

So much has changed since data virtualization platforms hit the scene in the early 2000’s. Data

has grown in scale and has become more varied, the data lake arrived to compete with the data

warehouse, the public cloud became a compelling destination and data platforms continue to

proliferate, not consolidate. To summarize, here are the key recommendations to keep in mind as you

choose your vendor:

1. The Core Data Virtualization style of data virtualization delivers the most flexibility for future-proofing

your information architecture and minimizing vendor lock-in.

2. When choosing a vendor, make sure that query performance management is automated based on

user query patterns, network performance and data platform capabilities.

3. For analytical use cases, choose an intelligent data virtualization platform that automatically

manages query performance.

4. For analytical use cases, choose a data virtualization platform that maximizes query pushdown to

minimize query time data movement.

5. Choose a data virtualization platform with a web-based, multi-user design environment with libraries

to promote re-use and enforce standardization.

6. Choose a data virtualization vendor that includes a comprehensive performance management system

that goes beyond simple caching techniques.

7. Choose a data virtualization vendor with a zero client-side footprint, leveraging existing tooling

connectors for accessing the virtualization layer.

8. Choose a data virtualization vendor that integrates with your single sign-on standards and supports

column level security, row level security and impersonation.

As you can see, there’s a lot to consider when choosing a data virtualization platform. Refer to this guide

and ranking to help you make order from the chaos and have the confidence to chose the best vendor to

realize your ambitions of simplifying your data analytics stack.


ABOUT ATSCALE

The Global 2000 relies on AtScale – the intelligent data virtualization company – to provide a single, secured and governed

workspace for distributed data. The combination of the company’s Autonomous Data Engineering™ and Universal Semantic

Layer™ powers business intelligence and machine learning resulting in faster, more accurate business decisions at scale. For

more information, visit www.atscale.com.

RESOURCES & FURTHER READING

GigaOM Market Landscape Report: “Data Virtualization: A Spectrum of Approaches”

Gartner Research: “Market Guide for Data Virtualization”

ReadWrite: “How Intelligent Data Addresses the Chasm in Cloud”

AtScale Blog: “What is Intelligent Data Virtualization and Why Do You Need It?”

https://gigaom.com/report/data-virtualization-a-spectrum-of-approachs/

https://www.gartner.com/en/documents/3893219/market-guide-for-data-virtualization

https://readwrite.com/2019/11/22/how-intelligent-data-addresses-the-chasm-in-cloud/

https://www.atscale.com/blog/what-is-intelligent-data-virtualization-and-why-do-you-need-it/

buyers guide 2020 the complete buyer’s guide for ......sql-on-anything a query federation engine...

Documents