PolyBase in SQL Server 16
David J. DeWitt, Rimma V. Nehme
Microsoft Jim Gray Systems Lab
A Little Bit of PolyBase History…
- Past (2012–2013): PolyBase in SQL Server PDW V2
- TODAY (2015): PolyBase is now part of SQL 16 (CTP3)
- Future: PolyBase for IoT???
Our Plan for Today…
PolyBase: what? why? how? what's next?

What IS PolyBase?
Big Picture
Provides a scalable, T-SQL compatible query processing framework for combining data from both worlds: RDBMS and Hadoop.
More Specifically
Allows SQL 16 customers to execute T-SQL queries against relational data in SQL Server and “semi-structured” data in HDFS and/or Azure.
[diagram: a SELECT against SQL Server 16 returns results drawn from Hadoop (non-relational data) and Windows Azure Blob Storage (WASB)]

What Does It Mean? PolyBase for SQL Server 16:
- Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel
- Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure (a sketch follows)
- Full T-SQL interface for HDFS-resident data
- All standard BI tools supported
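For instance, a single T-SQL statement can mix the two worlds; a minimal sketch (both table names are hypothetical, with Trips assumed to be an external table over HDFS):

SELECT c.CustomerKey, AVG(t.Speed) AS AvgSpeed
FROM dbo.Customers AS c      -- ordinary relational table in SQL Server
JOIN dbo.Trips AS t          -- external table surfacing HDFS data
  ON c.CustomerKey = t.CustomerKey
GROUP BY c.CustomerKey;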
PolyBase: what? why? how? what's next?
Why All the Interest in Big Data?
- Increased number and variety of data sources that generate large quantities of data: sensors (e.g., location, speed, acceleration rates, acoustical, …), Web 2.0 (e.g., Twitter, wikis, …), and web clicks
- Dramatic decline in the cost of hardware, especially storage: if storage were still $100/GB there would be no big data revolution underway
- Realization that data is “too valuable” to delete
Hadoop Community World View
[diagram: HDFS (Hadoop) at the center – “Big Data goes HERE” – surrounded by the rest of the world]
HDFS:
- Scalable distributed file system
- Underpinnings of the entire Hadoop ecosystem
- Highly fault-tolerant
- Append only – no updates!
Hive, Impala, HAWQ, Spark/Shark: parallel SQL systems for data warehousing on HDFS
Insight
- Two universes of data: structured relational data, and semi-structured data in HDFS for “big data”
- PolyBase goal: provide a scalable, T-SQL compatible query processing framework for combining data from both universes

PolyBase's World View
[diagram: Relational (structured) + HDFS (semi-structured) combine, via PolyBase, into Insight]
“Can I join you?” “You sure CAN!”
PolyBase Use Cases
1. Hadoop as an ETL tool (e.g., cleansing data before loading it)
2. HDFS as “cold”, but query-able, storage
3. Sentiment analysis (e.g., joining relational tables w/ streams of tweets)
4. Sensor data analysis (e.g., mine sensor data for predictive analytics)
Example #1: Progressive Usage-Based Insurance
Price policies based on driver behavior by combining:
- Non-relational sensor data from cars (kept in Hadoop)
- Structured customer data (kept in SQL Server PDW/APS)
Example #2: ShinSeGae (Korea's Amazon)
Basket analysis of online shoppers based on social media behavior by combining:
- Non-relational social media data (kept in Hadoop)
- Structured product data (kept in SQL Server PDW/APS)
All standard BI tools (e.g., Excel, Tableau, PowerView) just work!
Example #3: Wind Turbine Manufacturer
Turbine monitoring: analyzing sensor data from wind turbines deployed worldwide (kept in Hadoop) combined with relational turbine data (kept in SQL Server).
Ability to do change detection, proactive maintenance, and reporting.
PolyBase: what? why? how? what's next?
Step 1: PolyBase for SQL 16 Deployment
Set up a Hadoop cluster (if you don't have one already):
- Hortonworks or Cloudera distributions
- Hadoop 2.0 or above
- Linux or Windows
- On premise or in Azure
Or an Azure Storage account:
- Azure Storage Blobs (ASB) exposes an HDFS layer
- PolyBase reads and writes from ASB using Hadoop RecordReaders/Writers
- No compute push-down support for ASB (a sketch follows)
[diagram: Azure Storage Volumes in Azure]
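For the Azure route, pointing PolyBase at a storage account looks roughly like this (a sketch; the account, container, and credential names are hypothetical, and a database master key must already exist):

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'polybaseuser', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential);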
Step 2: Install One or More SQL Server Instances
- Select the PolyBase feature
- Specify PolyBase service accounts (requires a domain account for scale-out)
- PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services
- Pre-requisite: download and install a JRE
[diagram: a Head Node running SQL 16, the Engine Service, and DMS, plus Compute Nodes each running SQL 16 and DMS]
Step 3: Instantiate a PolyBase Scale-Out Group
1. One machine is used as a Head Node by starting the Engine & DMS services
2. Zero or more other machines become Compute Nodes via a stored procedure (shutdown and restart needed; a sketch follows)
PolyBase computation scale-out group:
- The head node is the SQL Server instance to which queries are submitted
- Compute nodes are used for scale-out query processing on external data
[diagram: Head Node (SQL 16 + Engine Service + DMS) plus Compute Nodes (SQL 16 + DMS each)]
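A sketch of step 2, run on each machine that should become a compute node (the machine and instance names are hypothetical; 16450 is the default DMS control channel port):

-- Join this instance to the scale-out group headed by HEADNODE1
EXEC sp_polybase_join_group N'HEADNODE1', 16450, N'MSSQLSERVER';
-- Then restart the PolyBase Engine and DMS services on this machine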
After Step 3
In case you were wondering… the Head Node is actually also always a Compute Node.
[diagram: the Head Node runs SQL 16, the Engine Service, and DMS, so it doubles as a Compute Node]
Step 4: Choose Hadoop Flavor
Supported Hadoop distributions in CTP3:
- Cloudera CDH 5.1–5.5 on Linux
- Hortonworks 2.0–2.3 on Linux
- Hortonworks 2.0, 2.2, and 2.3 on Windows
What happens under the covers? The right client jars to connect to Hadoop are loaded.
A server configuration value selects the flavor; different numbers map to the various Hadoop flavors. For example, value 4 stands for HDP 2.0 on Linux or ASB, value 5 for HDP 2.0 on Windows, and value 6 for CDH 5.1 on Linux.
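The mapping is driven by sp_configure; for example, to attach to a CDH 5.1 cluster on Linux (value 6 in the mapping above):

EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 6;
RECONFIGURE;
-- A restart of SQL Server (and the PolyBase services) is needed for this to take effect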
Step 5: Attach a Hadoop Cluster

CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, …)    -- Hadoop cluster #1
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, …)  -- Hadoop cluster #2

Hadoop clusters must both be either Cloudera or Hortonworks (for now).
[diagram: two Hadoop clusters and Azure Storage Volumes attached to the SQL 16 scale-out group]
After Setup…
- T-SQL queries are submitted to the head node
- Queries can only refer to tables on the head node and/or external tables
- Compute nodes are used for scale-out query processing on external tables in HDFS/Azure
- Tables on the compute nodes cannot be referenced by queries submitted to the head node
- Hadoop clusters can be shared between different SQL 16 compute groups
- If failover clustering is enabled, queries can be submitted to any compute node
PDW, SQL DW and SQL 16: Key Differences

                      SQL PDW                   SQL DW                    SQL 16
Scale-out storage     Relational, HDFS, Azure   Relational, HDFS, Azure   HDFS, Azure
Scale-out QP          Relational, HDFS, Azure   Relational, HDFS, Azure   HDFS, Azure
Language surface      T-SQL subset              T-SQL subset              Full T-SQL (RTM)
Delivery vehicle      Appliance                 Cloud (PaaS)              “Box” & cloud (IaaS)
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …)
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
HDFS Bridge in PolyBase
DMS is augmented with an HDFS Bridge on each compute node:
- Hides the complexity of HDFS
- Uses Hadoop RecordReaders/Writers, so standard HDFS file types can be read/written
- Used to transfer data in parallel to & from Hadoop
[diagram: each Compute Node runs SQL Server plus DMS with an HDFS Bridge, connected to HDFS on the Hadoop cluster]
Key Technical Challenges (recap) – next: parallelizing data transfers between compute nodes and HDFS data nodes
[diagram: the Engine Service and DMS on the Head Node plus the Compute Nodes (each SQL Server + DMS + DB) connect directly to the Hadoop cluster's DataNodes and their file systems; the Namenode (HDFS) coordinates]
Data moves between the two clusters in parallel.
Key Technical Challenges (recap) – next: imposing structure on data in HDFS, using the external table concept
Creating External Tables

-- Once per Hadoop cluster
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.193.26.177:8020',
    RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050');

-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

-- Once per external table; LOCATION is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
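Once created, the external table behaves like any other table in T-SQL; for instance (the predicate is purely illustrative):

SELECT TOP 10 * FROM [dbo].[Customer] WHERE [Speed] > 65;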
Creating External Tables (Secure Hadoop)

-- Once per Hadoop user
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';

-- The data source now references the credential
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.193.26.177:8020',
    RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
    CREDENTIAL = HadoopCredential);

The file format and external table definitions are unchanged from the previous slide.
PolyBase Query Example #1
Selection on an external table in HDFS:

select * from Customer where c_nationkey = 3 and c_acctbal < 0

A possible execution plan:
1. CREATE temp table T – execute on the compute nodes
2. IMPORT FROM HDFS – the HDFS Customer file is read into T
3. EXECUTE QUERY – Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0
Key Technical Challenges (recap) – next: exploiting the computational resources of Hadoop clusters
PolyBase Query Execution
SQL query → Parser (logical operator tree) → Query Optimizer (physical operator tree) → Query Plan Generator → Engine Service → Hadoop/HDFS
- The query is parsed, and “external tables” stored in HDFS are identified
- Parallel query optimization is performed; statistics on HDFS tables are used in the standard fashion (a sketch follows)
- The query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs
- The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging the computational capabilities of the Hadoop cluster
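Since the optimizer consumes statistics on external tables just as it does for regular tables, creating them explicitly can improve plan quality; a minimal sketch on the Customer external table defined earlier (the statistics name is ours):

CREATE STATISTICS CustomerSpeedStats ON [dbo].[Customer] ([Speed]) WITH FULLSCAN;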
Big Picture Takeaway
[diagram: (1) a SQL query arrives at PDW or SQL 16; (2) map jobs are pushed to the Hadoop nodes; (3–6) HDFS blocks are scanned and intermediate results exchanged with the DB; (7) results are returned]
- Cost-based decision on how much computation to push
- SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs
PolyBase Query Example #2
Selection and aggregate on an external table in HDFS:

select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey

Execution plan:
1. Run MR job on Hadoop – apply the filter and compute the aggregate on Customer; output is left in hdfsTemp
   What really happens here? Step 1) the query optimizer compiles the predicate into Java and generates a MapReduce (MR) job; Step 2) the query engine submits the MR job to the Hadoop cluster
   hdfsTemp: <US, $-975.21>, <FRA, $-119.13>, <UK, $-63.52>
2. CREATE temp table T on the compute nodes
3. IMPORT hdfsTemp – read hdfsTemp into T
4. RETURN OPERATION – Select * from T

Takeaways:
1. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job
2. The query optimizer makes a cost-based decision on which operators to push
PolyBase
- Simplicity: query data in Hadoop and/or data in SQL Server via standard T-SQL
- Highest possible performance: parallelized data transfers between SQL compute nodes and Hadoop clusters; push-down of SQL operations to Hadoop
- Open: supports the most popular Hadoop distributions, for both Linux and Windows
- Full integration with Microsoft Office & BI: Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services
Query Capabilities (1): Joining Relational and External Data
SELECT … FROM …:
1. Querying external tables
2. Joining external tables with regular SQL tables
3. Pushing compute for basic expressions and aggregates
[diagram: external tables referring to data in two HDP Hadoop clusters, joined with a SQL table]
Query Capabilities (2): Push-Down Computation
Pushing compute:
- Either at the data source level, or
- On a per-query basis using query hints (a sketch follows)
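A sketch of the per-query hints, reusing the earlier aggregate query (FORCE and DISABLE EXTERNALPUSHDOWN are the hints SQL 16 exposes for this):

SELECT avg(c_acctbal)
FROM Customer
WHERE c_acctbal < 0
GROUP BY c_nationkey
OPTION (FORCE EXTERNALPUSHDOWN);    -- or OPTION (DISABLE EXTERNALPUSHDOWN)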
Query Capabilities (3): Multiple User IDs
Credential support for multiple user IDs associated with an external data source
Query Capabilities (4): Seamless BI Integration
External tables ‘just’ show up as regular tables in Excel (e.g., PowerQuery), Tableau, and other BI tools.
Parallel Import/Load Scenario
SELECT INTO:
1. Importing data from Hadoop for higher-speed access
2. ‘ETL’ type of processing possible via T-SQL
An external table referring to data in HDP Hadoop clusters is read, and a new SQL table is created (a sketch follows).
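A sketch of the import pattern (the target table name is hypothetical; Customer is the external table defined earlier):

SELECT *
INTO dbo.Customer_Local    -- new SQL table, created on the fly
FROM dbo.Customer          -- external table over HDFS
WHERE Speed > 65;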
Export Scenario – Data Aging into Hadoop
INSERT INTO:
1. Exporting SQL data into Hadoop
2. ‘ETL’ type of processing possible via T-SQL
An external table is used for aging data into Hadoop (as text files); a sketch follows.
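A sketch of the export pattern, assuming the 'allow polybase export' setting and an external table (Customer_Archive here, hypothetical) created over the target HDFS directory:

EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

INSERT INTO dbo.Customer_Archive    -- rows land in Hadoop as text files
SELECT SensorKey, CustomerKey, Speed
FROM dbo.Customer_Local;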
Some Questions You Might Have…
1. Will AlwaysOn (Hadron) be supported?
2. Can a compute node be used for other SQL workloads?
3. Can a compute node share a machine with a Hadoop data node?
4. What SQL Server editions will I need?
5. Will the MapR Hadoop distribution be supported?
6. Is there a limit on the number of compute nodes?
7. Why is it not possible to support different Hadoop distros simultaneously?
PolyBase: what? why? how? what's next?
CTP3/RTM Limitations
1. Each SQL 16 instance can participate in a single PolyBase cluster – a limitation of the current DMS software
2. No support for varchar(max) or unique column types
3. No scale-out for joins between tables on the Head Node and external tables in HDFS (“local table scale-out”)
A Very Short Demo
- SQL 16 with PolyBase (4-node SQL scale-out group)
- 3-node Hortonworks 2.0 Linux cluster
- Car telemetry data in Hadoop (CarSensor_Data)
- Customer data in SQL 16 (Insured_Customers)
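The demo query is along these lines (a sketch; the table names come from the slide, while the column names and predicate are hypothetical):

SELECT ic.FirstName, ic.LastName, cs.Speed
FROM Insured_Customers AS ic    -- relational data in SQL 16
JOIN CarSensor_Data AS cs       -- external table over the Hadoop cluster
  ON ic.CustomerKey = cs.CustomerKey
WHERE cs.Speed > 35;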
Wrapup: PolyBase in SQL Server 16
1. Simplicity – query data in Hadoop, Azure, and SQL Server via standard T-SQL
2. Highest possible performance – parallelized data transfers between SQL 16 instances and Hadoop data nodes; push-down of SQL operations to Hadoop using MR jobs
3. Open – supports the most popular Hadoop distributions, for both Linux and Windows
4. Full integration with Microsoft Office & BI – Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services
Acknowledgements
- The entire PolyBase development team
- Sahaj Saini for the demo and a few slides
- Artin Avanes for his role as the PolyBase PM since inception

THANK YOU
CTP3/RTM Query Processing Limitation

Select C.Name from Orders O, Customers C where C.id = O.CustId and O.price > $1000

The CTP3 plan is likely to be:
1) Use an MR job to find Orders > $1000
2) Import the result (tmp) into the SQL 16 instance on the Head Node
3) Perform the join on the Head Node

A Better Query Plan
1) Use an MR job to find Orders > $1000
2) Import the result into the SQL 16 instances on the Compute Nodes
3) Redistribute the Customers table from the Head Node to the Compute Nodes
4) Perform the join in parallel on the Compute Nodes
The Hadoop Ecosystem
- Management & Monitoring (Ambari)
- Coordination (ZooKeeper)
- Workflow & Scheduling (Oozie)
- Scripting (Pig)
- Machine Learning (Mahout)
- Query (Hive)
- Distributed Processing (MapReduce)
- Distributed Storage (HDFS)
- NoSQL Database (HBase)
- Data Integration (Sqoop/REST/ODBC)
PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in situ
[diagram: SQL queries against SQL Server PDW V2 return results combining the DB with HDFS data via PolyBase]
- Standard T-SQL query language; eliminates the need for writing MapReduce jobs
- Leverages PDW's parallel query execution framework
- Data moves in parallel directly between Hadoop's data nodes and PDW's compute nodes
- Exploits PDW's parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs
SQL Server PDW with PolyBase – all of this is now part of SQL Server 16
PDW Software
Control Node: PDW Engine Service + SQL Server + DMS Service
Compute Nodes: each with DMS + SQL Server
Client connections and user queries arrive via JDBC, OLEDB, ODBC, ADO.NET
Engine Service:
- Parse SQL
- Validate and authorize
- Optimize and build the query plan
- Execute the parallel query
- Return results to the client
Data Movement Service (DMS):
- Separate process on each node
- Shuffles intermediate tables among compute nodes during query execution
PDW Summary
- Scalable SQL Server DW offering
- Highly competitive performance and cost
- Currently only available in appliance form factor, but soon available as the SQL DW Service in Azure
Key components:
- One control node (Engine Service + DMS): compiles and controls execution of SQL queries in parallel
- Multiple compute nodes, each with a SQL Server instance + DMS: execute SQL queries in parallel
PolyBase for SQL 16 reuses the Engine Service and DMS components.
Step 3: Scale Out – Install one or more SQL Server instances
- The SQL installer allows the DBA to specify the desire to use the PolyBase feature
- If selected, the installer installs the PolyBase DLLs (DMS and Engine Service) on each machine and registers them as Windows Services
[diagram: each machine runs SQL16 plus the PolyBase DLLs]
Typical PolyBase Setup
[diagram: PDW appliance – Control Node (Engine Service + DMS) and Compute Nodes (each SQL Server + DMS + DB) – connected to a Hadoop cluster: a Namenode (HDFS) plus DataNodes with their file systems]
- The Hadoop cluster can be either on premise or in the cloud
- The Hadoop cluster can be either a Linux cluster or a Windows cluster
- Hortonworks or Cloudera Hadoop distributions
- Many popular HDFS file formats supported: Text format, RCFile format, ORCFile format, Parquet format (future), …
Key Components Reused for SQL Server 16
- Engine Service: parses, optimizes, & orchestrates parallel execution of queries over relational tables and HDFS data
- DMS: used to move data between compute nodes and Hadoop data nodes
- HDFS Bridge in DMS: used to read/write files/directories in HDFS or Azure
- External table construct: used to “surface” records in HDFS or Azure files to SQL
- MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster