polybase in sql server 16 david j. dewitt rimma v. nehme

62
PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme Microsoft Jim Gray Systems Lab

Upload: russell-perry

Post on 06-Jan-2018

225 views

Category:

Documents


0 download

DESCRIPTION

A Little Bit of PolyBase History… PolyBase is now part of SQL 16 (CTP3) PolyBase in SQL Server PDW V2 TODAY PolyBase for IOT??? 2012 2013 2014 2015 2016 … … … … Future Past

TRANSCRIPT

Page 1: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PolyBase in SQL Server 16 David J. DeWittRimma V. Nehme

Microsoft Jim Gray Systems Lab

Page 2: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

A Little Bit of PolyBase History…

2012 2013 2015 ……… 2016…

Past Future

2014

PolyBase in SQL Server PDW V2

TODAYPolyBase for IOT???

PolyBase is now part of

SQL 16 (CTP3)

Page 3: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

what?why?how?

PolyBase

what

’s ne

xt

Our Plan for Today…

Page 4: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PolyBase?

IS

what?

Page 5: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Big Picture

Provides a scalable, T-SQL compatible query processing framework for combining

data from both worlds

RDBMS HadoopPolyBase

Page 6: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

More Specifically

Allow SQL 16 customers to execute T-SQL queries against relational data in SQL Server and “semistructured” data in HDFS and/or Azure

Hadoop(non-

relational data)

Windows Azure Blob

Storage(WASB)

SQL Server 16

SELECT Results

Page 7: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PolyBase for SQL

Server 16Clusters of SQL

Server 16 instances can be used to

process data residing in HDFS in parallel

Queries can arbitrarily mix relational data

in SQL Server with data

in HDFS or Azure

Full T-SQL interface for HDFS-resident

data

All standard BI tools supported

What Does It Mean?

Page 8: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

what?why?how?

PolyBase

what

’s ne

xt

Page 9: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

All The Interest in Big Data

why?Increased number and variety of data sources that generate large quantities of data

Sensors (e.g. location, speed, acceleration rates, acoustical, …)Web 2.0 (e.g. twitter, wikis, … )Web clicks

Dramatic decline in the cost of hardware, especially storage

If storage was still $100/GB there would be no big data revolution underway

Realization that data is “too valuable” to delete

Page 10: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Hadoop Community World View

rest of the world

Scalable distributed file system

Underpinnings of the entire Hadoop ecosystem

Highly fault-tolerant

Big Data goes HERE

Append only – no updates!

HDFS(Hadoop)

Hive Impala

HAWQ Spark/Shark

Parallel SQL systems for data warehousing on HDFS

Insight

Page 11: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

•Two universes of data• Structured relational data• Semistructured data in HDFS for “big data”

• Polybase Goal:Provide a scalable, T-SQL compatible query processing framework for combining data from both universes

Relational

HDFS

PolybaseInsig

ht

Polybase’s World View PolyBase’s

HDFS

Relational

Insigh

t

(structured)

(semi-structured)

World View

PolyBase

Can I join you?

You sure CAN!

Page 12: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PolyBase Use Cases

Hadoop as an

ETL tool

1

e.g., cleansing data before loading it

HDFS as ‘cold’, but

query-able”

storage

2

Sentiment

analysis

3

e.g., joining relational tables

w/ streams of tweets

Sensor data

analysis

4

e.g., mine sensor data for

predictive analytics

Page 13: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Example #1: Progressive Usage-Based Insurance

Price policies

Structured Customer data

(Non-Relational Sensor data

Sensor data from cars (kept in

Hadoop)

Relational data (kept in SQL Server

PDW/APS)

(based on driver behavior)

Page 14: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Example #2: ShinSeGae (Korea’s Amazon)

Basket Analysis of online shoppers

Structured Product

data

(Non-Relational Social Media

(kept in Hadoop)

Product data (kept in SQL Server

PDW/APS)

(based on social media behavior)

All standard BI tools (e.g. Excel,

Tableau, Powerview) just work!

Page 15: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Example 3: Wind Turbine ManufacturerTurbine MonitoringAnalyzing sensor data from wind turbines deployed world wide (kept in Hadoop) combined with relational turbine data (kept in SQL Server)

Ability to do change detection, proactive maintenance and reporting

Page 16: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

what?why?how?

PolyBase

what

’s ne

xt

Page 17: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Hortonworks or Cloudera Distributions

Hadoop 2.0 or above Linux or WindowsOn premise or in Azure

Step 1: PolyBase for SQL 16 Deployment

Step 1: Set up a Hadoop

cluster (if you don’t have one

already)

Page 18: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Or An Azure Storage Account

• Azure Storage Blobs (ASB) exposes an HDFS layer• PolyBase reads and writes from ASB using Hadoop

RecordReaders/Writers• No compute push-down support for ASB

AzureStorageVolume

AzureStorageVolume

AzureStorageVolume

Azure

Page 19: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Step 2: Install One or More SQL Server Instances

19

• Select PolyBase feature.

• Specify PolyBase service accounts (requires domain account for scale-out)

• PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services.

• Pre-requisite: download and install JRE

Page 20: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Compute NodesHead Node

DMS DMS

SQL 16 SQL

16SQL16

SQL16

SQL16

Engine

Service

DMS

DMS DMS DMSDMS

Step 3: Instantiate a PolyBase Scale Out GroupStep 3: 1. One machine is used as a Head Node by starting Engine & DMS services2. Zero or more other machines become Compute Nodes via a stored procedure (shutdown and restart needed)

Page 21: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

• Polybase Computation Scale-out Group• Head node is the SQL Server instance to which queries are submitted• Compute nodes are used for scale out query processing on external data

SQL 16 SQL

16SQL16

SQL16

Engine

Service

DMS DMS

Head Node

Compute Nodes

DMS

DMS

After Step 3

Page 22: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Head Node is actually also always a Compute Node

Head Node

SQLxx

SQL16

Engine

Service

DMS

Compute N

ode

In Case You Were Wondering…

Page 23: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Step 4 - Choose Hadoop flavor

Supported Hadoop distributions in CTP3• Cloudera CDH 5.1-5.5 on Linux• Hortonworks 2.0 - 2.3 on Linux• Hortonworks 2.0, 2.2, and 2.3 on Windows

Server

What happens under the covers? • The right client jars to connect to Hadoop are

loaded

-- different numbers map to various Hadoop flavors-- example: value 4 stands for HDP 2.0 on Linux or ASB, value 5 for HDP 2.0 on Windows, value 6 for CHD 5.1 on Linux

Page 24: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Hadoop Cluster

#2

CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTERWITH (TYPE= HADOOP, …)

CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTERWITH (TYPE= HADOOP, …)

Hadoop clusters must

both be either

Cloudera or

Hortonworks (for now)

AzureStorageVolume

AzureStorageVolume

AzureStorageVolume

Azure

Step 5: Attach a Hadoop Cluster

Page 25: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

T-SQL queries submitted here

Queries can only refer to tables

here and/or external tables

here

Compute nodes are used for scale out query processing on external tables in

HDFS/Azure

Tables on these 3 compute nodes

cannot be referenced by queries submitted

to head node

Hadoop clusters can be shared between different SQL 16 compute

groups

After Setup…

If failover clustering

is enabled queries

can be submitted to

any compute node

Page 26: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

SQL PDW

SQL 16

Scaleout Storage

Scaleout QP

LanguageSurface

Delivery Vehicle

Relational,HDFS, AzureRelational,

HDFS, AzureSQL DW

T-SQLSubsetT-SQLSubset

Appliance

Cloud(PASS)

Relational, HDFS, AzureRelational, HDFS, Azure

HDFS, Azure

FullT-SQL (RTM)

“Box” &Cloud (IAAS)HDFS, Azure

PDW, SQL DW and SQL 16: Key Differences

Page 27: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Key Technical Challenges

in HDFS (e.g., Text, RC, ORC, Parquet, … )

SupportingArbitrary

File Formats

between compute nodes

and HDFS data nodes

Parallelizing Data

Transfers

in HDFS, using external table

concept

Imposing Structure on Unstructured

Data

of Hadoop clusters

Exploiting Computational Resources

Page 28: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

HDFS Hadoop Cluster

HDFSHDFS

DMSSQL Server DMS SQL ServerHDFS Bridge

HDFS Bridge

Compute Node

Compute Node

HDFS Bridge in PolyBase

DMS

HDFSBridge

(augmented w/)

Hides complexity of HDFS

Uses Hadoop “RecordReaders/Writers” standard HDFS file types can be

read/writtenUsed to transfer data in parallel

to & from Hadoop

Page 29: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Key Technical Challenges

in HDFS (e.g., Text, RC, ORC, Parquet, … )

SupportingArbitrary

File Formats

between compute nodes

and HDFS data nodes

Parallelizing Data

Transfers

in HDFS, using external table

concept

Imposing Structure on Unstructured

Data

of Hadoop clusters

Exploiting Computational Resources

Page 30: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Namenode

(HDFS)

Hadoop Cluster

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

Compute NodesEngine Service

SQL ServerDMS

DB

SQL ServerDMS

DB

SQL ServerDMS

DB

DMSHeadNode

Data moves between clusters in parallel

Page 31: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Key Technical Challenges

in HDFS (e.g., Text, RC, ORC, Parquet, … )

SupportingArbitrary

File Formats

between compute nodes

and HDFS data nodes

Parallelizing Data

Transfers

in HDFS, using external table

concept

Imposing Structure on Unstructured

Data

of Hadoop clusters

Exploiting Computational Resources

Page 32: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Creating External TablesCREATE EXTERNAL DATA SOURCE HadoopCluster WITH (TYPE = Hadoop, LOCATION = 'hdfs://10.193.26.177:8020',

RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050');Once per Hadoop Cluster

CREATE EXTERNAL FILE FORMAT TextFile WITH ( FORMAT_TYPE = DELIMITEDTEXT,

DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',FORMAT_OPTIONS (FIELD_TERMINATOR ='|', USE_TYPE_DEFAULT = TRUE));

Once per File Format

CREATE EXTERNAL TABLE [dbo].[Customer] ( [SensorKey] int NOT NULL, [CustomerKey] int NOT NULL, [Speed] float NOT NULL)WITH (LOCATION='//Sensor_Data//May2014/sensordata.tbl', DATA_SOURCE = HadoopCluster, FILE_FORMAT = TextFile)

HDFS File Path

Page 33: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Creating External Tables (Secure Hadoop)CREATE DATABASE SCOPED CREDENTIAL HadoopCredential

WITH IDENTITY = 'hadoopUserName', Secret = 'hadoopPassword'; Once per Hadoop user

CREATE EXTERNAL FILE FORMAT TextFile WITH ( FORMAT_TYPE = DELIMITEDTEXT,

DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',FORMAT_OPTIONS (FIELD_TERMINATOR ='|', USE_TYPE_DEFAULT = TRUE));

Once per File Format

CREATE EXTERNAL TABLE [dbo].[Customer] ( [SensorKey] int NOT NULL, [CustomerKey] int NOT NULL, [Speed] float NOT NULL)WITH (LOCATION='//Sensor_Data//May2014/sensordata.tbl', DATA_SOURCE = HadoopCluster, FILE_FORMAT = TextFile)

HDFS File Path

CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (TYPE = Hadoop, LOCATION = 'hdfs://10.193.26.177:8020',

RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050', CREDENTIAL = HadoopCredential);

Once per Hadoop user

Page 34: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Selection on external table in HDFS select * from Customer where c_nationkey = 3 and c_acctbal < 0 A possible execution plan:

CREATE temp table

T Execute on compute nodes1

IMPORTFROM HDFS

HDFS Customer file read into T2

EXECUTEQUERY Select * from T where

T.c_nationkey =3 and T.c_acctbal < 03

Polybase Query Example #1

Page 35: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Key Technical Challenges

in HDFS (e.g., Text, RC, ORC, Parquet, … )

SupportingArbitrary

File Formats

between compute nodes

and HDFS data nodes

Parallelizing Data

Transfers

in HDFS, using external table

concept

Imposing Structure on Unstructured

Data

of Hadoop clusters

Exploiting Computational Resources

Page 36: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PolyBase Query Execution

Query Plan

Generator

Query Optimi

zerParserSQL

QueryLogical

operator tree

Physical operator

tree

Query is parsed“External tables” stored on

HDFS are identified

Parallel QO is performed Statistics on HDFS tables are used in the

standard fashion

Engine Service

HDFS

Hadoop

Query plan generator walks optimized query plan

converting subtrees whose inputs are all HDFS files into sequence of

MapReduce jobs

Engine Service submits MapReduce jobs

(as a JAR file) to Hadoop cluster. Leverage computational capabilities of

Hadoop cluster

Hadoop nodes

Page 37: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

HDFS

PDW orSQL 16Hadoop

Results

7

2Map job

5

HDFS

blocks DB

3 4 6

SQL Query

1

MapReduce

Cost-based decision on how much computation

to push

SQL operations on HDFS data

pushed into Hadoop as MapReduce jobs

Big Picture Takeaway

Page 38: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Selection and aggregate on external table in HDFS select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey Execution

plan:

Run MR Job on

HadoopApply filter and compute aggregate on Customer.

1

What really happens here?Step 1) QO compiles predicate into Java and generates a MapReduce (MR) job

PolyBase Query Example #2

Output left in hdfsTemphdfsTemp

<US, $-975.21>

<FRA, $-119.13><UK, $-63.52>

Step 2) QE submits MR job to Hadoop cluster

Page 39: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

IMPORThdfsTEMP Read hdfsTemp into T 3

Execution plan:

CREATE temp table

T On compute nodes2

RETURN OPERATIO

N Select * from T4

Run MR Job on

HadoopApply filter and computes aggregate on Customer. Output left in hdfsTemp

1

1. Predicate and aggregate pushed into Hadoop cluster as a MapReduce job

2. Query optimizer makes a cost-based decision on what operators to push

hdfsTemp<US, $-975.21>

<FRA, $-119.13><UK, $-63.52>

PolyBase Query Example #2Selection and aggregate on HDFS table select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey

Page 40: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

SimplicityQuery data in Hadoop and/or data in SQL Server via standard T-SQL

PolyBase

Highest Possible PerformanceParallelized data transfers between SQL Compute Nodes and Hadoop clusters. Push down of SQL operations to Hadoop

OpenSupports most popular Hadoop distributions for both Linux and Windows

Full Integration with Microsoft Office & BIExcel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services

Page 41: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Query Capabilities (1) Joining relational and external data

SELECT FROM 1. Querying external tables 2. Joining external with regular SQL tables3. Pushing compute for basic expressions and

aggregates

External tables referring to data in two HDP Hadoop

clusters

SQL Table

Page 42: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Query Capabilities (2)Push-Down Computation

Pushing Computeo Either on data source level

or o Per-query basis using query

hints

Page 43: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Query Capabilities (3)Multiple User IDs

Credential support o Credential support for multiple user IDs associated with

external data source

Page 44: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

External Tables ‘just’ show up as regular tables

Excel (e.g. PowerQuery) Tablea

u

Query Capabilities (4)Seamless BI integration

Page 45: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Parallel Import/Load Scenario

SELECT INTO:1. Importing data from Hadoop for higher speed

access2. ‘ETL’ type of processing possible via T-SQL

External table referring to data in HDP Hadoop clusters

new SQL Table created

Page 46: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Export Scenario – Data aging into Hadoop

INSERT INTO 1. Exporting SQL data into Hadoop2. ‘ETL’ type of processing

possible via T-SQL

External table for aging data into Hadoop (as text

files)

Page 47: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

1. Will alwaysOn (Hadron) be supported?2. Can a compute node be used for other SQL

workloads?3. Can a compute node share a machine with a

Hadoop data node?4. What SQL Server editions will I need?5. Will the MapR Hadoop distribution be supported?6. Is there a limit on the number of compute nodes?7. Why is it not possible to support different Hadoop

distros simultaneously?

Some Questions You Might Have…

Page 48: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

what?why?how?

PolyBase

what

’s ne

xt

Page 49: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

1. Each SQL 16 instance can participate in a single PolyBase cluster – A limitation of current DMS software

2. No support for varchar(max) or unique column types

3. No scale out for joins between tables on Head Node and external tables on HDFS (“Local table scaleout”)

CTP3/RTM Limitations

Page 50: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

A Very Short Demo• SQL 16 w. Polybase (4 node SQL Scaleout Group)• 3 node Hortonworks 2.0 Linux cluster• Car telemetry data in Hadoop (CarSensor_Data)• Customer data in SQL 16 (Insured_Customers)

Page 51: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

1. Simplicity - Query data in Hadoop, Azure and SQL Server via standard T-SQL

2. Highest Possible Performance - Parallelized data transfers between SQL 16 instances and Hadoop data nodes. Push down of SQL operations to Hadoop using MR jobs

3. Open - Supports most popular Hadoop distributions for both Linux and Windows

4. Full Integration with Microsoft Office & BI - Excel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services

Wrapup: PolyBase in SQL Server 16

Page 52: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Acknowledgements• The entire PolyBase

development team• Sahaj Saini for the

demo and a few slides• Artin Avanes for his role

as the PolyBase PM since inception

52

Page 53: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

THANK YOU

Page 54: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Probable CTP2 Query Plan:1) Use MR job to find Orders > $1000

Select C.Name from Orders O, Customers C where C.id = O.CustId and O.price > $1000

CTP3/RTM Query Processing Limitation

tmp

3) Perform join on Head Node

2) Import result into SQL 16 instance on Head Node

Page 55: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

CTP3 plan likely to be:1) Use MR job to find Orders > $1000

A Better Query Plan

2) Import result into SQL 16 instances on Compute nodes

4) Perform join in parallel on compute nodes

3) Redistribute Customers table from Head Node to Compute Nodes

Select C.Name from Customers C, Orders O where C.id = O.CustId and P.price > $1000

Page 56: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

The Hadoop EcosystemManagement & Monitoring

(Ambari)

Coor

dina

tion

(Zoo

Keep

er)

Wor

kflow

&Sc

hedu

ling

(Ooz

ie)

Scripting(Pig)

Machine Learning

(Mahout)Query(Hive)

Distributed Processing(MapReduce)

Distributed Storage(HDFS)

NoSQ

L Da

taba

se(H

Base

)

Data

Inte

grat

ion

(Sqo

op/R

EST/

OD

BC)

Page 57: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

SQL Results

DB

PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in-situ

HDFS

PolyBasePolybase

Poly

Base

PolyBase

SQL SERVERPDW V2

Standard T-SQL query language. Eliminates need for writing MapReduce jobs

Leverages PDW’s parallel query execution framework

Data moves in parallel directly between Hadoop’s Data Nodes and PDW’s compute nodes

Exploits PDW’s parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs

SQL Server PDW with PolyBase

All of this is now part of

SQL Server 16

Page 58: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Control Node

PDW EngineService

SQL Server

DMS Service

Compute NodeDMS SQL Server

Compute NodeDMS SQL Server

Compute NodeDMS SQL Server

Compute NodeDMS SQL Server

Compute NodeDMS SQL Server

Client ConnectionsUser Queries

JDBC, OLEDB, ODBC, ADO.NET

• Parse SQL• Validate and authorize• Optimize and build query

plan• Execute parallel query• Return results to client

Data Movement Service (DMS)• Separate process on each

node• Shuffles intermediate tables

among compute nodes during query execution

PDW Software

Page 59: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

PDW SUMMARY Scalable SQL Server DW offering

Highly competitive performance and costCurrently only available in appliance form factorBut, soon available as SQL DW Service in AzureKEY COMPONENTSOne control node Engine Service +

DMS Compiles and

controls execution of SQL queries in parallel

Multiple compute nodes Each with a SQL Server

instance + DMS Execute SQL queries in

parallel

PolyBase for SQL 16 reuses

Engine Service and DMS

components

Page 60: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

SQL16

PolyBase

DLLs

SQL16

PolyBase

DLLs

SQL16

PolyBase

DLLs

SQL16

PolyBase

DLLs

Step 3: Scale OutStep 3: Install one or more SQL Server instances

SQL installer will allow DBA to

specify desire to use PolyBase feature.

If selected, installer will install

PolyBase DLLs (DMS and Engine

Service) on each machine and

register them as Windows Services

Page 61: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Compute NodesEngine Service PDW

ApplianceSQL ServerDMS

DB

SQL ServerDMS

DB

SQL ServerDMS

DB

DMSControlNode

Namenode

(HDFS)DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

DataNode

FileSyste

m

Linux ClusterWindows Cluster

Hadoop cluster can be either on premise or in the cloud

Hadoop cluster can be either:

Many popular HDFS file formats supported (text, RC, ORC, … )

Hadoop Cluster

Hortonworks or Cloudera Hadoop distributions

Typical PolyBase Setup

TextFormat

RCFileFormat

ORCFile Format

ParquetFormat(future)…

Page 62: PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Key Components Reused for SQL Server 16

Engine Service to parse, optimize, & orchestrate parallel execution of queries over relational tables and HDFS data

DMS used to move data between compute nodes and Hadoop data nodes

HDFS Bridge in DMS used to read/write files/directories in HDFS or Azure External Table

Constructused to “surface” records in external tables in HDFS or Azure files to SQL

MapReduce Job Pushdown used by Engine Service to push computation to Hadoop cluster