PolyBase in SQL Server 16
David J. DeWitt, Rimma V. Nehme
Microsoft Jim Gray Systems Lab
A Little Bit of PolyBase History…
- Past (2012–2013): PolyBase in SQL Server PDW V2
- TODAY (2015): PolyBase is now part of SQL 16 (CTP3)
- Future: PolyBase for IoT???
Our Plan for Today…
PolyBase: what? why? how? what's next?

What IS PolyBase?
Big Picture
Provides a scalable, T-SQL compatible query processing framework for combining data from both worlds: RDBMS and Hadoop.
More Specifically
Allows SQL 16 customers to execute T-SQL queries against relational data in SQL Server and “semi-structured” data in HDFS and/or Azure.
[diagram: a SELECT against SQL Server 16 returns results drawn from Hadoop (non-relational data) and Windows Azure Blob Storage (WASB)]

What Does It Mean? PolyBase for SQL Server 16:
- Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel
- Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure (a sketch follows)
- Full T-SQL interface for HDFS-resident data
- All standard BI tools supported
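For instance, a single T-SQL statement can mix the two worlds; a minimal sketch (both table names are hypothetical, with Trips assumed to be an external table over HDFS):

SELECT c.CustomerKey, AVG(t.Speed) AS AvgSpeed
FROM dbo.Customers AS c      -- ordinary relational table in SQL Server
JOIN dbo.Trips AS t          -- external table surfacing HDFS data
  ON c.CustomerKey = t.CustomerKey
GROUP BY c.CustomerKey;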
PolyBase: what? why? how? what's next?
Why All the Interest in Big Data?
- Increased number and variety of data sources that generate large quantities of data: sensors (e.g., location, speed, acceleration rates, acoustical, …), Web 2.0 (e.g., Twitter, wikis, …), and web clicks
- Dramatic decline in the cost of hardware, especially storage: if storage were still $100/GB there would be no big data revolution underway
- Realization that data is “too valuable” to delete
Hadoop Community World View
[diagram: HDFS (Hadoop) at the center – “Big Data goes HERE” – surrounded by the rest of the world]
HDFS:
- Scalable distributed file system
- Underpinnings of the entire Hadoop ecosystem
- Highly fault-tolerant
- Append only – no updates!
Hive, Impala, HAWQ, Spark/Shark: parallel SQL systems for data warehousing on HDFS
Insight
- Two universes of data: structured relational data, and semi-structured data in HDFS for “big data”
- PolyBase goal: provide a scalable, T-SQL compatible query processing framework for combining data from both universes

PolyBase's World View
[diagram: Relational (structured) + HDFS (semi-structured) combine, via PolyBase, into Insight]
“Can I join you?” “You sure CAN!”
PolyBase Use Cases
1. Hadoop as an ETL tool (e.g., cleansing data before loading it)
2. HDFS as “cold”, but query-able, storage
3. Sentiment analysis (e.g., joining relational tables w/ streams of tweets)
4. Sensor data analysis (e.g., mine sensor data for predictive analytics)
Example #1: Progressive Usage-Based Insurance
Price policies based on driver behavior by combining:
- Non-relational sensor data from cars (kept in Hadoop)
- Structured customer data (kept in SQL Server PDW/APS)
Example #2: ShinSeGae (Korea's Amazon)
Basket analysis of online shoppers based on social media behavior by combining:
- Non-relational social media data (kept in Hadoop)
- Structured product data (kept in SQL Server PDW/APS)
All standard BI tools (e.g., Excel, Tableau, PowerView) just work!
Example #3: Wind Turbine Manufacturer
Turbine monitoring: analyzing sensor data from wind turbines deployed worldwide (kept in Hadoop) combined with relational turbine data (kept in SQL Server).
Ability to do change detection, proactive maintenance, and reporting.
PolyBase: what? why? how? what's next?
Step 1: PolyBase for SQL 16 Deployment
Set up a Hadoop cluster (if you don't have one already):
- Hortonworks or Cloudera distributions
- Hadoop 2.0 or above
- Linux or Windows
- On premise or in Azure
Or an Azure Storage account:
- Azure Storage Blobs (ASB) exposes an HDFS layer
- PolyBase reads and writes from ASB using Hadoop RecordReaders/Writers
- No compute push-down support for ASB (a sketch follows)
[diagram: Azure Storage Volumes in Azure]
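For the Azure route, pointing PolyBase at a storage account looks roughly like this (a sketch; the account, container, and credential names are hypothetical, and a database master key must already exist):

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'polybaseuser', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential);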
Step 2: Install One or More SQL Server Instances
- Select the PolyBase feature
- Specify PolyBase service accounts (requires a domain account for scale-out)
- PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services
- Pre-requisite: download and install a JRE
[diagram: a Head Node running SQL 16, the Engine Service, and DMS, plus Compute Nodes each running SQL 16 and DMS]
Step 3: Instantiate a PolyBase Scale-Out Group
1. One machine is used as a Head Node by starting the Engine & DMS services
2. Zero or more other machines become Compute Nodes via a stored procedure (shutdown and restart needed; a sketch follows)
PolyBase computation scale-out group:
- The head node is the SQL Server instance to which queries are submitted
- Compute nodes are used for scale-out query processing on external data
[diagram: Head Node (SQL 16 + Engine Service + DMS) plus Compute Nodes (SQL 16 + DMS each)]
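A sketch of step 2, run on each machine that should become a compute node (the machine and instance names are hypothetical; 16450 is the default DMS control channel port):

-- Join this instance to the scale-out group headed by HEADNODE1
EXEC sp_polybase_join_group N'HEADNODE1', 16450, N'MSSQLSERVER';
-- Then restart the PolyBase Engine and DMS services on this machine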
After Step 3
In case you were wondering… the Head Node is actually also always a Compute Node.
[diagram: the Head Node runs SQL 16, the Engine Service, and DMS, so it doubles as a Compute Node]
Step 4: Choose Hadoop Flavor
Supported Hadoop distributions in CTP3:
- Cloudera CDH 5.1–5.5 on Linux
- Hortonworks 2.0–2.3 on Linux
- Hortonworks 2.0, 2.2, and 2.3 on Windows
What happens under the covers? The right client jars to connect to Hadoop are loaded.
A server configuration value selects the flavor; different numbers map to the various Hadoop flavors. For example, value 4 stands for HDP 2.0 on Linux or ASB, value 5 for HDP 2.0 on Windows, and value 6 for CDH 5.1 on Linux.
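The mapping is driven by sp_configure; for example, to attach to a CDH 5.1 cluster on Linux (value 6 in the mapping above):

EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 6;
RECONFIGURE;
-- A restart of SQL Server (and the PolyBase services) is needed for this to take effect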
Step 5: Attach a Hadoop Cluster

CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, …)    -- Hadoop cluster #1
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, …)  -- Hadoop cluster #2

Hadoop clusters must both be either Cloudera or Hortonworks (for now).
[diagram: two Hadoop clusters and Azure Storage Volumes attached to the SQL 16 scale-out group]
After Setup…
- T-SQL queries are submitted to the head node
- Queries can only refer to tables on the head node and/or external tables
- Compute nodes are used for scale-out query processing on external tables in HDFS/Azure
- Tables on the compute nodes cannot be referenced by queries submitted to the head node
- Hadoop clusters can be shared between different SQL 16 compute groups
- If failover clustering is enabled, queries can be submitted to any compute node
PDW, SQL DW and SQL 16: Key Differences

                      SQL PDW                   SQL DW                    SQL 16
Scale-out storage     Relational, HDFS, Azure   Relational, HDFS, Azure   HDFS, Azure
Scale-out QP          Relational, HDFS, Azure   Relational, HDFS, Azure   HDFS, Azure
Language surface      T-SQL subset              T-SQL subset              Full T-SQL (RTM)
Delivery vehicle      Appliance                 Cloud (PaaS)              “Box” & cloud (IaaS)
Key Technical Challenges
1. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …)
2. Parallelizing data transfers between compute nodes and HDFS data nodes
3. Imposing structure on unstructured data in HDFS, using the external table concept
4. Exploiting the computational resources of Hadoop clusters
HDFS Bridge in PolyBase
DMS is augmented with an HDFS Bridge on each compute node:
- Hides the complexity of HDFS
- Uses Hadoop RecordReaders/Writers, so standard HDFS file types can be read/written
- Used to transfer data in parallel to & from Hadoop
[diagram: each Compute Node runs SQL Server plus DMS with an HDFS Bridge, connected to HDFS on the Hadoop cluster]
Key Technical Challenges (recap) – next: parallelizing data transfers between compute nodes and HDFS data nodes
[diagram: the Engine Service and DMS on the Head Node plus the Compute Nodes (each SQL Server + DMS + DB) connect directly to the Hadoop cluster's DataNodes and their file systems; the Namenode (HDFS) coordinates]
Data moves between the two clusters in parallel.
Key Technical Challenges (recap) – next: imposing structure on data in HDFS, using the external table concept
Creating External Tables

-- Once per Hadoop cluster
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.193.26.177:8020',
    RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050');

-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

-- Once per external table; LOCATION is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
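Once created, the external table behaves like any other table in T-SQL; for instance (the predicate is purely illustrative):

SELECT TOP 10 * FROM [dbo].[Customer] WHERE [Speed] > 65;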
Creating External Tables (Secure Hadoop)

-- Once per Hadoop user
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';

-- The data source now references the credential
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.193.26.177:8020',
    RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
    CREDENTIAL = HadoopCredential);

The file format and external table definitions are unchanged from the previous slide.
PolyBase Query Example #1
Selection on an external table in HDFS:

select * from Customer where c_nationkey = 3 and c_acctbal < 0

A possible execution plan:
1. CREATE temp table T – execute on the compute nodes
2. IMPORT FROM HDFS – the HDFS Customer file is read into T
3. EXECUTE QUERY – Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0
Key Technical Challenges (recap) – next: exploiting the computational resources of Hadoop clusters
PolyBase Query Execution
SQL query → Parser (logical operator tree) → Query Optimizer (physical operator tree) → Query Plan Generator → Engine Service → Hadoop/HDFS
- The query is parsed, and “external tables” stored in HDFS are identified
- Parallel query optimization is performed; statistics on HDFS tables are used in the standard fashion (a sketch follows)
- The query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs
- The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging the computational capabilities of the Hadoop cluster
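Since the optimizer consumes statistics on external tables just as it does for regular tables, creating them explicitly can improve plan quality; a minimal sketch on the Customer external table defined earlier (the statistics name is ours):

CREATE STATISTICS CustomerSpeedStats ON [dbo].[Customer] ([Speed]) WITH FULLSCAN;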
Big Picture Takeaway
[diagram: (1) a SQL query arrives at PDW or SQL 16; (2) map jobs are pushed to the Hadoop nodes; (3–6) HDFS blocks are scanned and intermediate results exchanged with the DB; (7) results are returned]
- Cost-based decision on how much computation to push
- SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs
PolyBase Query Example #2
Selection and aggregate on an external table in HDFS:

select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey

Execution plan:
1. Run MR job on Hadoop – apply the filter and compute the aggregate on Customer; output is left in hdfsTemp
   What really happens here? Step 1) the query optimizer compiles the predicate into Java and generates a MapReduce (MR) job; Step 2) the query engine submits the MR job to the Hadoop cluster
   hdfsTemp: <US, $-975.21>, <FRA, $-119.13>, <UK, $-63.52>
2. CREATE temp table T on the compute nodes
3. IMPORT hdfsTemp – read hdfsTemp into T
4. RETURN OPERATION – Select * from T

Takeaways:
1. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job
2. The query optimizer makes a cost-based decision on which operators to push
PolyBase
- Simplicity: query data in Hadoop and/or data in SQL Server via standard T-SQL
- Highest possible performance: parallelized data transfers between SQL compute nodes and Hadoop clusters; push-down of SQL operations to Hadoop
- Open: supports the most popular Hadoop distributions, for both Linux and Windows
- Full integration with Microsoft Office & BI: Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services
Query Capabilities (1): Joining Relational and External Data
SELECT … FROM …:
1. Querying external tables
2. Joining external tables with regular SQL tables
3. Pushing compute for basic expressions and aggregates
[diagram: external tables referring to data in two HDP Hadoop clusters, joined with a SQL table]
Query Capabilities (2): Push-Down Computation
Pushing compute:
- Either at the data source level, or
- On a per-query basis using query hints (a sketch follows)
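A sketch of the per-query hints, reusing the earlier aggregate query (FORCE and DISABLE EXTERNALPUSHDOWN are the hints SQL 16 exposes for this):

SELECT avg(c_acctbal)
FROM Customer
WHERE c_acctbal < 0
GROUP BY c_nationkey
OPTION (FORCE EXTERNALPUSHDOWN);    -- or OPTION (DISABLE EXTERNALPUSHDOWN)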
Query Capabilities (3): Multiple User IDs
Credential support for multiple user IDs associated with an external data source
Query Capabilities (4): Seamless BI Integration
External tables ‘just’ show up as regular tables in Excel (e.g., PowerQuery), Tableau, and other BI tools.
Parallel Import/Load Scenario
SELECT INTO:
1. Importing data from Hadoop for higher-speed access
2. ‘ETL’ type of processing possible via T-SQL
An external table referring to data in HDP Hadoop clusters is read, and a new SQL table is created (a sketch follows).
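A sketch of the import pattern (the target table name is hypothetical; Customer is the external table defined earlier):

SELECT *
INTO dbo.Customer_Local    -- new SQL table, created on the fly
FROM dbo.Customer          -- external table over HDFS
WHERE Speed > 65;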
Export Scenario – Data Aging into Hadoop
INSERT INTO:
1. Exporting SQL data into Hadoop
2. ‘ETL’ type of processing possible via T-SQL
An external table is used for aging data into Hadoop (as text files); a sketch follows.
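A sketch of the export pattern, assuming the 'allow polybase export' setting and an external table (Customer_Archive here, hypothetical) created over the target HDFS directory:

EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

INSERT INTO dbo.Customer_Archive    -- rows land in Hadoop as text files
SELECT SensorKey, CustomerKey, Speed
FROM dbo.Customer_Local;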
Some Questions You Might Have…
1. Will AlwaysOn (Hadron) be supported?
2. Can a compute node be used for other SQL workloads?
3. Can a compute node share a machine with a Hadoop data node?
4. What SQL Server editions will I need?
5. Will the MapR Hadoop distribution be supported?
6. Is there a limit on the number of compute nodes?
7. Why is it not possible to support different Hadoop distros simultaneously?
PolyBase: what? why? how? what's next?
CTP3/RTM Limitations
1. Each SQL 16 instance can participate in a single PolyBase cluster – a limitation of the current DMS software
2. No support for varchar(max) or unique column types
3. No scale-out for joins between tables on the Head Node and external tables in HDFS (“local table scale-out”)
A Very Short Demo
- SQL 16 with PolyBase (4-node SQL scale-out group)
- 3-node Hortonworks 2.0 Linux cluster
- Car telemetry data in Hadoop (CarSensor_Data)
- Customer data in SQL 16 (Insured_Customers)
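The demo query is along these lines (a sketch; the table names come from the slide, while the column names and predicate are hypothetical):

SELECT ic.FirstName, ic.LastName, cs.Speed
FROM Insured_Customers AS ic    -- relational data in SQL 16
JOIN CarSensor_Data AS cs       -- external table over the Hadoop cluster
  ON ic.CustomerKey = cs.CustomerKey
WHERE cs.Speed > 35;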
Wrapup: PolyBase in SQL Server 16
1. Simplicity – query data in Hadoop, Azure, and SQL Server via standard T-SQL
2. Highest possible performance – parallelized data transfers between SQL 16 instances and Hadoop data nodes; push-down of SQL operations to Hadoop using MR jobs
3. Open – supports the most popular Hadoop distributions, for both Linux and Windows
4. Full integration with Microsoft Office & BI – Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services
Acknowledgements
- The entire PolyBase development team
- Sahaj Saini for the demo and a few slides
- Artin Avanes for his role as the PolyBase PM since inception

THANK YOU
CTP3/RTM Query Processing Limitation

Select C.Name from Orders O, Customers C where C.id = O.CustId and O.price > $1000

The CTP3 plan is likely to be:
1) Use an MR job to find Orders > $1000
2) Import the result (tmp) into the SQL 16 instance on the Head Node
3) Perform the join on the Head Node

A Better Query Plan
1) Use an MR job to find Orders > $1000
2) Import the result into the SQL 16 instances on the Compute Nodes
3) Redistribute the Customers table from the Head Node to the Compute Nodes
4) Perform the join in parallel on the Compute Nodes
The Hadoop Ecosystem
- Management & Monitoring (Ambari)
- Coordination (ZooKeeper)
- Workflow & Scheduling (Oozie)
- Scripting (Pig)
- Machine Learning (Mahout)
- Query (Hive)
- Distributed Processing (MapReduce)
- Distributed Storage (HDFS)
- NoSQL Database (HBase)
- Data Integration (Sqoop/REST/ODBC)
PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in situ
[diagram: SQL queries against SQL Server PDW V2 return results combining the DB with HDFS data via PolyBase]
- Standard T-SQL query language; eliminates the need for writing MapReduce jobs
- Leverages PDW's parallel query execution framework
- Data moves in parallel directly between Hadoop's data nodes and PDW's compute nodes
- Exploits PDW's parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs
SQL Server PDW with PolyBase – all of this is now part of SQL Server 16
PDW Software
Control Node: PDW Engine Service + SQL Server + DMS Service
Compute Nodes: each with DMS + SQL Server
Client connections and user queries arrive via JDBC, OLEDB, ODBC, ADO.NET
Engine Service:
- Parse SQL
- Validate and authorize
- Optimize and build the query plan
- Execute the parallel query
- Return results to the client
Data Movement Service (DMS):
- Separate process on each node
- Shuffles intermediate tables among compute nodes during query execution
PDW Summary
- Scalable SQL Server DW offering
- Highly competitive performance and cost
- Currently only available in appliance form factor, but soon available as the SQL DW Service in Azure
Key components:
- One control node (Engine Service + DMS): compiles and controls execution of SQL queries in parallel
- Multiple compute nodes, each with a SQL Server instance + DMS: execute SQL queries in parallel
PolyBase for SQL 16 reuses the Engine Service and DMS components.
Step 3: Scale Out – Install one or more SQL Server instances
- The SQL installer allows the DBA to specify the desire to use the PolyBase feature
- If selected, the installer installs the PolyBase DLLs (DMS and Engine Service) on each machine and registers them as Windows Services
[diagram: each machine runs SQL16 plus the PolyBase DLLs]
Typical PolyBase Setup
[diagram: PDW appliance – Control Node (Engine Service + DMS) and Compute Nodes (each SQL Server + DMS + DB) – connected to a Hadoop cluster: a Namenode (HDFS) plus DataNodes with their file systems]
- The Hadoop cluster can be either on premise or in the cloud
- The Hadoop cluster can be either a Linux cluster or a Windows cluster
- Hortonworks or Cloudera Hadoop distributions
- Many popular HDFS file formats supported: Text format, RCFile format, ORCFile format, Parquet format (future), …
Key Components Reused for SQL Server 16
- Engine Service: parses, optimizes, & orchestrates parallel execution of queries over relational tables and HDFS data
- DMS: used to move data between compute nodes and Hadoop data nodes
- HDFS Bridge in DMS: used to read/write files/directories in HDFS or Azure
- External table construct: used to “surface” records in HDFS or Azure files to SQL
- MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster