4 new insights through big data new world of big data & dw – yet another ‘hype’? 5 …...
Post on 23-Dec-2015
215 Views
Preview:
TRANSCRIPT
Polybase in the Modern Data Warehouse
Lionel Pénuchot
DBI-B337
I. New Insights through ‘Big Data’o Traditional vs. Modern Data Warehousing o New Choices & Data Platforms - When to use what?
II. PolyBase in Microsoft Analytics System (APS) o Overview about Microsoft’s APSo PolyBase’s Visiono PolyBase’s building blocks & functionality
III. Real-world customer production use cases with PolyBase
o Use Case Category I – Integration with external Hadoop o Use Case Category II – Integration with Microsoft Azure (Hybrid)o Use Case Category II – Integrated Appliance with PDW & HDInsight Region
Agenda
4
New Insights through Big Data
New World of Big Data & DW – Yet another ‘Hype’?
5
… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.
– Gartner, “The State of Data Warehousing in 2012”
Data sources
ETL
Data warehouse
BI and analytics
Impact on traditional DW
6
Data sources
ETL
Data warehouse
BI and analytics
Increasing data volumes
1
Near real-time
insights
3
Non-relational data
New data sources and types
2
Cloud-born data
4
DATA ENRICHMENT AND FEDERATED QUERY
Extract, transform, load
Single query model Data quality Master data
management
INFRASTRUCTURE
DATA MANAGEMENT AND PROCESSING
Non-relationalRelational Analytical Streaming Internal and external
Data sources Non-relational data
BI AND ANALYTICS
Self-service CollaborationCorporate PredictiveMobile
The modern Data Warehouse
When to use what?
Big Data Deployments (source: Gartner April 2014)
57%
14%
15%
10%4%
Gartner Survey April 2014
Piloting on premise Piloting in the cloudProduction on-premise with clus-terProduction on premise with appliance Production in cloud
2014 – 5% think Hadoop will replace existing DW solution (2013: 10%)
↓ 50% decline even with the presence of Hadoop 2.0
Big Data Applications (source: Gartner April 2014)
Interactive Analytics DBMS Stream Processing Search Graph applications0%
10%
20%
30%
40%
50%
60%
→ primarily includes SQL over HDFS (!)
Challenges for a modern data warehouse
Keep existing & legacy investment
Limitedscalability and
ability to handle new data types
Acquire Big Data solution
Significant training and data
silos
Buy new tier-one hardware appliance
High acquisition and migration
costs
Acquire business intelligence & tools
Complex with low adoption
12
Microsoft’s Analytics Platform Systems (APS)
Introducing the Microsoft Analytics Platform SystemThe turnkey modern data warehouse appliance
Next-generation performance at scale
Enterprise-ready Big Data
Engineered foroptimal value
• Relational and non-relational data in a single appliance
• Enterprise-ready Hadoop aka HDInsight Region (HDI)
• Integrated querying across HDI and PDW via PolyBase
• Integration with Microsoft BI/Excel & 3rd party BI tools
• Near real-time performance with In-Memory Columnstore
• Ability to scale out to accommodate growing data
• Removal of data warehouse bottlenecks with MPP SQL Server
• Concurrency that fuels rapid adoption
• Industry’s lowest data warehouse appliance price per terabyte
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
Core APS Features MPP and In-Memory Columnstore for best-in-class performance
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
• Updateable and clustered for real-time trickle loading
Up to 100x faster queries
Updateable clustered columnstore vs. table with customary indexing
Up to 15xmore compression
Columnstore index representation
C1
C3
C5
C4
C2
C6
Parallel query execution
Query
Results
What’s HDInsight in the appliance? o Hadoop bits run on same HW as PDW regiono Separate appliance region – interconnected
via Infiniband to PDW region o Hortonworks HDP 2.0 bits on Windows
Server – including YARN execution framework o PolyBase certified with HDInsight on-premise
Core APS Features HDInsight Region
15
1. Querying all of data sources through APS/PDWo allowing to query relational &
non-relational stores (on-premise) with single query interface
o On-premise data sources include not only Hadoop, but also RDBMSs or document-oriented stores*
2. APS/PDW as cloud-attach for Azureo allowing to scale storage &
leverage cloud compute (SQL DB, Azure HDI, Azure DocDB, Azure ML, Azure Search)
o Trickle effect (data migration from on-premise to Azure)
IBM DB2
Oracle
SQL Server
Your Apps
APS with PolyBase
On-premise
Hadoop
1.
Azure Storage
Microsoft Azure
cloud serviceHDInsight (Hadoop)
Azure SQL Database
Your Apps
2.
Core APS Features PolyBase for the (Big) Data Continuum
17
PolyBase’s Principles
Split-based query
processing
Best-In-Class Performance
Supporting various Hadoop
distributions
Openness & Flexibility
Integration with Microsoft
Azure
Hybrid
Seamless with Microsoft BI & Third-Party
BI tools
Existing skill sets & tools
Mature T-SQL language surface
Preserving T-SQL
Semantics
APS Overview
Demo
18
19
PolyBase – Building Blocks &
Functionality
Microsoft APS w/
PolyBase
APS control & data nodes
Social
Apps
Sensor
&RFID
Mobile
Apps
WebApps
PolyBase – External Tables, Data Sources & File Formats
Data Scientists, BI Users, DB Admins
Your Apps
PowerPivot PowerView
PolyBase Split-Based Query
Processing
External Table
External Data
Source
External File Format
Hadoop
Relational DW
Split-based Query Processing
Your Apps
APS with PolyBase
On-premise
Hadoop
Indefinite Azure Storage
Microsoft Azure
Azure Express Route
Key principle - Leverage computation power of external Hadoop clusters
o Transparent pushing of computation into Hadoop on-the-fly
o No map/reduce or Hadoop skills needed o Reducing data volume to be transferred
Generalizing key principle in future releaseso Apply for all external data sourceso Azure for indefinite storage + as appliance
extension by making use of Azure compute o Leveraging compute of other data sources -
relational data sources or NoSQL platforms
Extensible approach o Ideas to plug-in other push-down engines in
futureo Further performance improvements in near
future
PolyBase Query Use Cases PolyBase query scenarios (SELECT)
1. Run T-SQL over HDFS/Hadoop data2. Combine data from different Hadoop
clusters (e.g. production & dev/test) via PolyBase
3. Join relational data with non-relational data in HDFS/Hadoop
Customer Value o Ease-of-use & Improved Time-To-
Insights― Build the data lake w/o heavily investing
into new resources, i.e. Java & map/reduce experts
― Leverage familiar & mature T-SQL scriptsand constructs
― Seamless tool integration w/ PolyBase
HadoopMicrosoft APS w/
PolyBase
Your Apps
T-SQL
HadoopDev/Test
PolyBase Import & Export Use Cases
PolyBase ETL scenarios (CTAS/CETAS)o Storing subset of Hadoop in PDW for frequent
access, i.e. in SQL Server’s columnar format (→ updatable CCI)
o Enabling data aging scenarios to more economic storage (→ ‘data lake’)
Customer Value ― Avoids the need of maintaining a separate
import or export utility― Allows building multi-temperature DW
platformso PDW/APS acts as hot query engine processing
most recent/relevant data sets o Aged data immediately accessible via external
tableso No need for deleting any data anymore
HadoopMicrosoft APS w/
PolyBase
Your Apps
T-SQL
PolyBase for Hybrid Use Cases (Azure Integration)
APS w/ PolyBase
Your Apps
T-SQL
Azure Storage
cloud serviceHDInsight (Hadoop)
Azure SQL Database
Your Apps
APS as Azure-attach o Azure as cheap storage for keeping your data
forever o Mesh-up on-premise and cloud apps on your own
termso PolyBase as bridge between on-premise and
Azure
Customer Value o Indefinite storage and compute
― Azure as extension for your on-premise data assetso Cloud transition on your own terms
― Move only subsets of on-premise data, e.g. non-sensitive data
o Leverage new Azure data services ― Reduced capex & availability of new emerging data
services in Azure for on-premise focused users
PolyBase for Round-Trip Scenarios
APS w/ PolyBase
Your Apps
T-SQL
Azure Storage
cloud serviceHDInsight (Hadoop)
Azure SQL Database
Your Apps
HadoopDev/Test
PolyBase RoundTrip Scenario (CETAS)o Processing non-relational data with relational data
and exporting it to your Big Data platform of choice
Customer Value ― Total freedom & flexibility o User can decide where to store the results of querying
different data sets
― Simplicity o Example: 1 T-SQL statement for
a) querying different Hadoop data sets, b) combining with relational data and c) storing results in Azure or in a different
Hadoop cluster
Polybase & Hadoop
Demo
26
27
Real-world customer production use cases with
PolyBase
Listening to SQL customers – ‘Web 2.0’
‘We love Open-Source, but we love SQL Server Analysis Services (even more)’ o SQL Server SMP & Hadoop cluster
Pain Points with current solution1. ‘We are currently using a mix of SQOOP & BCP.’2. The integration with SQL Server should be
better, simpler & as fast as possible’ !s3. ‘Take a definition of an aggregate in Hive and
map it through SQL DDL’! 4. ‘Loading data from Hadoop to SQL must be
easier’. 5. ‘Hadoop is hard to deploy and maintain'.
insights
Use Case Category 1 – Integration with external Hadoop
clusters
Listening to SQL customers – Car Insurance
‘Pay-as-you-drive’ - Driver Discount & Policy adjustment o SQL Server 2012 PDW & HDP 2.0/2.1 on Linux
What they want to accomplish1. ‘We have increasing set of non-relational data – sensor
data & web data which we want to store in Hadoop.’2. ‘We want to get answers to our complex questions
fast’. 3. ‘We don’t want to learn a new tool or programming
language or paradigm to get the answers’.4. ‘We know how to use Microsoft tools and services’. 5. ‘We don’t want to maintain a complex infrastructure –
we trust Microsoft & SQL Server to help us’.
Listening to SQL customers – ShinSeGae
Investing into Online Shopping website (‘Korea’s Amazon’) o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux
What they want to accomplish1. ‘We want perform complex data mining on
customer purchase data – basket analysis’. 2. ‘We want to understand the social media data
(reviews/Twitter) – specifically around our products & stores’.
3. ‘We will use Hadoop to keep all of our data ~ envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot & more recent data’.
4. ‘PDW & PolyBase are much faster than Hive’. 5. ‘We’re interested in using data mining cloud
services in Azure (hybrid scenarios)’
HDP on Linux
APS/PDW
EDW
Analytic information(right customer
targeting)
Campaign
Recommendation engine & personalized advertising
Online Shopping
MallSSG.com(renewal)
Recent/hot data stored in
PDW
Solution Architecture (Details) – ShinSeGae
PolyBase
Queries
raw/cold/warm data
Complex Event Processing (Storm)Message Queues
(KAFKA, Open source)
Tracking Log Servers
Web log data(160GB/daily) – External Tables A, B, C1.
Unstructured/semi-structured text data - External Tables D, E, F Text (Board/SNS/
Internal Text )
Weather..
2.
Company emails – External Tables G, H, I
3.
Mails
EIS
OLAP(Tabular)
DATA Mining
Visualization
(Silverlight)
BI analyst
Operational Data Store
10 GB Ethernet
Hybrid scenarios - ShinSeGae
What they want to accomplisho On-premise: 15-20 HDP Hadoop cluster + 1
Rack PDWo How they want using Azure:
1. Cost-optimal backup for Hadoop data [WASB]
2. Scale-out environment to spin-up more compute on demand at busy times
3. Platform to leverage new data services, in particular Azure HDI and Azure Machine Learning
o Daily - 50 GB compressed data into Azure o Either planning to use custom API or
PolyBase CETASo Planning to use PolyBase to combine data from
on-premise and Azure
APS with scale-out PDW region
Your Apps
T-SQL
Azure Storage
cloud serviceHDInsight (Hadoop)
Azure SQL Database
Your Apps
Listening to SQL customers – TeleCom
‘Understanding network quality’ o SQL Server 2012 PDW & Hadoop on Linux
What they want to accomplish1. ‘We collect millions of network records for
quality assessment and capacity planning – on a daily basis’.
2. ‘Hadoop will be used for storage and ETL of these network record files’.
3. ‘PDW for more operational analysis, ad-hoc analysis, operational reports’.
4. ‘We are using Polybase along with Oozie-based orchestration for a seamless & automated integration’.
Hadoop cluster (18+
servers)
APS/PDW
EDW
Hot operational PDW data
Solution Architecture (Details) – Telcom
PolyBase
Queries
raw/cold data (Petabyte of network
logs)
High-frequency Event
Processing (Network logs)
Capturing Network logs (>300 GB/per day) – External Tables A,
B, C
Network quality
analysis
Capacity Planning
Visualization
(PowerPivot)
BI analyst/Planner/Decision-maker
Operational Data Store
Infiniband
Oozie Work-flows
Remote procedure calls via stored procedures to trigger PolyBase queries
HCatalog
Usage of Hive’s Metadata stores
Listening to SQL customers – Oil & Gas
‘Analyzing oil drilling rigs’o SQL Server 2012 PDW & external Hadoop
What they want to accomplish1. ‘Each drilling rig is equipped with sensor devices
and we want to store the incoming sensor data into Hadoop’.
2. ‘We want to use PolyBase to access these sensor data’.
3. ‘We want to monitor our rigs in near real-time’.4. ‘We want to conduct predictive analysis – which rigs
are running poorly, which rigs show similar symptoms’.
Use Case Category 2 – Integration with Microsoft Azure
Listening to SQL customers – Government
‘Bridging the gap between cloud & on-prem’ o SQL Server 2012 PDW & Azure HDInsight
What they want to accomplish1. ‘HDInsight/Hadoop in the cloud to store and massage
our raw data (XML files) generated by our web-application’.
2. ‘PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes’.
3. ‘PolyBase is a great way of accessing our files in the cloud via simple T-SQL.’.
4. ‘With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users’.
Microsoft BI stack
IBM Cognos
Solution Architecture (Details) – Government
PolyBase
Queries APS/PDW
EDW
PDW/APS for fast query response & data
processing of hot data
Operational Data Store
Public Internet or Azure Express
Route
Transforming to large text files ~ 10 GBs each
(External Tables)
HDI on Azure
cheap data store – alternative to Hadoop on-
prem solution
Azure Blob Storage
Web Application for Tax Filing (e-
invoice)
Web apps- Generating tons of
smaller XML files (~7KB each)
Other Web Feeds
HDI tools for data transformation
Use Case Category 3 – Unified Appliance with PDW and
HDInsight region
Listening to SQL customers – Beverage & Vending Machines
‘What are you drinking? Why is the machine down’? o SQL Server/APS with PDW & HDI region
What they want to accomplish1. ‘We want a complete solution stack – we
do not have Hadoop experts in-house and don’t have the money to get it’.
2. ‘We want to store all raw data coming from vending machines into Hadoop’.
3. ‘360 degree of all our data – structured customer data & unstructured data coming from vending machines’.
4. ‘Predicate maintenance of machines’.
Listening to SQL customers - Microsoft DDSG
What they want to accomplishoOffering data management services to other
Microsoft groupsoCookie scoring for msn.com o 3TB of data for each analysiso Added PolyBase for Reporting & PowerQuery o Previously – o Analysis solely done via HiveQLo Results are writing back in SQL Server for reportingo PowerQuery via HiveODBC driver
‘Data Management Services for Microsoft Teams’o APS with PDW & HDInsight region
Solution Architecture (Details) – Internal Microsoft Data Scientist
PowerQuery/PowerView/PowerMap
Data scientist group 2 - Using Polybase for existing tooling (T-SQL, BI
tools) performing processing of complex analytical queries & consistent
management experience
Full Rack PDW
Polybase Queries
PDW regionHDI region
1 scale unit HDI regionS
ecu
re G
ate
way &
A
D In
teg
rati
onmsn.com – Log
files
Analyzing ~3 TB Web Traffic
Microsoft servers – Log
files
Data scientist group 1 - using chaining of Hive queries & PowerQuery via HiveODBC
Hive & PowerQuery via Hive ODBC
Analytical queries via SSDT
APS with PDW & HDI region
System Center & Admin-Console
Infiniband
Polybase & Azure
Demo
44
27 Hands on Labs + 8 Instructor Led Labs in Hall 7
DBI Track resources
Free SQL Server 2014 Technical Overview e-book
microsoft.com/sqlserver and Amazon Kindle StoreFree online training at Microsoft Virtual Academy
microsoftvirtualacademy.com Try new Azure data services previews!Azure Machine Learning, DocumentDB, and Stream Analytics
DBI-B310 Microsoft Analytics Platform System OverviewDBI-B331 Microsoft Analytics Platform System Deep Dive
Resources
Resources
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
Developer Network
http://developer.microsoft.com
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
TechEd Mobile app for session evaluations is currently offline
SUBMIT YOUR TECHED EVALUATIONSFill out an evaluation via
CommNet Station/PC: Schedule Builder
LogIn: europe.msteched.com/catalog
We value your feedback!
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
top related