Microsoft Data Integration Pipelines: Azure Data Factory and SSIS

Upload: mark-kromer

Post on 29-Jan-2018


TRANSCRIPT

Automate Pipelines: Tools for Building ETL Pipelines

Mark Kromer

Microsoft Sr. Program Manager

Information Management

@kromerbigdata

ETL Pipeline Objectives

• Consume hybrid, disparate data (Extract)
  • On-premises + cloud
  • Structured, unstructured, semi-structured
• Calculate and format data for analytics (Transform)
  • Transform, aggregate, join, normalize
• Address large-scale Big Data requirements (Scale / Load)
  • Scale-up or scale-out data movement and transformation
• Operationalize (Automate)
  • Create pipeline orchestrations for different org requirements
  • Manage & monitor multiple pipelines

Hybrid Data Integration Pattern 1: Analyze Blog Comments

[Diagram: blog comments are captured via API and dropped into Blob Store; Azure Data Factory (PaaS) performs the copy & lookup into Azure SQL Database, with transformation via SPROC (ELT); a Data Management Gateway is required for ADF to reach the on-premises SQL Server, where SSIS (self-hosted) transforms via Data Flow (ETL); a Power BI dashboard is used to visualize and analyze the results.]

Hybrid Data Integration Pattern 2: Sentiment Analysis with Machine Learning

[Diagram: Azure Data Factory orchestrates Blob Storage and Azure Functions, with results surfaced in Power BI.]

Hybrid Data Integration Pattern 3: Modern Data Warehouse

[Diagram: daily flat files, OLTP DB tables, customer call details, and social media data (un/semi-structured) are moved by Azure Data Factory (PaaS) and SSIS (self-hosted) into analytical schemas and an AML churn model.]

SQL Server Integration Services (SSIS)

SSIS is a platform for building enterprise-grade data integration solutions:

• User-friendly, code-free authoring / management client tools: SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS)
• A wealth of connectors + rich transformations to Extract, Transform, and Load (ETL) data between various sources and destinations, on premises and in the cloud
• Low Total Cost of Ownership (TCO)

Azure Feature Pack

Provides Azure connectivity components for SSIS

1. Move / transfer hybrid data between various sources and destinations, on premises and in Azure

2. Develop ELT workflows with Big Data transformations / processing in Azure

3. Combine traditional ETL and modern ELT workflows spanning on-premises and Azure

SSIS Azure Feature Pack Features

The SSIS Azure Feature Pack contains:

1. Connection Managers
   1. Azure Subscription Connection Manager
   2. Azure Storage Connection Manager
   3. ADLS Connection Manager (NEW)
2. Control Flows / Tasks
   1. Azure Blob Upload / Download Tasks
   2. Azure HDInsight Hive / Pig Tasks
   3. Azure HDInsight Create / Delete Cluster Tasks
   4. Azure SQL DW Upload Task (NEW)
3. Data Flows
   1. Azure Blob Source / Destination
   2. ADLS Source / Destination (NEW)
4. Azure Blob Enumerator

SQL DW + SSIS – Quick Intro

SQL DW is Microsoft's scale-out database in the cloud. Built on a Massively Parallel Processing (MPP) architecture, it is capable of processing huge volumes of relational and non-relational data. It divides data and processing capability across multiple nodes: a Control Node receives, optimizes, and distributes requests to Compute Nodes that work in parallel.

There are two ways to load data into SQL DW using SSIS:

• Front-loading through the Control Node with data flows
• Back-loading through the Compute Nodes with PolyBase
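The difference between the two paths can be sketched as the T-SQL each one ultimately executes against SQL DW. The table, column, and external-table names below are hypothetical illustrations, not anything SSIS emits verbatim:

```python
# Contrast of the two SSIS -> SQL DW load paths, expressed as the T-SQL
# each one ultimately runs. All object names are hypothetical.

def front_load_sql(table: str) -> str:
    """Front-loading: the SSIS data flow streams rows through the
    Control Node as ordinary parameterized INSERTs."""
    return f"INSERT INTO {table} (CustomerId, CallMinutes) VALUES (?, ?);"

def back_load_sql(table: str) -> str:
    """Back-loading: PolyBase lets the Compute Nodes read staged blobs
    in parallel; only the final INSERT...SELECT passes through SQL."""
    return (f"INSERT INTO {table} SELECT * FROM {table}_ext;"
            f"  -- {table}_ext is a PolyBase external table over blob storage")

print(front_load_sql("dbo.CallDetails"))
print(back_load_sql("dbo.CallDetails"))
```

Front-loading bottlenecks on the single Control Node; back-loading is why the Azure SQL DW Upload Task stages data in blob storage first.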

Azure SQL DW Upload Task – Typical scenario

[Diagram: on-premises SQL Server → flat file on the SSIS machine → Azure Blob Storage → SQL DW in the cloud.]

0. Export to a flat file

1. Upload to Azure Blob

2. Create an external table

3. Trigger PolyBase to load data

The Azure SQL DW Upload Task automates steps 1–3 above.

Azure SQL DW Upload Task

On the Azure SQL DW Upload Task Editor, you can:

1. Name and describe a create / insert table task
2. Select and configure UTF-8-encoded text file(s) as your data source
3. Select and configure an Azure Storage Connection Manager + a new or existing blob container as your data staging area
4. Select and configure an ADO.NET Connection Manager for SQL DW + a new or existing table as your data destination
5. Map source and destination columns for the create / insert table task
6. Define metadata / data types for source columns

Azure SQL DW Upload Task

Based on your configuration in the Azure SQL DW Upload Task Editor, a T-SQL script that triggers PolyBase to load data from your Azure Blob Storage into SQL DW is generated automatically.

You can still edit this auto-generated T-SQL script manually to customize it for your particular needs.
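To illustrate the kind of script the task generates (the real auto-generated script will differ in names and options), here is a minimal PolyBase load sequence, held as a Python string so it can be inspected or edited before execution. Data source, file format, and table names are hypothetical:

```python
# A minimal sketch of a PolyBase load script of the kind the Azure SQL DW
# Upload Task generates. Storage account, credential, table, and format
# names are hypothetical; the real generated script will differ.

POLYBASE_SCRIPT = """
CREATE EXTERNAL DATA SOURCE AzureBlobStage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://staging@myaccount.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);

CREATE EXTERNAL FILE FORMAT Utf8CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE));

CREATE EXTERNAL TABLE dbo.CallDetails_ext
    (CustomerId INT, CallMinutes INT)
WITH (LOCATION = '/calls/',
      DATA_SOURCE = AzureBlobStage,
      FILE_FORMAT = Utf8CsvFormat);

-- PolyBase does the heavy lifting here: Compute Nodes read the blobs in parallel.
INSERT INTO dbo.CallDetails
SELECT CustomerId, CallMinutes FROM dbo.CallDetails_ext;
"""

# The Upload Task exposes its generated script for manual editing like this
# string: you can tweak it before it runs.
print(POLYBASE_SCRIPT)
```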

RAW DATA, DATA CHAOS → REFINED, ORGANIZED DATA → DATA CLARITY, BETTER DECISIONS

DATA MOVEMENT → DATA TRANSFORMATIONS → BUSINESS INTELLIGENCE

AZURE DATA FACTORY: HYBRID DATA INTEGRATION AT SCALE

Customer Profitability · Sentiment Analysis · Market Basket Analysis

Machine Learning, Big Data Analytics, SQL, NoSQL, Data Warehouse, Data Lake

ADF: orchestrate data services at scale with a fully managed data integration cloud service

[Diagram: data sources (SQL databases, code, JSON files) → INGEST → PREPARE → TRANSFORM & ANALYZE.]

• Create, schedule, orchestrate, and manage data pipelines

• Visualize data lineage

• Connect to on-premises and cloud data sources

• Monitor data pipeline health

• Automate cloud resource management

• Move relational data for Hadoop processing

• Transform with Hive, Pig, PySpark, SQL SPROC or custom code
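The capabilities above are expressed as JSON pipeline definitions. A hypothetical ADF v1-style pipeline with a single Copy activity might look like the following sketch (pipeline, dataset, and activity names are invented for illustration):

```python
import json

# A hypothetical ADF v1-style pipeline definition with one Copy activity.
# Dataset names, dates, and the pipeline name are invented for illustration.
pipeline = {
    "name": "CopyBlogComments",
    "properties": {
        "description": "Copy blog comments from Blob Storage into Azure SQL DB",
        "activities": [
            {
                "name": "BlobToSqlCopy",
                "type": "Copy",
                "inputs": [{"name": "BlobCommentsDataset"}],
                "outputs": [{"name": "SqlCommentsDataset"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ],
        # v1 pipelines are scheduled by an active period rather than triggers
        "start": "2018-01-01T00:00:00Z",
        "end": "2018-02-01T00:00:00Z",
    },
}

print(json.dumps(pipeline, indent=2))
```

ADF then handles the scheduling, lineage, and health monitoring of every pipeline deployed this way.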

Cloud Analytics – Common Challenges

ELT with Apache Spark Activity from ADF Pipeline

• Create a new pipeline and an HDI cluster Linked Service for Spark from the Azure Portal
• Invoke a Python script from the Spark activity to transform data at scale
• Schedule, monitor, and manage the pipeline from ADF
• Verify results and perform analytics from Jupyter notebooks / Power BI
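In ADF v2 terms, the Spark step above is an HDInsightSpark activity pointing at a Python entry script in storage. A sketch of such an activity definition follows; the linked-service names and storage paths are hypothetical:

```python
import json

# Sketch of an ADF v2 HDInsightSpark activity that invokes a Python script.
# Linked service names and storage paths are hypothetical.
spark_activity = {
    "name": "TransformAtScale",
    "type": "HDInsightSpark",
    "linkedServiceName": {
        "referenceName": "OnDemandHDICluster",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        # rootPath/entryFilePath locate the PySpark script in blob storage
        "rootPath": "adfspark",
        "entryFilePath": "scripts/transform.py",
        "sparkJobLinkedService": {
            "referenceName": "SparkJobStorage",
            "type": "LinkedServiceReference",
        },
    },
}

print(json.dumps(spark_activity, indent=2))
```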

ELT with Azure Data Lake from ADF Pipeline

• Create a new pipeline and an Azure Data Lake Analytics Linked Service from the Azure Portal
• Perform data transformations at scale with a U-SQL script
• Schedule, monitor, and manage the pipeline from ADF
• Verify results and perform analytics from ADLA
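The U-SQL transform might look like the script below, held here as a Python string of the kind an ADF U-SQL activity would submit to ADLA. The input/output paths and the comment schema are hypothetical:

```python
# A hypothetical U-SQL transform of the kind an ADF U-SQL activity submits
# to Azure Data Lake Analytics. Paths and schema are invented.
USQL_SCRIPT = """
@comments =
    EXTRACT Id int,
            Author string,
            Body string
    FROM "/input/comments/{*}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@by_author =
    SELECT Author,
           COUNT(*) AS CommentCount
    FROM @comments
    GROUP BY Author;

OUTPUT @by_author
TO "/output/comments_by_author.csv"
USING Outputters.Csv(outputHeader: true);
"""

print(USQL_SCRIPT)
```

ADLA scales the script out across analytics units, so the same pattern holds whether the comment set is megabytes or terabytes.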

Monitor & Manage Pipelines

Microsoft Data Integration Roadmap

• SSIS
  • SQL Server 2017
  • SSIS on Linux
  • Scale-out
• ADF
  • SSIS as a Cloud “Integration Runtime”
  • Code-free, web-based user experience
  • Control Flow orchestration + Data Flow steps
  • On-Demand Spark Cluster