etl for the masses with power query and m
DESCRIPTION
This is the slide deck for my presentation given at SQL Saturday 280 in Vienna.on March 6th 2014.TRANSCRIPT
ETL for the MassesRégis Baccaro – IBM
@regbac
Our Sponsors
Introduction
Régis Baccaro @regbachttp://Theblobfarm.wordpress.com
http://Thelovefarm.wordpress.com
• Founder and lead organizer of SQL Saturday Denmark
• PASS Regional Mentor
• Works for IBM
• Passionate about the community
• .Net developer, BI dude, SharePoint fellow and accidental DBA
Agenda
• Power Query and the M language
• E and T and L with Power Query
• Data refresh techniques with PQ
• Next step
Introduction
• Power Query• Get data experience
• Filter and combine
• Embedded M for repeatable mashup
• Power Query Formula Language (aka M)• Mostly pure
• Higher-order
• Dynamically typed
• Partially lazy
• Functional programming language
Elements of language
• Expressions – central construct• Evaluated to a single vlaue
• Values• Primitives
• List – ordered seq.
• Record – set of fields
• Table
• Function
Evaluation
• Excel-like (surprise !)
• Nested records• In Records
• In Lists
• Lazy evaluation• Lists and Records (and let)
• Eager evaluation• Everything else
Functions and Standard Library
• Mapping from a set of values to a single value• (named parameters) => function body
• Common set of definitions
Operators
• Meaning varies depending on kind of value
• & = text or list concatenation and records merge
Metadata
• Information about a value that is associated with a value
• A record
• Exists for every value
• Unobtrusive way to add information
• Accessed with Value.Metadata
Let .....in expression
• So far only literal values
• Let allows a set of value to be:• Computed
• Named
• Used in subsequent expressions that follows the in
let
Source = Web.Page(Web.Contents("http://www.cvr.dk/Site/Forms/CompanySearch/CompanySearch.aspx?......),
RowCount = Table.RowCount(Source)
in
RowCount
IF expression
• Select between 2 expression based on logical condition
Error expression
• When an expression evaluation cannot yield a value
• Raised with error
• Handled with try
• Produces an Error record
• try...otherwise Used with default values
Keywords and Operators
• and as each else error false if in is let meta not otherwise or section shared then true try type #binary #date #datetime #datetimezone #duration #infinity #nan #sections #shared #table #time
• , ; = < <= > >= <> + - * / & ( ) [ ] { } @ ! ? => .. ...
The ”E” - Why is Power Query great for Extracting data
• Multiple data sources
Hey wait ! Where is PDW ?
Query folding - A step toward declarative ETL approach
• Declarative vs Imperative
• Query folding similar to predicate pushdown
• Does Power Query have a Query Optimizer ?
• Demo
Query folding - the unofficial list:• SQL Databases• OData and OData based sources, such
as the Windows Azure Marketplace and SharePoint Lists
• Active Directory• HDFS.Files, Folder.Files, and
Folder.Contents (for basic operations on paths)
• Column removal• Renaming• Joins• Type conversions
Real life scenario – ETL for the masses
• Seen a lot of demos
• Build a lot of demos
• They are always so clean !
Real life scenario
Transform
• M is how the magic happens!
• Data manipulation• Records
• Lists
• Tables
• Merging
• Function calls
What about our scenario?
• Where should I get my data from?• Pure Excel
• Excel and MDS/DQS/SSIS/SQL
• Web, SQL, XML, ?
• Let me show you ! Input
• (cvr web)
Let’s go to homegrown data?
• Bad web service
• Bad HTML structure
• Let’s go with local data that we can control• SQL Server
• Excel
• Let’s Query!
Isolated DB
Local storage
Clean up before you merge!
• DQSKnowledge base with CVR
+ Cleansing project with LinkedIn input
________________________________________
= Demo2.1_AndreasStrandbyClean
• Hit ratio increased...
+
=
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
0
50
100
150
200
250
Cleanjoin
Nestedjoin
Merge
Hit Total
Smarter Power Query
• Expression.Evaluate()
• Examples• Load query text from file
• Load function from file
• Passing parameters (as constants)
• Demo
Refreshing Power Query data
• Different solutions
• All with flaws !
Refreshing Power Query data – with VB6 !
• Back from 2006
• From Kim GreenLee
Plus Minus
Can be scheduled VB6 – are you kidding ?
More robust than the non-technical solution
Refreshing Power Query data – with PowerShell
Plus Minus
Robust Hard to troubleshoot
Can not run in a task in windows task scheduler unless the user has checked that the user has to be logged on to run
Refreshing Power Query data – The non-technical way
• Let me show you !
Plus Minus
Very easy Not very corporate !
The spreadsheet needs to be open
Excel file not saved
Locked out when it refreshes
Refreshing Power Query data – The non-technical way part 2
• Let me show you !
Plus Minus
Very easy Not very corporate !
Uses technique from previous The spreadsheet needs to be open
Refreshing Power Query data – with SSIS
Plus Minus
Robust Requires a SQL Server (wait, it’s a plus!)
Needs a SSIS / C# developer
Refreshing Power Query data – with SSIS
• Using DQS for cleansing input
• Let me show you !
How is Power query going to be used?
• Data store accumulating interesting data points
• Hook into read only data for reporting purposes or data marts
• One file to accumulate (Produce)
• Multiple files or programs to report (Consume)
• I don’t believe in “Data Steward”
• I believe someone will be in charge of procuring and monitoring data stores of disparate data (such as IT or DBA’s).
Conclusion
• A step toward declarative ETL approach
• Still much work to do !
We have• A declarative data integration language• Only surfaced in Power Query• Can push data to an Excel spreadsheetImagine.....• Connection to heterogenous data sources
THANK YOU!@REGBACHTTP://[email protected]