
Page 1: Dive Into Azure Data Lake - PASS 2017

Ike Ellis, MVP, Crafting Bytes

Dive into the Azure Data Lake

Page 2: Dive Into Azure Data Lake - PASS 2017

Please silence cell phones


Page 3: Dive Into Azure Data Lake - PASS 2017

Explore everything PASS has to offer:

• Free online webinar events
• Free 1-day local training events
• Local user groups around the world
• Online special interest user groups
• Business analytics training

Free online resources:

• PASS Blog
• White Papers
• Session Recordings
• Newsletter
• PASS Connector
• BA Insights

Get involved: www.pass.org

Page 4: Dive Into Azure Data Lake - PASS 2017

Session evaluations

Your feedback is important and valuable. 3 ways to access:

• Download the GuideBook App and search: PASS Summit 2017
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
• Go to passSummit.com

Submit by 5pm Friday, November 10th to win prizes.

Page 5: Dive Into Azure Data Lake - PASS 2017

Ike Ellis, Partner, Crafting Bytes

Microsoft MVP since 2011
Frequent PASS Summit speaker
Speaks often at SQL Saturdays and makes YouTube videos – Ike's Tips!
Author of Developing Azure Solutions, now in 2nd edition

/ikeellis   @ike_ellis   ellisike

Page 6: Dive Into Azure Data Lake - PASS 2017

Thank you! Learn more from Ike Ellis:

[email protected]

@ike_ellis

Page 7: Dive Into Azure Data Lake - PASS 2017

What is a data lake as a data architecture term?

"If you think of a datamart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state." – James Dixon, creator of the term

Page 8: Dive Into Azure Data Lake - PASS 2017

Data Lake Attributes

Contains unstructured, structured, semi-structured, and multi-structured data.

Data lakes have traditionally been Azure Blob Storage, Amazon S3, and Hadoop HDFS.

Data is transformed in place, with no movement.

Page 9: Dive Into Azure Data Lake - PASS 2017

Complexities with Standard MPP

[Diagrams: a typical HDFS installation, a typical Hadoop installation, and Redshift]

Page 10: Dive Into Azure Data Lake - PASS 2017

Azure Data Lake Store

Cloud-based file system that is unlimited in size

Compatible with Hadoop/Spark:

• Hive
• Pig
• Sqoop

Page 11: Dive Into Azure Data Lake - PASS 2017
Page 12: Dive Into Azure Data Lake - PASS 2017

Azure Data Lake Store

• Unlimited storage – even a single petabyte file
• Tuned for batch jobs
• Uses Azure Active Directory for security
• Stores all data in its native format
• Addressed with an adl:// URI
• Exposes a WebHDFS API
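For example, a U-SQL script can reach the store through its adl:// URI. A minimal sketch – the account name and file here are hypothetical:

@players =
    EXTRACT PlayerName string,
            Team string
    FROM "adl://mydatalake.azuredatalakestore.net/BaseballStats/Players.csv"
    USING Extractors.Csv();   // schema-on-read: the columns are declared in the script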

Page 13: Dive Into Azure Data Lake - PASS 2017

Import data

• The Azure Portal
• .NET SDK
• Azure Data Factory
• DistCp (Hadoop Distributed Copy) – sketched below
• Apache Sqoop
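A minimal DistCp sketch, run from a Hadoop/HDInsight cluster head node; the storage account, container, and lake account names are placeholders:

hadoop distcp \
    wasbs://data@mystorage.blob.core.windows.net/iislogs \
    adl://mydatalake.azuredatalakestore.net/iislogs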

Page 14: Dive Into Azure Data Lake - PASS 2017

Azure Data Lake Store vs Azure Blob Storage

• ADLS is optimized for analytics
• Blob Storage holds tons of data inappropriate for analytics, like VHDs
• ADLS has folders; Blob Storage has containers
• ADLS uses the WebHDFS REST API; Azure Blob Storage uses the Azure SDK REST APIs
• ADLS has no size limits
• ADLS has no geo-redundancy yet
• ADLS is more expensive
• Azure Blob Storage can be better for analytics if there are a lot of small files

Page 15: Dive Into Azure Data Lake - PASS 2017

DEMO

Provisioning Azure Data Lake Store

Provisioning Azure Data Lake Analytics

Connecting the two

Page 16: Dive Into Azure Data Lake - PASS 2017

Azure Data Lake Analytics

Service that queries data for analysis

ADLA runs jobs for querying, cleaning, aggregating, and transforming data.

ADLA uses U-SQL as its language.

Page 17: Dive Into Azure Data Lake - PASS 2017

U-SQL Job Syntax
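A minimal sketch of the shape of a U-SQL job (names and paths are illustrative): extract into a rowset variable, transform it, output it.

@rows =
    EXTRACT ColumnA string,
            ColumnB int
    FROM "/input/data.csv"
    USING Extractors.Csv();

@result =
    SELECT ColumnA,
           SUM(ColumnB) AS Total
    FROM @rows
    GROUP BY ColumnA;

OUTPUT @result
TO "/output/result.csv"
USING Outputters.Csv();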

Page 18: Dive Into Azure Data Lake - PASS 2017

U-SQL combines SQL and C#

SQL is a declarative language:
• You don't need to write all the steps
• Just write what data you want and let the optimizer get the data for you
• Multi-threading is done for you in the background

C# is a procedural language that is often a better fit:
• Custom functions
• But multi-threading is very difficult to write yourself

U-SQL unifies both of these languages:
• A highly customizable and extensible way to interact with data
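For example, C# expressions and methods can be used inline in a U-SQL SELECT – a sketch using the baseball columns from the scripts that follow:

@labeled =
    SELECT PlayerName.ToUpper() AS PlayerName,             // C# string method
           (HomeRuns ?? 0) AS HomeRuns,                    // C# null-coalescing operator
           (Age > 30 ? "Veteran" : "Younger") AS AgeGroup  // C# conditional expression
    FROM @hitters;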

Page 19: Dive Into Azure Data Lake - PASS 2017

U-SQL – Start with a simple script

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

OUTPUT @hitters
TO "/output/hitters.txt"
USING Outputters.Text();

Page 20: Dive Into Azure Data Lake - PASS 2017

Let's run this script?

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

@bestHitters = select * from @hitters
               where HomeRuns != null;

OUTPUT @bestHitters
TO "/output/hitters.txt"
USING Outputters.Text();

Page 21: Dive Into Azure Data Lake - PASS 2017

OK, here's the good script:

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

@bestHitters = SELECT * FROM @hitters
               WHERE HomeRuns != null AND HomeRuns != 0;

OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();

Page 22: Dive Into Azure Data Lake - PASS 2017

We can do GROUP BY:

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal
               FROM @hitters
               WHERE HomeRuns != null AND HomeRuns != 0
               GROUP BY Team;

OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();

Page 23: Dive Into Azure Data Lake - PASS 2017

JOINs, too! Does this script work?

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
               WHERE HomeRuns != null AND HomeRuns != 0
               GROUP BY Team;

@bestHitterWithName = SELECT PlayerName, Team, HomeRuns
                      FROM @hitters h
                      JOIN @bestHitters bh
                      ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal

OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt"
USING Outputters.Text();

Page 24: Dive Into Azure Data Lake - PASS 2017

Nope! You need AS for aliases, qualified column names (they're ambiguous otherwise), and INNER for JOINs:

@hitters =
    EXTRACT Rank int
          , PlayerName string
          , Age int
          , Team string
          , League string
          , Games int
          , HomeRuns int?
          , BattingAvg decimal?
          , OneBasePer decimal?
    FROM "/BaseballStats/Baseball.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);

@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
               WHERE HomeRuns != null AND HomeRuns != 0
               GROUP BY Team;

@bestHitterWithName = SELECT h.PlayerName, h.Team, h.HomeRuns
                      FROM @hitters AS h
                      INNER JOIN @bestHitters AS bh
                      ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal;

OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt"
USING Outputters.Text();

Page 25: Dive Into Azure Data Lake - PASS 2017

You can process every file in a folder like this:

@log =
    EXTRACT date string,
            time string,
            client_ip string,
            username string,
            server_ip string,
            port int,
            method string,
            stem string,
            query string,
            status string,
            server_bytes int,
            client_bytes int,
            time_taken int,
            user_agent string,
            referrer string
    FROM "/iislogs/{*}.txt"
    USING Extractors.Text(' ', silent: true);

@dailysummary =
    SELECT date,
           COUNT(*) AS hits,
           SUM(server_bytes) AS bytes_sent,
           SUM(client_bytes) AS bytes_received
    FROM @log
    GROUP BY date;

Page 26: Dive Into Azure Data Lake - PASS 2017

U-SQL Tips

• Case sensitive
• C# data types (string instead of varchar)
• C# comparisons: ==, !=
• Compare nulls!
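A small sketch of how these tips show up in a script:

// Keywords must be upper-case – U-SQL is case sensitive.
@veterans =
    SELECT PlayerName,                // C# types: string and int?, not varchar
           (Age >= 30) AS IsVeteran   // C# comparison operators
    FROM @hitters
    WHERE HomeRuns != null;           // nullable columns must be checked against null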

Page 27: Dive Into Azure Data Lake - PASS 2017

Typical U-SQL Job

1) Extract

2) Assign results to variable

3) Create a schema on read

4) Use schema when we process it

5) Filter data that we need

6) Apply the results to a variable

7) Output the variable someplace, like a CSV file

Page 28: Dive Into Azure Data Lake - PASS 2017

Visual Studio Code


Page 29: Dive Into Azure Data Lake - PASS 2017

VS Code Cool Features

• Generate Code Behind

• Format Code

• Submit Jobs

• Automatically takes you to the portal


Page 30: Dive Into Azure Data Lake - PASS 2017

Visual Studio

• Autocomplete
• Local Data Lake for testing
• Show the advanced tab

Page 31: Dive Into Azure Data Lake - PASS 2017

U-SQL Catalogs

CREATE TABLE lets you create a managed table that's permanent (CREATE EXTERNAL TABLE, by contrast, references data that lives outside the catalog)

Can also create other metadata objects:

• Views – saved queries
• Table-valued functions
• Stored procedures
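A sketch of creating a managed table in the catalog and loading it from the @hitters rowset of the earlier scripts; the database, table, and column names are illustrative:

CREATE DATABASE IF NOT EXISTS BaseballDB;
USE DATABASE BaseballDB;

CREATE TABLE IF NOT EXISTS dbo.Hitters
(
    PlayerName string,
    Team string,
    HomeRuns int?,
    INDEX idx_hitters CLUSTERED (Team ASC)
    DISTRIBUTED BY HASH (Team)   // managed tables need an index and a distribution scheme
);

INSERT INTO dbo.Hitters
SELECT PlayerName, Team, HomeRuns FROM @hitters;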

Page 32: Dive Into Azure Data Lake - PASS 2017

Demo U-SQL Catalog

Page 33: Dive Into Azure Data Lake - PASS 2017

Views
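A minimal view sketch (names illustrative), saving a query over the managed table above so it can be referenced by name:

CREATE VIEW IF NOT EXISTS dbo.PowerHitters AS
    SELECT PlayerName, Team, HomeRuns
    FROM dbo.Hitters
    WHERE HomeRuns >= 30;

@rows = SELECT * FROM BaseballDB.dbo.PowerHitters;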

Page 34: Dive Into Azure Data Lake - PASS 2017

Functions
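A table-valued function sketch along the same lines (parameter and names illustrative):

CREATE FUNCTION IF NOT EXISTS dbo.HittersByTeam(@team string)
RETURNS @result TABLE(PlayerName string, HomeRuns int?)
AS
BEGIN
    @result =
        SELECT PlayerName, HomeRuns
        FROM dbo.Hitters
        WHERE Team == @team;   // C# equality comparison
END;

@padres = SELECT * FROM dbo.HittersByTeam("SDP");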

Page 35: Dive Into Azure Data Lake - PASS 2017

Procedures
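And a procedure sketch that wraps the function above (names illustrative; the output path is fixed for the sake of the sketch):

CREATE PROCEDURE IF NOT EXISTS dbo.ExportTeam(@team string)
AS
BEGIN
    @rows = SELECT * FROM dbo.HittersByTeam(@team);
    OUTPUT @rows
    TO "/output/team.csv"
    USING Outputters.Csv();
END;

BaseballDB.dbo.ExportTeam("SDP");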

Page 36: Dive Into Azure Data Lake - PASS 2017

U-SQL and C#

Use a code-behind file (see the sketch below), or create an assembly and deploy it with a deployment script to use it all the time.
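A minimal code-behind sketch; the class, method, and file names are illustrative. The C# file sits next to the script, and the script calls the method by its fully qualified name:

// Script.usql.cs (code-behind, C#)
namespace DataUtilities
{
    public static class Formatting
    {
        // Trim and upper-case a player name; nulls pass through.
        public static string CleanName(string name)
        {
            return name == null ? null : name.Trim().ToUpperInvariant();
        }
    }
}

// Script.usql (U-SQL)
@clean =
    SELECT DataUtilities.Formatting.CleanName(PlayerName) AS PlayerName,
           Team
    FROM @hitters;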

Page 37: Dive Into Azure Data Lake - PASS 2017

DEMO U-SQL Code Behind

Page 38: Dive Into Azure Data Lake - PASS 2017

DEMO U-SQL Custom Assemblies

1. Create a class library
2. Name the project DataUtilities
3. Implement the class
4. View the Cloud Explorer pane
5. In the Solution Explorer pane, right-click the DataUtilities project and click Register Assembly
6. Select the webdata database
7. In the Azure portal, on the blade for your Azure Data Lake Analytics account, click Data Explorer, and then browse to the Assemblies folder in your webdata database to verify that the assembly has been registered
8. Use it in a query (sketched below)
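Once registered, the assembly can be referenced from any script; the class and method here carry over from the code-behind sketch and are illustrative:

USE DATABASE webdata;
REFERENCE ASSEMBLY DataUtilities;

@clean =
    SELECT DataUtilities.Formatting.CleanName(PlayerName) AS PlayerName
    FROM @hitters;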

Page 39: Dive Into Azure Data Lake - PASS 2017

U-SQL Job Architecture

Each job step is divided into vertices, and each vertex represents a single piece of work. Each unit (the slider bar at the top) is functionally similar to a physical node. Each available unit grabs a vertex and works through it until the task is completed, then moves on to the next task.

Page 40: Dive Into Azure Data Lake - PASS 2017

Job Monitor

Job List

Replay Job

See Vertices

Can use the portal or Visual Studio

Page 41: Dive Into Azure Data Lake - PASS 2017

Designing for performance

Page 42: Dive Into Azure Data Lake - PASS 2017

Add ADLS data for Hive in HDInsight

• Connect ADLS with HDInsight
• Provide a security principal
• Query WebHDFS with the adl:// prefix (see the sketch below)
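A HiveQL sketch of pointing an external Hive table at the lake; the account name, columns, and path are placeholders:

CREATE EXTERNAL TABLE iislogs (
    log_date STRING,
    client_ip STRING,
    server_bytes INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 'adl://mydatalake.azuredatalakestore.net/iislogs/';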

Page 43: Dive Into Azure Data Lake - PASS 2017

ALL DONE!

Ike Ellis

@ike_ellis

Crafting Bytes

We're hiring Data Experts!

Microsoft MVP

Chairperson of the San Diego TIG

Book co-author – Developing Azure Solutions

Upcoming course on Azure Data Lake

www.craftingbytes.com

www.ikeellis.com