paris datageeks meetup 05102016

30

Upload: michel-caradec

Post on 17-Jan-2017

122 views

Category:

Data & Analytics


0 download

TRANSCRIPT

U-SQL CASE STUDY

Paris Datageeks, 05/10/2016

01/05/2023 CEGID3

Michel Caradec• [email protected]

Project Manager, Software/Data Engineer at Cegid Background

• Business Intelligence, ETL, OLAP, Data Manipulation• C#, R, Python

About Me

Business Case

Azure Data Lake

U-SQL Case Study

Questions

Agenda

BUSINESS CASE

01/05/2023 CEGID6

Cegid web sites armed with tracking solutions• Extend web analytics data

Data Engineer: collect and prepare data Data Scientists: consume data in Azure ML Studio

• Visitors usage knowledge (browsing)• Provide better experience (recommendations)

Business Case

01/05/2023 CEGID7

Business Case

1 single visitBounce rate: 63%

Session length: 2 minMainly on Homepage

Average of 4.5 sessionsHigh screen resolutionWindows OSUse of internal SearchDuring the day

Average of 2.5 sessions1.5 pages / sessionBounce rate: 80%iOS (iPhone, iPad)In the evening and WENo conversion

Average of 7 sessions5 pages / session

Session length: 6 minMainly Solutions pages

More conversions

Tactilevisitor

3.5%

Returningvisitor

7%

Addictvisitor3.5%

One shotvisitor86%

Visitors clustering

Metrics built from a subset of the dataset.Do not represent real traffic.

AZURE DATA LAKE

01/05/2023 CEGID9

ADL Store: repository, schema-on-read, Web HDFS ADL Analytics: distributed processing using U-SQL

Azure Data Lake = Big Data as a Service

© Microsoft Azure

01/05/2023 CEGID10

REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];

@data = EXTRACT user string, timestamp DateTime, heart int FROM "quantified-{user}.tsv" USING Extractors.Tsv();

@agg = SELECT *, timestamp.Date AS date, timestamp.Hour AS hour FROM @data;

@agg = SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min, MAX(heart) AS max FROM @agg GROUP BY user, date, hour;

OUTPUT @agg TO "quantified.csv" USING Outputters.Csv();

U-SQL = SQL + C#

Extract Transform OutputADLS, Azure Storage Blobs, Azure SQL

File schema-on-read SQL-like data manipulation C# data types C# integration Can store as relational

And many more… User-defined aggregators (C#) User-defined operators (C#) Custom Extractors, Outputters (C#) File sets for multiple input files access

patterns Credentials

Inspired by Michael Rys presentationat SQL Server PASS Deutschland 2016

Extr

act

Out

put

Tran

sfor

m

CREATE TABLE quantified(user string, date DateTime, hour int, avg long?, min int?, max int?, INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));

INSERT INTO quantified SELECT * FROM @agg;

Out

put’

CASE STUDY

01/05/2023 CEGID12

JSON Record

Data sources

“Custom Dimensions”

Contains arrays

01/05/2023 CEGID13

U-SQL Pipeline

JSON / TSV conversion

• cegid-<site>-raw.tsv

Sessions aggregation

• cegid-<site>-

sessions.tsv

01/05/2023 CEGID14

U-SQL Script 1 - JSON / TSV Conversion

*.json files extraction

Records to JSON objects

conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

01/05/2023 CEGID15

Data extracted as raw text@json_records_raw =EXTRACT json_raw string FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"USING Extractors.Text(delimiter : Char.MinValue, quoting : false);

JSON / TSV Conversion - Step 1/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

Azure Storage Blobs 20160901/20160901.json 20160902/20160902.json 20160903/20160903.json 20160904/20160904.json …

@json_records_raw

01/05/2023 CEGID16

DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId", "customerId", ... };@json_records =SELECT JsonFunctions.JsonTuple(

json_raw, @fields.ToArray()) AS json_objectFROM @json_records_raw;

JSON / TSV Conversion - Step 2/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

01/05/2023 CEGID17

@json_records =SELECT json_object["ipAddress"] AS ipAddress, json_objectFROM @json_records;

@json_records =SELECT *FROM @json_recordsLEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ipsON ipAddress == ips.ip;

JSON / TSV Conversion - Step 3/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

01/05/2023 CEGID18

@events =SELECT json_object["timestamp"] AS timestamp,json_object["eventId"] AS eventId,json_object["customerId"] AS customerId,// ...FROM @json_records;

JSON / TSV Conversion - Step 4/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

01/05/2023 CEGID19

@events =SELECT *, CustomDimensions.ParseFromJson(cd) AS cd_arrayFROM @events;

JSON / TSV Conversion - Step 5/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

index value

3 bigFan

4 powerBuyer

2 1443654335461

1 ComkzbXvn8g

value (ordered)ComkzbXvn8g

1443654335461

bigFan

powerBuyer

01/05/2023 CEGID20

@events =SELECT timestamp,

eventId,customerId,verb,// ...cd_array[0] AS cd_01,cd_array[1] AS cd_02,// ...cd_array[19] AS cd_20

FROM @events;

JSON / TSV Conversion - Step 6/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

valueComkzbXvn8g

1443654335461

bigFan

powerBuyer

cd_01 cd_02 cd_03 cd_04ComkzbXvn8g 1443654335461 bigFan powerBuyer

01/05/2023 CEGID21

OUTPUT( SELECT * FROM @events WHERE fullUrl.StartsWith("http://www.cegid.com/uk/"))TO "cegid-uk-raw.tsv"ORDER BY clientId, timestamp ASCUSING Outputters.Tsv(quoting : false);

JSON / TSV Conversion - Step 7/7

*.json files extraction

Records to JSON objects conversion

Cegid IPS filtering

JSON objects fields extraction

Custom dimensions extraction

Custom dimensions pivot

Output to TSV files

01/05/2023 CEGID22

JSON / TSV Conversion

01/05/2023 CEGID23

TSV format

JSON / TSV Conversion - Output

01/05/2023 CEGID24

N events M sessions (M >= 1)• New session if no activity for 30 minutes

Aggregation using U-SQL REDUCE• Microsoft.Analytics.Interfaces.IReducer

User Agent parsing: UAParser (C#) Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)

U-SQL Script 2 - Sessions Aggregation

cegid-<site>-raw.tsv files extraction

Events to sessions

aggregation

Output to TSV

files

01/05/2023 CEGID25

N sessions 1 visitor Aggregation using U-SQL REDUCE

• Microsoft.Analytics.Interfaces.IReducer

U-SQL Script 3 - Visitors Aggregation

cegid-<site>-sessions.tsv

files extraction

Sessions to visitors

aggregation

Output to TSV

files

01/05/2023 CEGID26

Custom dimensions aggregation (C#)

Visitors Aggregation

visitor_id cd_01 cd_02A homepage articlesA solutions detailA homepage articlesA form newsB homepage articlesB blog news

visitor_id cat_homepage cat_solutions cat_blog cat_form typ_articles typ_detail typ_newsA 2 1 1 2 1 1B 1 1 1 1

cd_01 = page categorycd_02 = content type...cd_20 = …

01/05/2023 CEGID27

Keep C# code centralized in dedicated assemblies• Code-behind good for prototyping

Prefer U-SQL to C# for better optimization• Distributed execution built from U-SQL script, not C#

Properly use parallelism Know your data

• Statistics (cardinality, distribution, skewness)• Growth

Understand U-SQL concepts (vertex, partitioning, etc.) and MapReduce design patterns (will help)

Local mode is good, but do not forget to test on ADL

Lessons Learned

01/05/2023 CEGID28

Pros Packaged solution, zero deployment Cloud agility (scalability, elasticity) Web HDFS storage = Hadoop

compatible (sqoop, etc.) Integrated development

(Visual Studio, local/debug mode) U-SQL = SQL + C# (business code

reuse)

Azure Data Lake Review

Cons Proprietary solution Not on-premises Learning curve

QUESTIONS

Thank you

for your attention