big data analytics: from etl to data engineering

7
Big Data Analytics: From ETL to Data Engineering By Dmitry Tolpeko – November 2016

Upload: dmitry-tolpeko

Post on 22-Jan-2017

278 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Big Data Analytics: From ETL to Data Engineering

Big Data Analytics: From ETL to Data Engineering

By Dmitry Tolpeko – November 2016

Page 2: Big Data Analytics: From ETL to Data Engineering

Is Data Engineering Just a New Term? Is there any difference between ETL and Data Engineering? Or do

they mean the same thing?

I think they are very different, and I will try to explain.

Page 3: Big Data Analytics: From ETL to Data Engineering

What is ETL? ETL means Extract-Transform-Load, and it is a very important part of

building and maintaining a Data Warehouse

Extract: Export data from one or many sources.

Transform (at various stages): Clean, transform, aggregate etc. data

Load (at various stages): Load data into the data warehouse, data marts i.e. make data available for BI tools.

Page 4: Big Data Analytics: From ETL to Data Engineering

ETL Specifics Transform: Mostly SQL and Procedural SQL

SQL Analytic (Window) functions such as LEAD/LAG for complex calculations

Data in Data Warehouse: Often redundant, duplicated and pre-aggregated for performance reasons. But in most cases data is in first normal form i.e. columns contain atomic values.

Also data is in Star schema: a set of dimension and fact tables.

Data Consumers: BI tools that mostly visualize data in interactive (filter, drill-down etc.) reports and dashboards.

Page 5: Big Data Analytics: From ETL to Data Engineering

Big Data Analytics Data Scientists are the first consumers of data now.

Feature engineering is the major step of building statistical and machine learning models for advanced analytics:

Feature vectors containing hundreds of elements with each element containing a list, map, nested structs, buckets, list of maps, map of maps and so on.

Complex Calculations to create feature vectors within a data window (reduce phase) require a non-SQL approach.

Page 6: Big Data Analytics: From ETL to Data Engineering

Skills: ETL Developer vs Data Engineer ETL Developer:

SQL and procedural SQL Analytic SQL functions Data warehouse design and modelling

Data Engineer: Python (Scala, Java), Spark, Pig, SQL Distributed data processing concepts (MapReduce, Spark) Statistics and machine learning concepts