Azure HDInsight

Big Data World

With Azure HDInsight

About

▪ Koray Kocabaş

▪ Data Platform (SQL Server) MVP

▪ Yemeksepeti Business Intelligence

▪ Bahcesehir University Instructor

▪ @koraykocabas

▪ https://tr.linkedin.com/in/koraykocabas

▪ Blog: http://www.misjournal.com

▪ E-Mail: [email protected]

Evolution of Data

Internet of Things

Web 2.0

ERP/CRM

• Clickstream

• Sensors / RFID / Devices

• Log Files

• Spatial & GPS Coordinates

• Social Media

• Mobile

• Advertising

• eCommerce

• Digital Marketing

• Search Marketing

• Recommendations

• Payables

• Payroll

• Inventory

• Contacts

• Deal Tracking

• Sales Pipeline

Gigabytes Terabytes Petabytes Exabytes

Big Data Utility Gap

70% of data is generated by customers

80% of data is being stored

3% is being prepared for analysis

0.5% is being analyzed

< 0.5% is being operationalized

How Does this Work in Practice

• Obsessively collect data

• Keep it forever

• Put the data in one place

Store Everything

• Cleanse, organize and manage your data

• Make the right tools available

• Use the resources wisely to compute, analyze and understand data

Analyze Anything

• Use insights to iteratively improve your product

Build the Right Thing

Big Data isn’t meaningful

Big Data is not just data

• 65+ million members

• 50 countries

• 1000+ devices supported

• ~25 PB data warehouse in the cloud (read 10%)

• ~550 billion events daily

• 20 million songs

• 24 million active users

• 8 million daily active users

• 1 TB of compressed data generated by users per day

• 700-node Hadoop cluster

Big Data is not just data

Cannes Lion 2014 - Grand Prix - Titanium : Honda 'Sound of Honda Ayrton Senna 1989'

2000+ sensors, 200 GB of data per race

Big Data is not just data

Boeing generates 20 TB of data per hour

~10 billion rows processed daily, ~750 million rows in the result set daily

New E-Commerce Big Data Flow

Purchase

User

Product

Data Warehouse

Store it All

Overview (ETL to ELT)

Demand Architecture Data Loading

Data Preparation

Analytics Validation

Problem: How can we track?

Web Analytics Companies

Google Analytics & Adobe Omniture

Problem 1: How can we collect data?

Problem 2: How can we store data?

Problem 3: How can we visualize data?

Problem 4: How can we predict data?

(Placeholder: add a Google Analytics / Adobe example here)

Azure vs Amazon

Collect Process Analyze Visualize Prediction

Store

Ready to Use

MOOC

Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning

Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive

Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming

Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration

Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics

OLTP vs Hadoop

Hadoop Ecosystem

One more cup of coffee

https://azure.microsoft.com/en-us/pricing/details/hdinsight/

https://azure.microsoft.com/en-us/pricing/calculator/

Apache Hive

Developed by Facebook, later adopted by Apache as an open source project.

A data warehouse infrastructure built on top of Hadoop that provides data summarization, query and analysis

Integrates Hadoop with BI and visualization tools

Provides an SQL-like language called HiveQL to query data

Supports CREATE INDEX and partitioning

UPDATE is often said to be unsupported (that isn’t correct)

Hive provides users, groups and roles, but it is not designed for high security.

Console (hive>), script, ODBC/JDBC, SQuirreL, HUE, Web Interface, etc.

Most popular Business Intelligence Tools support Hive

Data Types

Primitive Data Types: int, bigint, float, double, boolean, decimal, string, timestamp, date etc.

Complex Data Types: arrays, maps, structs

ARRAY<string>: workplace: istanbul, ankara

STRUCT<sex:string,age:int>: Female,25

MAP<string,int>: SOLR:92
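
To make the three complex types concrete, here is a minimal Python sketch that parses one text row into a list, a struct-like object and a dictionary, using the example values from the slide. It is only an illustration of how the values are shaped, not Hive code: the delimiters (tab between fields, comma between items, colon between map key and value), the `Person` class and the `parse_row` helper are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed delimiters for this sketch only; a real Hive table declares its own.
FIELD_SEP = "\t"
ITEM_SEP = ","
KEY_SEP = ":"


@dataclass
class Person:
    """Stand-in for STRUCT<sex:string,age:int> (hypothetical name)."""
    sex: str
    age: int


def parse_row(line):
    """Split one text row into (workplace list, Person, score map)."""
    workplace_raw, person_raw, scores_raw = line.split(FIELD_SEP)
    workplace = workplace_raw.split(ITEM_SEP)      # ARRAY<string>
    sex, age = person_raw.split(ITEM_SEP)          # STRUCT fields, in order
    scores = {}
    for pair in scores_raw.split(ITEM_SEP):        # MAP<string,int>
        key, value = pair.split(KEY_SEP)
        scores[key] = int(value)
    return workplace, Person(sex, int(age)), scores


row = "istanbul,ankara\tFemale,25\tSOLR:92"
workplace, person, scores = parse_row(row)
print(workplace, person, scores)
```

An array is addressed by position, a struct by field name and a map by key, which is why the three values end up as a list, an object and a dictionary here.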

Hive | RDBMS
SQL interface | SQL interface
Focus on analytics | May focus on online or analytics
No transactions | Transactions usually supported
Partition adds, no random inserts | Random insert and update supported
Distributed processing via MapReduce | Distributed processing varies by vendor (if available)
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Often built on proprietary hardware (especially when scaling out)
Low cost per petabyte | What's a petabyte? :) (note: are you sure?)

Hive Architecture

SQL on Hadoop Frameworks

• Apache Hive

• Impala

• Presto (Facebook)

• EMC/Pivotal HAWQ

• BigSQL by IBM

OLTP vs Hive

http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf

Apache Pig

Originally developed at Yahoo! (with large contributions from Hortonworks and Twitter)

A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs

Processes large semi-structured data sets using Hadoop MapReduce

Lets you write complex MapReduce jobs in a simple scripting language (Pig Latin)

Pig provides a set of aggregate functions (AVG, COUNT, SUM, MAX, MIN, etc.)

Developers can write user-defined functions (UDFs)

Console (grunt), script, java, HUE (Hadoop User Experience by Cloudera)

Easy to use and efficient

Data Types

Simple Data Types: int, float, double, chararray (UTF-8), bytearray

Complex Data Types: map (key, value), tuple, bag (list of tuples)

Commands

Loading: LOAD, STORE, DUMP

Filtering: FILTER, FOREACH, DISTINCT

Grouping: JOIN, GROUP, COGROUP, CROSS

Ordering: ORDER, LIMIT

Merging & Split: UNION, SPLIT

SQL SCRIPT | PIG SCRIPT
SELECT * FROM TABLE | A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
SELECT col1+col2, col3 FROM TABLE | B = FOREACH A GENERATE col1+col2, col3;
SELECT col1+col2, col3 FROM TABLE WHERE col3>10 | C = FILTER B BY col3>10;
SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2 | D = GROUP A BY (col1,col2); E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
... HAVING sum(col3) > 5 | F = FILTER E BY $2>5;
... ORDER BY col1 | G = ORDER F BY $0;
SELECT DISTINCT col1 FROM TABLE | I = FOREACH A GENERATE col1; J = DISTINCT I;
SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1 | K = GROUP A BY col1; L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
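
As a rough aid for reading the Pig column, the following Python sketch replays the GROUP, SUM, HAVING, ORDER and COUNT(DISTINCT) steps on a small in-memory list, reusing the relation names D through G, K and L from the comparison. The sample rows are made up for illustration, and plain Python lists stand in for Pig relations; nothing here runs on Hadoop.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like (col1, col2, col3).
A = [(1, 1, 4), (1, 1, 3), (1, 2, 2), (2, 1, 9), (2, 1, 1), (2, 2, 5)]

# D = GROUP A BY (col1, col2): each key maps to the bag of matching rows.
D = defaultdict(list)
for col1, col2, col3 in A:
    D[(col1, col2)].append((col1, col2, col3))

# E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3)
E = [(key[0], key[1], sum(row[2] for row in bag)) for key, bag in D.items()]

# F = FILTER E BY $2 > 5   (the HAVING clause; $2 is the third field)
F = [t for t in E if t[2] > 5]

# G = ORDER F BY $0
G = sorted(F, key=lambda t: t[0])

# K = GROUP A BY col1; L counts the distinct col2 values inside each group.
K = defaultdict(set)
for col1, col2, _ in A:
    K[col1].add(col2)
L = sorted((col1, len(values)) for col1, values in K.items())

print(G)
print(L)
```

The point of the comparison is that each SQL clause becomes its own named relation in Pig, so intermediate results such as E or F can be inspected one step at a time.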

Methods of Creating Azure HDInsight (Azure Portal)

Methods of Creating Azure HDInsight (Powershell)

Methods of Creating Azure HDInsight (.Net SDK)

Methods of Creating Azure HDInsight (SSIS)

Demo

• Create Hadoop Cluster (HDInsight)

• Create Database and Table (Hive)

• Data Load (Hive)

• Querying (Hive)

• Analyzing Breaking Bad Subtitles (Pig)

Case Study: Klout

• Collect and normalize more than 12 billion signals a day

• Hive data warehouse of more than 1 trillion rows

• Klout acquired for $200 million by Lithium Technologies

Is it necessary to use HDInsight or Hadoop?

• Find the Major Problem

Thank you
