azure hdinsight
About
▪ Koray Kocabaş
▪ Data Platform (SQL Server) MVP
▪ Yemeksepeti Business Intelligence
▪ Bahcesehir University Instructor
▪ @koraykocabas
▪ https://tr.linkedin.com/in/koraykocabas
▪ Blog: http://www.misjournal.com
▪ E-Mail: [email protected]
Evolution of Data
Internet of Things
Web 2.0
ERP/CRM
• Clickstream
• Sensors / RFID / Devices
• Log Files
• Spatial & GPS Coordinates
• Social Media
• Mobile
• Advertising
• eCommerce
• Digital Marketing
• Search Marketing
• Recommendations
• Payables
• Payroll
• Inventory
• Contacts
• Deal Tracking
• Sales Pipeline
Gigabytes Terabytes Petabytes Exabytes
Big Data Utility Gap
70 % of data generated by
customers
80 % of data being stored
3 % being prepared for
analysis
0.5 % being analyzed
< 0.5 % being operationalized
How Does this Work in Practice
• Obsessively collect data
• Keep it forever
• Put the data in one place
Store Everything
• Cleanse, organize and manage your data
• Make the right tools available
• Use the resources wisely to compute, analyze and understand data
Analyze Anything
• Use insights to iteratively improve your product
Build the Right Thing
Big Data isn’t meaningful
Big Data is not just data
• 65+ Million Members
• 50 Countries
• 1000+ Devices Supported
• ~25 PB Data Warehouse on Cloud (Read 10%)
• ~550 Billion events daily
• 20 Million songs
• 24 Million Active Users
• 8 Million Daily Active Users
• 1 TB of Compressed Data Generated From Users Per Day
• 700 node Hadoop Cluster
Big Data is not just data
Cannes Lions 2014 - Grand Prix - Titanium: Honda 'Sound of Honda Ayrton Senna 1989'
2000+ sensors, 200 GB of data per race
Google Analytics & Adobe Omniture
Problem 1: How can we collect data?
Problem 2: How can we store data?
Problem 3: How can we visualize data?
Problem 4: How can we predict data?
MOOC
Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning
Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive
Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming
Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration
Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics
One more cup of coffee
https://azure.microsoft.com/en-us/pricing/details/hdinsight/
https://azure.microsoft.com/en-us/pricing/calculator/#
Developed by Facebook. Later it was adopted in Apache as an open source project.
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis
Integration between Hadoop and BI and visualization
Provides an SQL-like language called HiveQL to query data
Supports CREATE INDEX and partitioning
Update is often described as unsupported (that isn’t correct)
Hive provides users, groups and roles, but it is not designed for high security.
Console (hive>), script, ODBC/JDBC, SQuirreL, HUE, Web Interface, etc.
Most popular Business Intelligence Tools support Hive
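The partitioning mentioned above can be pictured with a small sketch. Hive lays out each partition of a table as a subdirectory named key=value, so a query that filters on the partition column only needs to read the matching directories. The Python below mimics that layout in memory. The table name `orders`, the column `city` and the sample rows are hypothetical, and this illustrates the idea rather than how Hive is implemented.

```python
from collections import defaultdict

# Hypothetical sample rows: (order_id, city, amount). Not from the talk.
ROWS = [
    (1, "istanbul", 40),
    (2, "ankara", 25),
    (3, "istanbul", 15),
    (4, "izmir", 30),
]

def partition_path(table, column, value):
    """Build a Hive-style partition directory name: table/column=value."""
    return f"{table}/{column}={value}"

def write_partitioned(rows, table="orders", column="city"):
    """Group rows into 'directories' keyed by their partition path."""
    layout = defaultdict(list)
    for order_id, city, amount in rows:
        # The partition column lives in the path, not in the stored row.
        layout[partition_path(table, column, city)].append((order_id, amount))
    return dict(layout)

def read_partition(layout, table, column, value):
    """Partition pruning: touch only the directory that matches the filter."""
    return layout.get(partition_path(table, column, value), [])

layout = write_partitioned(ROWS)
istanbul_total = sum(
    amount for _, amount in read_partition(layout, "orders", "city", "istanbul")
)
print(sorted(layout))
print(istanbul_total)
```

A filter on `city` reads one directory instead of scanning every row, which is the reason partition columns are usually chosen to match the most common filters.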
Data Types
Primitive Data Types: int, bigint, float, double, boolean, decimal, string, timestamp, date etc.
Complex Data Types: arrays, maps, structs
ARRAY<string>: workplace: istanbul, ankara
STRUCT<sex:string,age:int>: Female,25
MAP<string,int>: SOLR:92
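To make the three complex types concrete, here is a rough Python sketch that parses one delimited text row into an array, a struct and a map, using the example values from the slide. The '|' field separator, the ',' item separator, the ':' key separator and the column names are assumptions chosen for illustration; a real Hive table declares its own delimiters in its row format.

```python
from typing import NamedTuple

class Person(NamedTuple):
    """Stand-in for STRUCT<sex:string,age:int>."""
    sex: str
    age: int

def parse_row(line, field_sep="|", item_sep=",", key_sep=":"):
    """Split one text row into (ARRAY<string>, STRUCT, MAP<string,int>)."""
    workplace_raw, person_raw, scores_raw = line.split(field_sep)
    # ARRAY<string>: items separated by item_sep
    workplace = [item.strip() for item in workplace_raw.split(item_sep)]
    # STRUCT<sex:string,age:int>: positional fields separated by item_sep
    sex, age = person_raw.split(item_sep)
    person = Person(sex.strip(), int(age))
    # MAP<string,int>: pairs separated by item_sep, key and value by key_sep
    scores = {}
    for pair in scores_raw.split(item_sep):
        key, value = pair.split(key_sep)
        scores[key.strip()] = int(value)
    return workplace, person, scores

# Hypothetical row built from the slide's example values.
workplace, person, scores = parse_row("istanbul,ankara|Female,25|SOLR:92")
print(workplace, person, scores)
```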
Hive | RDBMS
SQL interface | SQL interface
Focus on analytics | May focus on online or analytics
No transactions | Transactions usually supported
Partition adds, no random inserts | Random insert and update supported
Distributed processing via MapReduce | Distributed processing varies by vendor (if available)
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Often built on proprietary hardware (especially when scaling out)
Low cost per petabyte | What's a petabyte? :) (note: Are you sure?)
Hive Architecture
SQL on Hadoop Frameworks
• Apache Hive
• Impala
• Presto (Facebook)
• EMC/Pivotal HAWQ
• BigSQL by IBM
OLTP vs Hive
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Originally developed at Yahoo! (Huge contributions from Hortonworks, Twitter)
A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
Processing large semi-structured data sets using Hadoop Map Reduce
Write complex MapReduce jobs using a simple script language (Pig Latin)
Pig provides a set of aggregation functions (AVG, COUNT, SUM, MAX, MIN etc.)
Developers can write their own user-defined functions (UDFs)
Console (grunt), script, java, HUE (Hadoop User Experience by Cloudera)
Easy to use and efficient
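Pig Latin scripts are compiled into MapReduce jobs, so it may help to see the shape of the work a script saves you from writing by hand. The following is a single-process Python sketch of the map, shuffle and reduce phases for a word count. The sample lines are made up, and real Hadoop jobs run these phases distributed across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))

# Hypothetical input lines, standing in for a text file on the cluster.
counts = word_count(["big data is big", "data is not just data"])
print(counts)
```

In Pig the same job is a handful of statements (load, split into words, group, count), which is the convenience the bullets above describe.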
Data Types
Simple Data Types: int, float, double, chararray (UTF-8), bytearray
Complex Data Types: map (Key,Value), Tuple, Bag (list of tuples)
Commands
Loading: LOAD, STORE, DUMP
Filtering: FILTER, FOREACH, DISTINCT
Grouping: JOIN, GROUP, COGROUP, CROSS
Ordering: ORDER, LIMIT
Merging & Split: UNION, SPLIT
SQL script | Pig script
SELECT * FROM TABLE | A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
SELECT col1+col2, col3 FROM TABLE | B = FOREACH A GENERATE col1+col2, col3;
SELECT col1+col2, col3 FROM TABLE WHERE col3>10 | C = FILTER B BY col3>10;
SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2 | D = GROUP A BY (col1,col2);
 | E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
... HAVING sum(col3) > 5 | F = FILTER E BY $2>5;
... ORDER BY col1 | G = ORDER F BY $0;
SELECT DISTINCT col1 FROM TABLE | I = FOREACH A GENERATE col1;
 | J = DISTINCT I;
SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1 | K = GROUP A BY col1;
 | L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
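As a sanity check on the mapping above, the sketch below runs the GROUP BY / HAVING / ORDER BY query in SQLite and then repeats the same steps in plain Python, one step per Pig relation (D, E, F, G). The sample rows are invented for the example, SQLite stands in for Hive or an RDBMS, and the Python code only imitates what the Pig operators do.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows for (col1, col2, col3).
A = [(1, 1, 4), (1, 1, 3), (1, 2, 2), (2, 1, 9), (2, 1, 1), (3, 3, 5)]

def sql_result(rows):
    """Run the SQL side of the comparison against an in-memory table X."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE X (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
    conn.executemany("INSERT INTO X VALUES (?, ?, ?)", rows)
    query = (
        "SELECT col1, col2, SUM(col3) FROM X "
        "GROUP BY col1, col2 HAVING SUM(col3) > 5 ORDER BY col1"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result

def pig_like_result(rows):
    """Imitate the Pig relations D, E, F and G step by step."""
    # D = GROUP A BY (col1, col2);
    D = defaultdict(list)
    for col1, col2, col3 in rows:
        D[(col1, col2)].append(col3)
    # E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
    E = [(col1, col2, sum(values)) for (col1, col2), values in D.items()]
    # F = FILTER E BY $2 > 5;
    F = [row for row in E if row[2] > 5]
    # G = ORDER F BY $0;
    return sorted(F, key=lambda row: row[0])

print(sql_result(A))
print(pig_like_result(A))
```

The point of the exercise is that each SQL clause becomes its own named relation in Pig, so intermediate results such as E can be inspected or reused.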
Demo
• Create Hadoop Cluster (HDInsight)
• Create Database and Table (Hive)
• Data Load (Hive)
• Querying (Hive)
• Analyzing Breaking Bad Subtitles (Pig)
Case Study: Klout
• Collect and normalize more than 12 billion signals a day
• Hive data warehouse of more than 1 trillion rows
• Klout acquired for $200 million by Lithium Technologies