hadoop on azure 101 what is the big deal? dennis mulder solution architect microsoft corporation
TRANSCRIPT
Center of Excellence
Hadoop on Azure 101: What is the Big Deal?
Dennis Mulder
Solution Architect
Microsoft Corporation
Windows Azure Center of Excellence Spotlight
Pilots, Assessment
Architecture and Design Guidance
Modern Apps, Global Scale Design Sessions
Global Services Team: 10 Senior Cloud Architects
Dennis Mulder
US, EMEA, APAC
8
Pilots
Cloud Apps
Champs
Services
Dennis Mulder, Solution Architect, [email protected]
Contact, Pilots, Engage
Agenda
What is happening?
Why Big Data?
Understanding the Basics
Microsoft and Hadoop
Four megatrends will dominate the next decade
Social, Mobility, Big data, Cloud
85 BILLION mobile apps will be downloaded in 2012
Total digital content will grow 48% from 2011
2.7 zettabytes in 2012
Mobility: data security concerns
91% of organizations expect to spend on mobile devices in 2012
1/2 of companies expect to use internal social network apps in 2012
>80% of new apps in 2012 will be distributed/deployed on clouds
32% of businesses are likely to invest in BI and analytics in 2012
The strategic focus in the cloud will shift in 2012 from infrastructure to application platforms
In 2012, mobile devices will outship PCs by more than 2:1 and generate more revenue than PCs for the first time
Social networking will follow not just people but also appliances, devices and products
34% of CIOs say technology as a service (cloud) will have the most profound effect on the CIO role in the future
2/3 of mobile apps developed in 2012 will integrate with analytics offerings
49% of CIOs rank BI as the top project priority for 2012
Microsoft is embracing these megatrends
Social, Mobility, Big data, Cloud
How will technology megatrends enable you to save money, drive innovation, grow your business, and attract and retain customers?
Rethinking and evolving business strategies
Social, Big data, Mobility, Cloud
Why Big Data?
What is Big Data?
Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity, Variety, Variability
Storage cost per GB:
1980  $190,000
1990  $9,000
2000  $15
2010  $0.07
Example Scenarios
The Potential: Solving Specific Industry Problems
eCommerce: mining web logs: collaborative filtering, user experience optimisation…
Manufacturing: detecting trends and anomalies in sensor data: predicting and understanding faults
Capital Markets: joining market and external data: correlation detection for investment strategy identification, risk calculations…
Retail Banking: historical transaction mining: fraud detection, customer segmentation…
Industry-specific data-sets leveraged to improve decision making and generate new revenue streams
OPERATIONAL DATA
Traditional E-Commerce Data Flow
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Excess Data
Logs
ETL Some Data
Data Warehouse
OPERATIONAL DATA
New E-Commerce Big Data Flow
Raw Data “Store it All” Cluster
NEW USER REGISTRY
NEW PURCHASE
NEW PRODUCT
Data Warehouse
Logs
Logs
How much do views for certain products increase when our TV ads run?
Understanding the Basics: Move the Compute to the Data
Hadoop Distributed Architecture
So How Does It Work?
FIRST, STORE THE DATA
[Diagram: Files, Servers]
So How Does It Work?
SECOND, TAKE THE PROCESSING TO THE DATA
// Map Reduce function in JavaScript (word count)
var map = function (key, value, context) {
    // Split the input line on non-letters and emit (word, 1) for every word
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Add up all the counts emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
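The word-count functions above can be tried without a cluster by faking the pieces the runtime normally supplies. The sketch below is only an illustration: `runLocally`, its in-memory `context` objects and the `hasNext()`/`next()` iterator are hypothetical stand-ins written for this example, not part of HDInsight or Hadoop, and the two input lines are made-up sample text.

```javascript
// Word-count map and reduce, following the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local stand-in for the Hadoop runtime (illustration only).
function runLocally(lines) {
    // word -> list of emitted counts; plays the role of the shuffle step
    var grouped = Object.create(null);
    var mapContext = {
        write: function (k, v) {
            if (!(k in grouped)) {
                grouped[k] = [];
            }
            grouped[k].push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    var results = {};
    var reduceContext = {
        write: function (k, v) { results[k] = v; }
    };
    Object.keys(grouped).sort().forEach(function (word) {
        var list = grouped[word];
        var position = 0;
        // Minimal iterator with the hasNext()/next() shape that reduce expects
        var values = {
            hasNext: function () { return position < list.length; },
            next: function () { return list[position++]; }
        };
        reduce(word, values, reduceContext);
    });
    return results;
}

var counts = runLocally([
    "Move the compute to the data",
    "The code is brought to the data"
]);
console.log(counts);
```

On a real cluster the map calls run on the servers that hold the data and the grouping happens across the network; here everything runs in one process, so only the shape of the computation is shown.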
[Diagram: Code, Runtime, Servers]
MapReduce – Workflow
Scenario: Get sum sales grouped by zipCode
Blocks of the Sales file in HDFS are spread over Data Node 1, Data Node 2 and Data Node 3. Records are (custId, zipCode, amount):
5 53705 $15
6 44313 $10
5 53705 $65
0 54235 $22
9 02115 $15
6 44313 $25
3 10025 $95
8 44313 $55
2 53705 $30
1 02115 $15
4 54235 $75
7 10025 $60
Map: Map tasks (Mappers) read the blocks, emit (zipCode, amount) and Group By zipCode, with one output bucket per reduce task
Shuffle: each bucket is sent to its Reducer
Reduce: Reduce tasks (Reducers) Sort by zipCode and SUM the amounts:
10025 $155
44313 $90
53705 $110
54235 $97
02115 $30
Done!
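The sales-by-zip-code workflow can be sketched in a single process to make the map, shuffle and reduce steps concrete. The records are the twelve from the slide; the function names (`mapSale`, `bucketFor`, `reduceSum`, `sumSalesByZip`), the choice of three reduce tasks and the simple character-code hash used to pick a bucket are assumptions for illustration, not how Hadoop itself partitions keys.

```javascript
// (custId, zipCode, amount) records from the slide's Sales file.
// Zip codes are kept as strings so that "02115" keeps its leading zero.
var sales = [
    [5, "53705", 15], [6, "44313", 10], [5, "53705", 65], [0, "54235", 22],
    [9, "02115", 15], [6, "44313", 25], [3, "10025", 95], [8, "44313", 55],
    [2, "53705", 30], [1, "02115", 15], [4, "54235", 75], [7, "10025", 60]
];

// Map: emit (zipCode, amount); custId is not needed for this query.
function mapSale(record) {
    return { key: record[1], value: record[2] };
}

// Shuffle: choose the reduce task for a key (assumed toy hash).
function bucketFor(key, numReducers) {
    var hash = 0;
    for (var i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
    }
    return hash % numReducers;
}

// Reduce: sum all amounts seen for one zip code.
function reduceSum(key, values) {
    return values.reduce(function (total, amount) {
        return total + amount;
    }, 0);
}

function sumSalesByZip(records, numReducers) {
    // One output bucket per reduce task; each bucket maps zipCode -> amounts
    var buckets = [];
    for (var r = 0; r < numReducers; r++) {
        buckets.push({});
    }
    records.map(mapSale).forEach(function (pair) {
        var bucket = buckets[bucketFor(pair.key, numReducers)];
        if (!bucket[pair.key]) {
            bucket[pair.key] = [];
        }
        bucket[pair.key].push(pair.value);
    });

    // Each "reducer" sorts its keys and sums the values per key
    var totals = {};
    buckets.forEach(function (bucket) {
        Object.keys(bucket).sort().forEach(function (zip) {
            totals[zip] = reduceSum(zip, bucket[zip]);
        });
    });
    return totals;
}

var totals = sumSalesByZip(sales, 3);
console.log(totals);
```

Because every record for a given zip code lands in the same bucket, each reducer can compute its sums independently of the others, which is what lets the reduce step run in parallel.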
HDInsight
HDFS on Azure: Tale of two File Systems
[Diagram: the HDFS API sits in front of two file systems]
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node, …
Azure Storage (ASV) on Azure Blob Storage: Front end, Partition Layer, Stream Layer
Azure Storage (ASV)
• Default file system for HDInsight Service
• Provides shareable, persistent, highly-scalable storage with high availability (Azure Blob Store)
• Azure storage itself does not provide compute
• Fast access from compute nodes to data in same data center
• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires storage key in core-site.xml:
  <property>
    <name>fs.azure.account.key.accountname</name>
    <value>enterthekeyvaluehere</value>
  </property>
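As a small illustration of the addressing pattern above, the sketch below assembles an ASV path from its parts. `buildAsvUri` is a hypothetical helper written for this example (it is not part of any HDInsight SDK), the container, account and path values are made up, the usual `scheme://` URI form is assumed, and the optional `s` is taken to select the secure variant of the scheme.

```javascript
// Hypothetical helper (illustration only): builds a path of the form
// asv[s]://<container>@<account>.blob.core.windows.net/<path>
function buildAsvUri(container, account, path, secure) {
    var scheme = secure ? "asvs" : "asv";
    // Drop leading slashes so the result has exactly one after the host
    var cleanPath = String(path).replace(/^\/+/, "");
    return scheme + "://" + container + "@" + account +
        ".blob.core.windows.net/" + cleanPath;
}

// Made-up container, account and path names
var plainUri = buildAsvUri("logs", "mystorageacct", "/2012/weblogs.txt", false);
var secureUri = buildAsvUri("logs", "mystorageacct", "2012/weblogs.txt", true);
console.log(plainUri);
console.log(secureUri);
```

A path built this way only resolves if the cluster also has the matching storage key configured in core-site.xml.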
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS)
Query (Hive)
Distributed Processing (MapReduce)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Event Pipeline (Flume)
Active Directory (Security)
Monitoring & Deployment (System Center)
C#, F#, .NET
JavaScript
Pipeline / workflow (Oozie)
Azure Storage Vault (ASV)
PDW Polybase
Business Intelligence (Excel, Power View, SSAS)
World's Data (Azure Data Marketplace)
Event Driven Processing
Legend: Red = Core Hadoop, Blue = Data processing, Purple = Microsoft integration points and value adds, Orange = Data Movement, Green = Packages
Programming HDInsight
Existing Ecosystem: Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…
.NET: C#, F# Map/Reduce, LINQ to Hive, .NET management clients
JavaScript: JavaScript Map/Reduce, Browser hosted console, Node.js management clients
DevOps / IT Pros: PowerShell, Cross Platform CLI tools
Authoring Jobs, App Integration
Building Developer Experiences
Core Hadoop
Consistent REST APIs
Breadth of Clients (Java, JS, .NET, etc.)
Authoring frameworks and languages
End User Tooling (IDEs, Analyst tools, Command lines)
Connectivity, Programmability, Security
Loosely coupled, Lightweight, Low cost to extend, Scenario oriented
Innovation flows upward
New compute models
Perf enhancements
Extend breadth & depth
Enable new scenarios
Integrate with current tool chains
Traditional RDBMS vs. MapReduce
            TRADITIONAL RDBMS          MAPREDUCE
Data Size   Gigabytes (Terabytes)      Petabytes (Exabytes)
Access      Interactive and Batch      Batch
Updates     Read / Write many times    Write once, Read many times
Structure   Static Schema              Dynamic Schema
Integrity   High (ACID)                Low
Scaling     Nonlinear                  Linear
DBA Ratio   1:40                       1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
The Hadoop Ecosystem: ETL Tools, BI Reporting, RDBMS
Reference: Tom White’s Hadoop: The Definitive Guide
Deploying and Interacting With HDInsight Service
demo
Hadoop on Windows
Insights to all users by activating new types of data
Integrate with Microsoft Business Intelligence
Choice of deployment on Windows Server + Windows Azure
Integrate with Windows Components (AD, System Center)
Easy installation and configuration of Hadoop on Windows
Simplified programming with .NET & JavaScript integration
Integrate with SQL Server Data Warehousing
Differentiation
Microsoft Big Data Solution
Power View, Excel with PowerPivot, Embedded BI, Predictive Analytics
Apps, LOB, CRM, ERP
Microsoft EDW
SSAS, SSRS
Devices, Crawlers, Sensors, Bots
Hadoop on Windows Server, HDInsight
Summary
Hadoop is about massive compute and massive data
The code is brought to the data
Map -> Split the work
Reduce -> Combine the results
Relational databases vs Hadoop? Wrong question - Serve different needs
Resources
• http://www.windowsazure.com/
• http://hadoop.apache.org/
• Nuget: http://nuget.org/packages?q=hadoop
• Hadoop SDK: http://hadoopsdk.codeplex.com
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.