Hadoop on Azure 101 What is the Big Deal? Dennis Mulder Solution Architect Microsoft Corporation


Upload: anthony-jenkins

Post on 24-Dec-2015


Page 1: Hadoop on Azure 101 What is the Big Deal? Dennis Mulder Solution Architect Microsoft Corporation

Center of Excellence

Hadoop on Azure 101 What is the Big Deal?
Dennis Mulder
Solution Architect
Microsoft Corporation

Page 2

Windows Azure Center of Excellence Spotlight

Pilots
Assessment
Architecture and Design Guidance
Modern Apps
Global Scale Design Sessions

Global Services Team
10 Senior Cloud Architects
Dennis Mulder
US, EMEA, APAC

8
Pilots
Cloud Apps
Champs
Services

Dennis Mulder, Solution Architect, [email protected]
Contact
Pilots
Engage

Page 3

Agenda

What is happening?

Why Big Data?

Understanding the Basics

Microsoft and Hadoop

Page 4

Four megatrends will dominate the next decade
Social, Mobility, Big data, Cloud

85 billion mobile apps will be downloaded in 2012
Total digital content will grow 48% from 2011, to 2.7 zettabytes in 2012
Mobility = Data security concerns
91% of organizations expect to spend on mobile devices in 2012
1/2 of companies expect to use internal social network apps in 2012
>80% of new apps in 2012 will be distributed/deployed on clouds
32% of businesses are likely to invest in BI and analytics in 2012
The strategic focus in the cloud will shift in 2012 from infrastructure to application platforms
In 2012, mobile devices will outship PCs by more than 2:1 and generate more revenue than PCs for the first time
Social networking will follow not just people but also appliances, devices and products
34% of CIOs say technology as a service (cloud) will have the most profound effect on the CIO role in the future
2/3 of mobile apps developed in 2012 will integrate with analytics offerings
49% of CIOs rank BI as the top project priority for 2012

Page 5

Microsoft is embracing these megatrends
Social, Mobility, Big data, Cloud

Page 6

Rethinking and evolving business strategies

How will technology megatrends enable you to save money, drive innovation, grow your business, and attract and retain customers?

Social, Mobility, Big data, Cloud

Page 7

Why Big Data?

Page 8
Page 9
Page 10
Page 11

What is Big Data?

[Chart: data Volume plotted against Velocity - Variety - Variability]

Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Storage/GB
1980    $190,000
1990    $9,000
2000    $15
2010    $0.07

Page 12

Example Scenarios

Page 13

The Potential: Solving Specific Industry Problems

eCommerce: mining web logs: collaborative filtering, user experience optimisation…
Manufacturing: detecting trends and anomalies in sensor data: predicting and understanding faults
Capital Markets: joining market and external data: correlation detection for investment strategy identification, risk calculations…
Retail Banking: historical transaction mining: fraud detection, customer segmentation…

Industry-specific data sets leveraged to improve decision making and generate new revenue streams

Page 14

Traditional E-Commerce Data Flow

[Diagram labels: Operational Data (New User Registry, New Purchase, New Product); Logs; Excess Data; ETL Some Data; Data Warehouse]

Page 15

New E-Commerce Big Data Flow

[Diagram labels: Operational Data (New User Registry, New Purchase, New Product); Logs; Raw Data "Store it All" Cluster; Data Warehouse]

How much do views for certain products increase when our TV ads run?
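The question on this slide can be answered with a small computation over the raw logs that a "store it all" cluster keeps. The following is a minimal sketch, not the presenter's implementation: the log record shape (`product`, `minute`), the ad schedule, the function name `adUplift` and all of the data are hypothetical and made up for illustration. It compares the average views per minute while an ad is on air with the average for the rest of the observation period.

```javascript
// Hypothetical raw page-view log: one entry per product view.
// `minute` counts minutes since the start of the observation period (an assumption).
const views = [];
for (let i = 0; i < 20; i++) views.push({ product: "p1", minute: 10 + (i % 10) }); // during the ad
for (let i = 0; i < 25; i++) views.push({ product: "p1", minute: 20 + i });        // outside the ad
views.push({ product: "p2", minute: 12 }); // a different product, ignored below

// Hypothetical TV ad schedule: on air from minute 10 up to (not including) minute 20.
const adWindows = [{ start: 10, end: 20 }];

function adUplift(views, adWindows, product, totalMinutes) {
  const duringAd = (minute) =>
    adWindows.some((w) => minute >= w.start && minute < w.end);

  // Count how many minutes of the observation period had an ad on air.
  let adMinutes = 0;
  for (let m = 0; m < totalMinutes; m++) {
    if (duringAd(m)) adMinutes++;
  }
  const baseMinutes = totalMinutes - adMinutes;

  // Count views of the chosen product inside and outside the ad windows.
  let adViews = 0;
  let baseViews = 0;
  for (const v of views) {
    if (v.product !== product) continue;
    if (duringAd(v.minute)) adViews++;
    else baseViews++;
  }

  const adRate = adViews / adMinutes;       // views per minute while the ad runs
  const baseRate = baseViews / baseMinutes; // views per minute otherwise
  return { adRate, baseRate, uplift: adRate / baseRate };
}

const result = adUplift(views, adWindows, "p1", 60);
console.log(result);
```

With this made-up data the product is viewed four times as often per minute while the ad is running. On a real cluster the same counting would be expressed as a distributed job over the stored logs rather than a loop in one process.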

Page 16

Understanding the Basics
Move the Compute to the Data

Page 17

Hadoop Distributed Architecture

Page 18

So How Does It Work?

FIRST, STORE THE DATA

[Diagram labels: Files; Server (x4)]
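Storing the data first means cutting each file into blocks and spreading copies of those blocks over the servers. The following is a minimal sketch of that idea under stated assumptions: the function name `placeBlocks`, the server names, the sizes and the round-robin placement are all hypothetical simplifications. HDFS itself decides replica locations through the Name Node using its own placement policy.

```javascript
// Split a file into fixed-size blocks and assign each block to
// `replication` distinct servers, round-robin (a simplification).
function placeBlocks(fileSize, blockSize, servers, replication) {
  if (replication > servers.length) {
    throw new Error("replication cannot exceed the number of servers");
  }
  const blockCount = Math.ceil(fileSize / blockSize);
  const placement = [];
  for (let b = 0; b < blockCount; b++) {
    const replicas = [];
    for (let r = 0; r < replication; r++) {
      // Consecutive servers, wrapping around, so replicas never share a server.
      replicas.push(servers[(b + r) % servers.length]);
    }
    placement.push({ block: b, servers: replicas });
  }
  return placement;
}

// Hypothetical example: a 200 MB file, 64 MB blocks, four servers, three copies.
const servers = ["server1", "server2", "server3", "server4"];
const placement = placeBlocks(200, 64, servers, 3);
for (const p of placement) {
  console.log(`block ${p.block}: ${p.servers.join(", ")}`);
}
```

Because every block lives on several servers, losing one server does not lose data, and a later processing step can run on any server that already holds a copy of the block it needs.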

Page 19

So How Does It Work?

SECOND, TAKE THE PROCESSING TO THE DATA

// Map Reduce function in JavaScript

// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: add up all the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

[Diagram labels: Code; RUNTIME; Server (x4)]
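To try the word-count functions from this slide without a cluster, a small local harness can stand in for the runtime. This is only a sketch: the `context` object with `write`, and the `values` iterator with `hasNext`/`next`, are mocks modelled on how the slide's code uses them, and `runLocally` is a hypothetical helper, not part of any Hadoop or HDInsight API.

```javascript
// The map and reduce functions from the slide, repeated so the sketch is self-contained.
const map = function (key, value, context) {
  const words = value.split(/[^a-zA-Z]/);
  for (let i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

const reduce = function (key, values, context) {
  let sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical single-process stand-in for the runtime.
function runLocally(lines) {
  // Map phase: collect every emitted value under its key.
  const groups = new Map();
  const mapContext = {
    write(key, value) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    },
  };
  lines.forEach((line, index) => map(index, line, mapContext));

  // Reduce phase: hand each key its values through a mock iterator.
  const output = {};
  const reduceContext = {
    write(key, value) {
      output[key] = value;
    },
  };
  for (const [key, values] of groups) {
    let position = 0;
    const iterator = {
      hasNext: () => position < values.length,
      next: () => values[position++],
    };
    reduce(key, iterator, reduceContext);
  }
  return output;
}

const counts = runLocally([
  "The code is brought to the data",
  "Move the compute to the data",
]);
console.log(counts);
```

The harness groups values by key between the two phases, which is the job the runtime performs across servers in a real deployment.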

Page 20

MapReduce – Workflow

Page 21

Map tasks

Scenario: Get sum sales grouped by zipCode

Blocks of the Sales file in HDFS, stored on DataNode1, DataNode2 and DataNode3, hold records of the form (custId, zipCode, amount):

custId  zipCode  amount
4       54235    $75
7       10025    $60
2       53705    $30
1       02115    $15
5       53705    $65
0       54235    $22
5       53705    $15
6       44313    $10
3       10025    $95
8       44313    $55
9       02115    $15
6       44313    $25

Each Mapper runs Map and then GroupBy over its records, emitting (zipCode, amount) pairs.

One output bucket per reduce task
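The map side of this scenario can be sketched as follows. This is an illustrative sketch rather than Hadoop's actual implementation: `mapSale`, `partition` and `runMapPhase` are hypothetical names, and the modulo partitioner is an assumption chosen for simplicity, so the bucket assignment will not necessarily match the slide's diagram. The sales records are the (custId, zipCode, amount) tuples from the slide.

```javascript
// (custId, zipCode, amount) records from the slide; zip codes are kept as
// strings so that leading zeros such as "02115" survive.
const sales = [
  [4, "54235", 75], [7, "10025", 60], [2, "53705", 30], [1, "02115", 15],
  [5, "53705", 65], [0, "54235", 22], [5, "53705", 15], [6, "44313", 10],
  [3, "10025", 95], [8, "44313", 55], [9, "02115", 15], [6, "44313", 25],
];

// Map: drop custId and emit a (zipCode, amount) pair.
function mapSale(record) {
  const [, zipCode, amount] = record;
  return [zipCode, amount];
}

// Hypothetical partitioner: every pair with the same zipCode must land in the
// same bucket, so the bucket index depends on the key only.
function partition(zipCode, numReducers) {
  return Number(zipCode) % numReducers;
}

// Run the map phase and fill one output bucket per reduce task.
function runMapPhase(records, numReducers) {
  const buckets = Array.from({ length: numReducers }, () => []);
  for (const record of records) {
    const [zipCode, amount] = mapSale(record);
    buckets[partition(zipCode, numReducers)].push([zipCode, amount]);
  }
  return buckets;
}

const buckets = runMapPhase(sales, 3);
buckets.forEach((bucket, i) => console.log(`bucket ${i}:`, bucket));
```

The property that matters is that all pairs for one zipCode end up in the same bucket. That is what lets each reduce task compute complete sums on its own.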

Page 22

Reduce tasks

Mapper outputs (zipCode, amount):
Mapper: 53705 $65, 54235 $75, 54235 $22, 10025 $95, 44313 $55, 10025 $60
Mapper: 53705 $30, 53705 $15, 02115 $15, 02115 $15, 44313 $10, 44313 $25

Shuffle, then Sort, one input group per Reducer:
Reducer: 53705 $65, 53705 $30, 53705 $15
Reducer: 44313 $10, 44313 $25, 44313 $55, 10025 $95, 10025 $60
Reducer: 54235 $75, 54235 $22, 02115 $15, 02115 $15

SUM, the output of the Reduce step:
zipCode  total
10025    $155
44313    $90
53705    $110
54235    $97
02115    $30

Done!
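The shuffle, sort and sum steps of this slide can be sketched in a few lines. This is a single-process illustration under stated assumptions: `shuffle` and `reduceSum` are hypothetical names, and a real cluster would run several reducers in parallel, each on its own bucket. The mapper outputs are the (zipCode, amount) pairs shown on the slide.

```javascript
// Output of the two mappers, as (zipCode, amount) pairs from the slide.
const mapperOutputs = [
  [["53705", 65], ["54235", 75], ["54235", 22], ["10025", 95], ["44313", 55], ["10025", 60]],
  [["53705", 30], ["53705", 15], ["02115", 15], ["02115", 15], ["44313", 10], ["44313", 25]],
];

// Shuffle: bring together all amounts that share a zipCode,
// regardless of which mapper emitted them.
function shuffle(outputs) {
  const groups = new Map();
  for (const output of outputs) {
    for (const [zipCode, amount] of output) {
      if (!groups.has(zipCode)) groups.set(zipCode, []);
      groups.get(zipCode).push(amount);
    }
  }
  return groups;
}

// Reduce: sum the amounts for one zipCode.
function reduceSum(zipCode, amounts) {
  return amounts.reduce((total, amount) => total + amount, 0);
}

const groups = shuffle(mapperOutputs);

// Sort the keys, then reduce each group.
const totals = new Map();
for (const zipCode of [...groups.keys()].sort()) {
  totals.set(zipCode, reduceSum(zipCode, groups.get(zipCode)));
}

for (const [zipCode, total] of totals) {
  console.log(`${zipCode} $${total}`);
}
```

The totals agree with the sums on the slide: $155 for 10025, $90 for 44313, $110 for 53705, $97 for 54235 and $30 for 02115.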

Page 23

HDInsight

Page 24

MICROSOFT CONFIDENTIAL – INTERNAL ONLY

HDFS on Azure: Tale of two File Systems

[Diagram labels: HDFS API; DFS (1 Data Node per Worker Role) and Compute Cluster, with Name Node, Data Node, Data Node; Azure Storage (ASV): Azure Blob Storage, with Front end (x3), Partition Layer, Stream Layer]

Page 25

Azure Storage (ASV)
• Default file system for HDInsight Service
• Provides shareable, persistent, highly-scalable Storage with high availability (Azure Blob Store)
• Azure storage itself does not provide compute
• Fast access from compute nodes to data in same data center
• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Requires storage key in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
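The addressing scheme and the key property on this slide can be assembled with a pair of small helpers. This is a hedged sketch: `asvUri` and `storageKeyProperty` are hypothetical names, the container, account and key values are placeholders, and the `//` after the scheme is assumed from standard URI syntax. The property name simply follows the pattern shown on the slide.

```javascript
// Build an ASV address for a blob path; `secure` selects the asvs scheme.
function asvUri({ container, account, path, secure = false }) {
  const scheme = secure ? "asvs" : "asv";
  const cleanPath = path.replace(/^\/+/, ""); // avoid a doubled slash before the path
  return `${scheme}://${container}@${account}.blob.core.windows.net/${cleanPath}`;
}

// Render the core-site.xml property that carries the storage key,
// following the fs.azure.account.key.<accountname> pattern from the slide.
function storageKeyProperty(account, key) {
  return [
    "<property>",
    `  <name>fs.azure.account.key.${account}</name>`,
    `  <value>${key}</value>`,
    "</property>",
  ].join("\n");
}

// Placeholder values, for illustration only.
const plain = asvUri({ container: "logs", account: "myaccount", path: "/2012/web.log" });
const secure = asvUri({ container: "logs", account: "myaccount", path: "2012/web.log", secure: true });
console.log(plain);
console.log(secure);
console.log(storageKeyProperty("myaccount", "enterthekeyvaluehere"));
```

Keeping the key in configuration rather than in the address means the same `asv` path can be shared between jobs without exposing the credential.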

Page 26

HDINSIGHT / HADOOP Eco-System

Distributed Storage (HDFS)
Azure Storage Vault (ASV)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Event Pipeline (Flume)
Pipeline / workflow (Oozie)
Event Driven Processing
Active Directory (Security)
Monitoring & Deployment (System Center)
C#, F#, .NET
JavaScript
PDW Polybase
Business Intelligence (Excel, Power View, SSAS)
World's Data (Azure Data Marketplace)

Legend
Red = Core Hadoop
Blue = Data processing
Purple = Microsoft integration points and value adds
Orange = Data Movement
Green = Packages

Page 27

Programming HDInsight

Existing Ecosystem: Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…
.NET: C#, F# Map/Reduce, LINQ to Hive, .NET management clients
JavaScript: JavaScript Map/Reduce, Browser hosted console, Node.js management clients
DevOps / IT Pros: PowerShell, Cross Platform CLI tools

Page 28

Building Developer Experiences

Authoring Jobs
App Integration

Core Hadoop
Consistent REST APIs
Breadth of Clients (Java, JS, .NET, etc)
Authoring frameworks and languages
End User Tooling (IDEs, Analyst tools, Command lines)

Connectivity
Programmability
Security
Loosely coupled
Lightweight
Low cost to extend
Scenario oriented

Innovation flows upward
New compute models
Perf enhancements
Extend breadth & depth
Enable new scenarios
Integrate with current tool chains

Page 29

Traditional RDBMS vs. MapReduce

             TRADITIONAL RDBMS          MAPREDUCE
Data Size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and Batch      Batch
Updates      Read / Write many times    Write once, Read many times
Structure    Static Schema              Dynamic Schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA Ratio    1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Page 30

The Hadoop Ecosystem

ETL Tools, BI Reporting, RDBMS

Reference: Tom White’s Hadoop: The Definitive Guide

Page 31

Deploying and Interacting With HDInsight Service

demo

Page 32

Hadoop on Windows

Insights to all users by activating new types of data
Integrate with Microsoft Business Intelligence
Choice of deployment on Windows Server + Windows Azure
Integrate with Windows Components (AD, System Center)
Easy installation and configuration of Hadoop on Windows
Simplified programming with .NET & JavaScript integration
Integrate with SQL Server Data Warehousing

Differentiation

Page 34

Summary

Hadoop is about massive compute and massive data
The code is brought to the data
Map -> Split the work
Reduce -> Combine the results
Relational databases vs Hadoop? Wrong question - Serve different needs

Page 35

Resources
• http://www.windowsazure.com/
• http://hadoop.apache.org/
• Nuget: http://nuget.org/packages?q=hadoop
• Hadoop SDK: http://hadoopsdk.codeplex.com


Page 37

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.