BIG-TIME: Introducing Hadoop on Azure

Post on 01-Nov-2014

DESCRIPTION

Introduction to HDInsight service (aka Hadoop on Azure)

TRANSCRIPT

Big Data

The problem is simple

• While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.

• One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.

• Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
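
The figures in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch; it assumes 1 TB is taken as 1,000,000 MB, and the 100-drive case is a hypothetical illustration of reading in parallel, not a number from the slides.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to read an entire drive sequentially, in seconds."""
    return capacity_mb / transfer_mb_per_s


# 1990 drive: 1,370 MB at 4.4 MB/s
drive_1990 = full_read_seconds(1_370, 4.4)

# Modern drive: 1 TB (assumed to be 1,000,000 MB) at 100 MB/s
drive_modern = full_read_seconds(1_000_000, 100)

# Hypothetical: the same 1 TB spread evenly over 100 drives read in parallel
drives = 100
parallel_read = full_read_seconds(1_000_000 / drives, 100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive: {drive_modern / 3600:.1f} hours")
print(f"1 TB over {drives} drives: {parallel_read:.0f} seconds")
```

Splitting the data across many drives is what brings the read time back down, which is the motivation for a distributed file system.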

ParallelGo

Cloud computing changes the way applications grow

http://journals.worldnomads.com/davidsgibson/photo/22804/664941/USA/Elephant-shaped-cloud!

Yaniv Rodenski

Senior Consultant, Sela Group

http://blogs.microsoft.co.il/blogs/roadan

Twitter: @YRodenski

yanivr@sela.co.il

BIG-TIME: Introducing Hadoop on Azure

David Ginzburg

Big Data infrastructure consultant

Twitter: @David_Ginzburg

davidginzburg@gmail.com

AGENDA

Apache™ Hadoop™

Hadoop Distributed File System (HDFS)

HDFS Client

MapReduce via WordCount

Hello World

Hello Azure

Goodbye Cruel World

(Diagram: the map step emits a count of 1 for each word; the reduce step sums the counts per word, giving 2, 2, 1, 1, 1.)
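
The WordCount flow on this slide can be sketched in plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps using the three input lines from the slide; the function names are hypothetical and this is not the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted counts by word, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all counts collected for one word."""
    return word, sum(counts)


lines = ["Hello World", "Hello Azure", "Goodbye Cruel World"]
mapped = chain.from_iterable(map_words(line) for line in lines)
word_counts = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

In a real cluster each input line (or split) would be mapped on a different node, and the shuffle would move the pairs across the network to the reducers.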

A new way to MapReduce

DEMO

Hadoop MapReduce Processing

(Diagram labels: Job Client; three Input Splits; Merge)

MapReduce TMI

(Diagram labels: Input Split; Buffer; Partition, Sort, and spill to disk; Fetch; Map Output; Sort; Merge result; Output)

Partitioners

Combiners
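
A rough sketch of what these two hooks do, in plain Python rather than the Hadoop API. Hadoop's default partitioner assigns a key to a reducer by hashing the key modulo the number of reduce tasks; here `zlib.crc32` is a stand-in hash chosen only because it is deterministic across runs. The function names and sample data are hypothetical.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Partitioner: pick the reducer for a key (stand-in hash, modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-sum counts on the map side to shrink what is sent to reducers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


# Output of one hypothetical map task
map_output = [("Hello", 1), ("World", 1), ("Hello", 1), ("Azure", 1)]

combined = combine(map_output)  # four pairs shrink to three

# Route each combined pair to one of two reducers
buckets = {}
for word, count in combined:
    buckets.setdefault(partition(word, 2), []).append((word, count))

print(combined)
print(buckets)
```

A combiner is only safe when the reduce operation is associative and commutative, as summing is here.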

The TeraSort Use case

Beginners Pitfalls

Distinct Values Problem Statement

http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
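
The body of this slide did not survive extraction, so the following is only a sketch under an assumed formulation: given records of (group, value), count the distinct values in each group. A common two-phase approach first collapses duplicate (group, value) pairs and then counts the surviving pairs per group. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def phase1_dedupe(records):
    """Phase 1: key on the whole (group, value) pair so duplicates collapse to one."""
    return set(records)


def phase2_count(unique_pairs):
    """Phase 2: key on the group and count the unique pairs that reach it."""
    counts = defaultdict(int)
    for group, _value in unique_pairs:
        counts[group] += 1
    return dict(counts)


# Hypothetical input records: (group, value)
records = [("g1", "a"), ("g1", "a"), ("g1", "b"), ("g2", "a")]

distinct_counts = phase2_count(phase1_dedupe(records))
print(distinct_counts)
```

In MapReduce terms each phase would be its own job, with the output of the first feeding the mapper of the second.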

Administering Hadoop in the real world

DEMO

Why did Microsoft choose Hadoop?

Hadoop on Azure

Using hadooponazure.com

DEMO

Windows Azure Compute

Azure Role

Supporting service

Application

Configuration

Hadoop on Azure Roles

Azure Role

Monitoring service (RdAdmin)

Hadoop services

Configuration

Hadoop MapReduce Processing

(Diagram labels: Fabric Controller; one Head Node running the Name Node; multiple Worker Nodes, each running a Data Node)

The Head Node Template

The Worker Node Template

Node VM Templates

               HEAD NODE      WORKER NODE
VM Template    Extra Large    Medium
Cores          8              2
Memory         14 GB          3.5 GB
HD             2 TB           489 GB

Cloud Storage

High Availability on Azure

Fabric Controller

Head Node

Name Node

Head Node

Name Node

Azure Storage

Elastic MapReduce

(Diagram labels: Storage Client; Amazon S3; Azure Storage; one Head Node running the Jobtracker; three Worker Nodes, each running a Tasktracker)

Using Elastic MapReduce

DEMO

Azure Blob Considerations

Storage Size Limitations

IsotopeJS

Using the JavaScript interactive console

DEMO

Using Hive

DEMO

Summary

Q & A

Resources

My Blog: http://bit.ly/roadan

Apache™ Hadoop™: http://hadoop.apache.org

Hadoop on Azure: http://www.hadooponazure.com

Tom White, Hadoop: The Definitive Guide: http://shop.oreilly.com/product/9780596521981.do

Windows Azure Developer Center: http://www.windowsazure.com/en-us/develop/overview

Thanks!

Yaniv Rodenski

Twitter: @YRodenski
