“Introducing Hadoop on Azure: hello Map-Reduce!” Joe Hummel, PhD. Visiting Researcher: U. of California, Irvine. Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago. Materials: http://www.joehummel.net/downloads.html Email: [email protected]

Upload: kamala

Post on 25-Feb-2016


TRANSCRIPT

Page 1: “Introducing  Hadoop  on Azure:

“Introducing Hadoop on Azure: hello Map-Reduce!”

Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago

Materials: http://www.joehummel.net/downloads.html
Email: [email protected]

Page 2: “Introducing  Hadoop  on Azure:

Hadoop on Azure 2

Agenda

A little history…
Why Hadoop?
How it works
Demos
Summary

Page 3: “Introducing  Hadoop  on Azure:

A little history…

Map-Reduce is from functional programming

// function returns 1 if i is prime, 0 if not:
let isPrime(i) = ...

// sums 2 numbers:
let sum(x, y) = return x + y

// count the number of primes in 1..N:
let countPrimes(N) =
  let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
  let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
  let count = reduce sum T    // 42
  return count
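The pseudocode above can be tried directly in JavaScript, the language used for the demos later in the deck. The following is a minimal sketch, not the author's code: the body of isPrime is elided on the slide, so the trial-division version here is an assumption made only so the example runs.

```javascript
// Returns 1 if i is prime, 0 if not.
// ASSUMPTION: the slide elides this body; simple trial division is used here.
function isPrime(i) {
  if (i < 2) return 0;
  for (var d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the number of primes in 1..N using map and reduce.
function countPrimes(N) {
  var L = [];
  for (var i = 1; i <= N; i++) L.push(i);  // [ 1, 2, 3, 4, 5, 6, ... ]
  var T = L.map(isPrime);                   // [ 0, 1, 1, 0, 1, 0, ... ]
  return T.reduce(sum, 0);                  // total number of 1s
}

console.log(countPrimes(100));
```

Each call to isPrime is independent of the others, which is what lets the map step be spread across many machines; only the reduce step has to combine results.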

Page 4: “Introducing  Hadoop  on Azure:

A little more history…

Created to drive internet search
◦ BIG data: scalable to TBs and beyond
◦ Parallelism: to get the performance
◦ Data partitioning: to drive the parallelism
◦ Fault tolerance: at this scale, machines are going to crash, a lot…

[Figure: BIG data → page hits]

Page 5: “Introducing  Hadoop  on Azure:

Who’s using Hadoop

Search engines: Google, Yahoo, Bing
Facebook
Twitter
Financials
Health industry
Insurance
Credit card companies
Just about any company collecting user data…

Page 6: “Introducing  Hadoop  on Azure:

Hadoop today

Freely-available framework for big data
◦ http://hadoop.apache.org/

Based on concept of Map-Reduce:

[Figure: BIG data is split across many Map tasks (map function); a Reduce step (reduce intermediate results) combines their output into the result R]

Page 7: “Introducing  Hadoop  on Azure:

Massively-parallel

[Figure: many Mappers running in parallel, feeding a smaller set of Reducers]

Page 8: “Introducing  Hadoop  on Azure:

Workflow

[Figure: Data → Map → Sort → Merge → Reduce → R]

Map output:    [ <key1,value>, <key4,value>, <key2,value>, … ]
Sort output:   [ <key1,value>, <key1,value>, … ]
Merge output:  [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
Reduce output: [ <key1, value>, <key2, value>… ]
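The four workflow stages can be imitated in memory to see how the data changes shape at each step. This is only a single-machine sketch under stated assumptions: runMapReduce and its callback signatures are hypothetical names invented for illustration, not part of Hadoop, and the input records are made up.

```javascript
// Runs one map-reduce pass over an array of records, in memory.
// HYPOTHETICAL helper: mapFn(record) returns an array of [key, value] pairs;
// reduceFn(key, values) returns the reduced value for that key.
function runMapReduce(records, mapFn, reduceFn) {
  // Map: every record yields zero or more <key, value> pairs.
  var pairs = [];
  records.forEach(function (record) {
    mapFn(record).forEach(function (pair) { pairs.push(pair); });
  });

  // Sort: order the pairs by key so that equal keys become adjacent.
  pairs.sort(function (a, b) {
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });

  // Merge: collapse adjacent pairs into <key, [value, value, ...]>.
  var groups = [];
  pairs.forEach(function (pair) {
    var last = groups[groups.length - 1];
    if (last && last[0] === pair[0]) {
      last[1].push(pair[1]);
    } else {
      groups.push([pair[0], [pair[1]]]);
    }
  });

  // Reduce: one <key, value> result per key.
  return groups.map(function (group) {
    return [group[0], reduceFn(group[0], group[1])];
  });
}

// Usage with made-up "key,value" records: sum the values per key.
var result = runMapReduce(
  ["key1,3", "key4,7", "key2,1", "key1,5"],
  function (line) {
    var fields = line.split(",");
    return [[fields[0], parseInt(fields[1], 10)]];
  },
  function (key, values) {
    return values.reduce(function (s, v) { return s + v; }, 0);
  }
);
console.log(JSON.stringify(result));  // [["key1",8],["key2",1],["key4",7]]
```

In a real cluster the map calls run on different machines and the sort and merge steps move data between them; the sketch keeps only the data transformations.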

Page 9: “Introducing  Hadoop  on Azure:

Example

Netflix data-mining…

[Figure: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…]

movieid,userid,rating,date
1,2390087,3,2005-09-06
217,5567801,5,2006-01-03
42,1121098,3,2006-03-25
1,8972234,5,2003-12-02
...

Page 10: “Introducing  Hadoop  on Azure:

Netflix Workflow

[Figure: Data → Map → Sort → Merge → Reduce → R]

Map output:    [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
Sort output:   [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
Merge output:  [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
Reduce output: [ <1, 4>, <42, 2>, <134, ?>, … ]

Page 11: “Introducing  Hadoop  on Azure:

Netflix map / reduce functions?

To compute average rating for every movie:

// JavaScript version:
var map = function (key, value, context) {
  var values = value.split(",");
  // field 0 contains movieid, field 2 the rating:
  context.write(values[0], values[2]);
};

var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;

  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};
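To check these functions without a cluster, they can be driven by small stand-ins for the objects Hadoop passes in. The sketch below is an assumption-laden test harness, not the Hadoop runtime: makeContext, makeIterator and averageRatings are hypothetical names, and the input is the four sample rows from the earlier slide with the header line assumed to be already removed (the map function as written would otherwise emit it as a record).

```javascript
// map / reduce as on the slide, repeated so the sketch is self-contained.
var map = function (key, value, context) {
  var values = value.split(",");
  context.write(values[0], values[2]);  // <movieid, rating>
};

var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// HYPOTHETICAL stand-in for the context object: collects <key, value> pairs.
function makeContext() {
  var pairs = [];
  return {
    pairs: pairs,
    write: function (key, value) { pairs.push([key, value]); }
  };
}

// HYPOTHETICAL stand-in for the values iterator handed to reduce.
function makeIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; }
  };
}

// Local driver (an assumption, not Hadoop's scheduler).
function averageRatings(lines) {
  var mapped = makeContext();
  lines.forEach(function (line, n) { map(n, line, mapped); });

  // Sort + merge: group the ratings by movieid.
  var groups = {};
  mapped.pairs.forEach(function (p) {
    (groups[p[0]] = groups[p[0]] || []).push(p[1]);
  });

  var reduced = makeContext();
  Object.keys(groups).forEach(function (movieid) {
    reduce(movieid, makeIterator(groups[movieid]), reduced);
  });
  return reduced.pairs;
}

var ratings = averageRatings([
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02"
]);
console.log(JSON.stringify(ratings));  // [["1",4],["42",3],["217",5]]
```

Movie 1 averages to 4, matching the <1, 4> entry on the workflow slide; movie 42 shows 3 rather than 2 here only because the second rating for it is not among the four sample rows.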

Page 12: “Introducing  Hadoop  on Azure:

Traditional use of Hadoop

Upload data to HDFS
◦ Hadoop Distributed File System

Write map / reduce functions
◦ default is to use Java
◦ most languages supported: C, C++, C#, JavaScript, Python, …

Compile and upload code
◦ For Java, you upload .jar file
◦ For others, .exe or script

Submit MapReduce job
Wait for job to complete

Page 13: “Introducing  Hadoop  on Azure:

When to use Hadoop?

Queries against big datasets
Embarrassingly-parallel problems
◦ Solution must fit into map-reduce framework
Non-real-time demands

Hadoop is not for:
◦ Small datasets (< 1GB?)
◦ Sub-second / real-time needs (though clearly Google makes it work)

Page 14: “Introducing  Hadoop  on Azure:

Data set for demo

We’ll be working with Chicago crime data…
◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html

Size: 1 GB, 5M rows

Page 15: “Introducing  Hadoop  on Azure:

Goal?

Compute top-10 crimes…

IUCR  Count
0486  366903
0820  308074
...
0890  166916

IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

Page 16: “Introducing  Hadoop  on Azure:

Hadoop on Azure…

Supports traditional Hadoop usage
◦ Upload data
◦ Write MapReduce program
◦ Submit job

Additional features:
◦ Allows access to persistent data from Azure Storage Vault
◦ Provides interactive JavaScript console
◦ Built-in higher-level query languages (PIG, HIVE)

Demo: Hadoop on Azure

Page 17: “Introducing  Hadoop  on Azure:

Demo: map reduce functions

// JavaScript version:
var map = function (key, value, context) {
  var values = value.split(",");
  // field 4 is the key being counted (the IUCR code in the output below):
  context.write(values[4], 1);
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

Output (IUCR, Count):
0486  366903
0820  308074
...

Page 18: “Introducing  Hadoop  on Azure:

Demo: PIG command

// interactive PIG with explicit Map-Reduce functions:
pig.from("asv://datafiles/CC-from-2001.txt").
    mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
    orderBy("Count DESC").
    take(10).
    to("output-from-2001")

// visualize the results:
file = fs.read("output-from-2001/part-r-00000")
data = parse(file.data, "IUCR, Count:long")
graph.bar(data)
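What the PIG pipeline computes (count per IUCR code, order by count descending, keep the first 10) can be approximated in plain JavaScript on a small sample. This is an illustrative sketch only: topCounts is a hypothetical helper, the rows are made-up placeholders with the IUCR code in field 4, and the naive split(",") would mishandle quoted fields that contain commas.

```javascript
// Counts occurrences of one comma-separated field and returns the top n.
// HYPOTHETICAL helper, standing in for mapReduce + orderBy + take.
function topCounts(lines, fieldIndex, n) {
  var counts = {};
  lines.forEach(function (line) {
    var key = line.split(",")[fieldIndex];   // map: emit <key, 1>
    counts[key] = (counts[key] || 0) + 1;    // reduce: sum the 1s per key
  });
  return Object.keys(counts)
    .map(function (key) { return { IUCR: key, Count: counts[key] }; })
    .sort(function (a, b) { return b.Count - a.Count; })  // like orderBy("Count DESC")
    .slice(0, n);                                          // like take(n)
}

// Made-up sample rows (NOT real crime records); field 4 holds the IUCR code.
var sample = [
  "id1,case1,date1,block1,0486",
  "id2,case2,date2,block2,0820",
  "id3,case3,date3,block3,0486",
  "id4,case4,date4,block4,0890",
  "id5,case5,date5,block5,0820",
  "id6,case6,date6,block6,0486"
];

var top2 = topCounts(sample, 4, 2);
console.log(JSON.stringify(top2));
// [{"IUCR":"0486","Count":3},{"IUCR":"0820","Count":2}]
```

On the full data set the counting step is the part that benefits from Hadoop; ordering and truncating the per-code counts is cheap because there are far fewer distinct codes than rows.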

Page 19: “Introducing  Hadoop  on Azure:

Hadoop on Azure

Microsoft is offering free access to Hadoop
◦ Request invitation @ http://www.hadooponazure.com/

Hadoop connector for Excel
◦ Process data using Hadoop, analyze/visualize using Excel

Page 20: “Introducing  Hadoop  on Azure:

That’s it!

Page 21: “Introducing  Hadoop  on Azure:

Summary

Hadoop is all about big data processing
◦ Scalable, parallel, fault-tolerant

Easy to understand programming model
◦ Map-Reduce
◦ But then solution must fit into this framework…

Rich ecosystem developing around Hadoop
◦ Technologies: PIG, HIVE, HBase, …
◦ Companies: Cloudera, Hortonworks, MapR, …