Page 1: Hadoop as a Service Boston Azure 29-March-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license

Hadoop as a Service

Boston Azure
29-March-2012

Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license http://creativecommons.org/licenses/by-nc-sa/3.0/

Boston Azure User Group
http://www.bostonazure.org
@bostonazure

Bill Wilder
http://blog.codingoutloud.com
@codingoutloud

Big Data tools for the Windows Azure cloud platform

Page 2:

Windows Azure MVP

Windows Azure Consultant

Boston Azure User Group Founder

Cloud Architecture Patterns book (due 2012)

Bill Wilder

Page 3:

http://hadoop.apache.org/

http://www.google.com/imgres?imgurl=http://www.davidgreco.it/MySite/Blog/Entries/2009/11/17_When_a_Camel_encounters_an_Elephant_files/droppedImage.jpg&imgrefurl=http://www.davidgreco.it/MySite/Blog/Entries/2009/11/17_When_a_Camel_encounters_an_Elephant.html&h=233&w=494&sz=36&tbnid=5-63lDrs-cySWM:&tbnh=58&tbnw=124&zoom=1&docid=45Met2K5vSAjvM&hl=en&sa=X&ei=FsJ0T_H0CYru0gHan8T_Ag&ved=0CGAQ9QEwAg&dur=831

Page 6:

HQL used in demo

show tables;
describe hivesampletable;
describe extended hivesampletable;
select count(*) from hivesampletable;
select country from hivesampletable;
select distinct country from hivesampletable;
select avg(querydwelltime) from hivesampletable;
select devicemodel, SUM(querydwelltime) as totaldwelltime from hivesampletable group by devicemodel order by totaldwelltime DESC limit 10;

Page 7:

JS / Hadoop used in demo

• "hello boston azure" (shows it’s really JavaScript!)
• #ls
• #cat WordCount.js
• pig.from("data").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("out")

GRATUITOUS JS GRAPH DRAWING WITH OUTPUT
• file = fs.read("DaVinciTop10Words")
• data = parse(file.data, "word, count:long")
• graph.bar(data)

Page 9:

(March talk was mostly DEMOS)

THE REST OF DECK WAS NOT SHOWN, BUT IS INCLUDED FOR ADD’L CONTEXT. IT IS SAME AS MY HADOOP ON AZURE TALK from JAN 2012 Boston Azure User Group Meeting

Page 10:

We will consider…

1. How might we build a simple Word Frequency Counter?
2. What are Map and Reduce?
3. How do we scale our Word Frequency Counter?
  – Hint: we might use Hadoop
4. How does Windows Azure make Hadoop easier with “Hadoop as a Service”?
  – CTP

Page 11:

Five exabytes of data created every two days
- Eric Schmidt (CEO of Google at the time)

As much as from the dawn of civilization up until 2003

Page 12:

“Big Data” Challenge

Three Vs
• Volume: lots of it already
• Velocity: more of it every day
• Variety: many sources, many formats

Page 13:

Short History of Hadoop

1. Inspired by:
• Google Map/Reduce paper
  – http://research.google.com/archive/mapreduce.html
• Google File System (GFS)
  – Goals: distributed, fault tolerant, fast enough

2. Born in: Lucene Nutch project
• Built in Java
• Hadoop cluster appears as single über-machine

Page 14:

Hadoop: batch processing, big data

• Batch, not real-time or transactional
• Scale out with commodity hardware
• Big customers like LinkedIn and Yahoo!
  – Clusters with 10s of Petabytes
    • (pssst… these fail… daily)
• Import data from Azure Blob, Data Market, S3
  – Or from files, like we will do in our example

Page 15:

Word Frequency Counter – how?

• The “hello world” of Hadoop / MapReduce
  – But we start without Hadoop / MapReduce
• Input: large corpus
  – Wikipedia extract for example
  – Can handle into PB
• Output: list of words, ordered by frequency

the 31415
be 9265
to 3589
of 793
and 238
… …

Page 16:

Simple Word Frequency Counter

// Needs: using System; using System.Collections.Generic; using System.IO;
//        using System.Linq; using System.Text.RegularExpressions;
const string file = @"e:\dev\azure\hadoop\wordcount\davinci.txt";
var text = File.ReadAllText(file);                 // Read in all text

var matches = Regex.Matches(text, @"\b[\w]*\b");   // Parse out words
var words = (from m in matches.Cast<Match>()       // Normalize & Sort
             where !string.IsNullOrEmpty(m.Value)
             orderby m.Value.ToLower()
             select m.Value.ToLower()).ToArray();

var wordCounts = new Dictionary<string, int>();    // How many times does each word appear
foreach (var word in words)
{
    if (wordCounts.ContainsKey(word)) wordCounts[word]++;
    else wordCounts.Add(word, 1);
}

foreach (var wc in wordCounts)
    Console.WriteLine(wc.Key + " : " + wc.Value);

Output:
aware : 7
away : 99
awning : 2
awoke : 1
axes : 16
axil : 3
axiom : 2

Labels on slide: MAP, REDUCE

Page 17:

Map

“Apply a function to each element in this list and return a new list”

Reduce

“Apply a function collectively to all elements in this list and return the final answer.”

Page 18:

Map Example 1

square(x) { return x*x }
{ 1, 2, 3, 4 } → { 1, 4, 9, 16 }

Reduce Example 1

sum(x, y) { return x + y }
{ 1, 2, 3, 4 } → 10

Page 19:

Map Example 1

square(x) { return x*x }
{ 1, 2, 3, 4 } → { 1, 4, 9, 16 }
{ square(1), square(2), square(3), square(4) }

Reduce Example 1

sum(x, y) { return x + y }
{ 1, 2, 3, 4 } → 10
sum(sum(sum(1,2),3),4)
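A minimal Java sketch of the two examples above, using the standard Stream API rather than Hadoop. The class and method names are hypothetical, chosen only for illustration. Note that the fold starts from an identity value of 0, so it evaluates sum(sum(sum(sum(0,1),2),3),4), which gives the same result as the slide's expansion.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class name; mirrors the slide's square/sum examples only.
class MapReduceBasics {
    // Map: apply square to each element and return a new list.
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce: fold sum over the list left to right, starting from 0.
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, (x, y) -> x + y);
    }
}
```

With the input { 1, 2, 3, 4 }, mapSquare returns [1, 4, 9, 16] and reduceSum returns 10.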

Page 20:

Map Example 2

strlen(s) { return s.Length }
{ “Celtics”, “Bruins” } → { 7, 6 }

Reduce Example 2

strlen_sum(x, y) { return x.Length + y.Length }
{ “Celtics”, “Bruins” } → 13

Page 21:

Map Example 3 (the fancy one)

fancy_mapper(s) {
    if (s == "SkipMe") return null;
    return ToLower(s) + ", " + s.Length;
}

{ “Will”, “Dan”, “SkipMe”, “Kevin”, “T.J.” } → { “will, 4”, “dan, 3”, “kevin, 5”, “t.j., 4” }
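The fancy mapper can be sketched in plain Java as follows. The point is that a mapper may emit fewer outputs than it receives inputs. Dropping the null results is an assumption made here to match the output shown on the slide; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical class name; a plain-Java rendering of the slide's pseudocode.
class FancyMapperSketch {
    // Returns null for the sentinel "SkipMe", otherwise "<lowercased>, <length>".
    static String fancyMapper(String s) {
        if (s.equals("SkipMe")) return null;
        return s.toLowerCase(Locale.ROOT) + ", " + s.length();
    }

    // Applies the mapper to every element and drops the null results (assumption).
    static List<String> mapAll(List<String> input) {
        return input.stream()
                .map(FancyMapperSketch::fancyMapper)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }
}
```

Calling mapAll on the five names above yields four strings, because "SkipMe" is filtered out.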

Page 22:

Problems with Word Counter?

What happens if our data is…
• 1 GB, 1 TB, 1 PB, …

What happens if our data is…
• Images, videos, tweets, Facebook updates, …

What happens if our processing is…
• Complex, multiple steps, …

Page 23:

Simplified Example

• Word Frequency Counter
• Which word appears most frequently, and how many times?

Page 24:

Workflow

1. Setup
2. Map
3. Shuffle
4. Reduce
5. Celebrate

Page 25:

Hadoop Cluster

• 1 MASTER NODE
  - Job Tracker
  - “the boss”
• Many SLAVE NODES
  - Task Tracker on each

HDFS on all nodes

Page 26:

Step 1. Setup

(Assumes: you’ve installed Hadoop on a cluster of computers)

You supply:
1. Map and Reduce logic
  – This is “code” – packaged in a Java JAR file
  – Other language support exists, more coming
2. A big pile of input files
  – “Get the data here”
  – For Word Frequency Counter, we might use Wikipedia or Project Gutenberg files
3. Go!

Page 27:

Step 2. Map

• Job Tracker distributes your Mapper and Reducer
  – To every node
• Job Tracker distributes your data
  – To some nodes (at least 3) in 64 MB chunks
• Task Tracker on each node calls Mapper
  – Repeatedly until done; lots of parallelism
• Job Tracker watches for problems
  – Manages retries, failures, optimistic attempts
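To get a feel for the "64 MB chunks on at least 3 nodes" figures, here is a rough arithmetic sketch in Java. It assumes 64 MB means 64 × 1024 × 1024 bytes and that each chunk is stored exactly 3 times; both are simplifications, and the class and method names are hypothetical.

```java
// Hypothetical class name; back-of-the-envelope chunk arithmetic only.
class ChunkMathSketch {
    static final long CHUNK_BYTES = 64L * 1024 * 1024; // 64 MB chunk size from the slide
    static final int COPIES = 3;                       // assumes exactly 3 copies per chunk

    // Number of chunks needed for a file; the last chunk may be only partly filled.
    static long chunks(long fileBytes) {
        return (fileBytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    // Total chunk copies stored across the cluster.
    static long chunkCopies(long fileBytes) {
        return chunks(fileBytes) * COPIES;
    }
}
```

Under these assumptions a 1 GB input (1024 × 1024 × 1024 bytes) splits into 16 chunks, stored as 48 chunk copies across the cluster.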

Page 28:

Mapper’s Job is Simple*

• Read input, write output – that’s all
  – Function signature: Map(text)
  – Parses the text and returns { key, value }
  – Map(“a b a foo”) returns {a, 1}, {b, 1}, {a, 1}, {foo, 1}

* for Word Frequency Counter!
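The Map(text) signature can be sketched in plain Java, without any Hadoop types, as below. This is only an illustration of the { key, value } idea; the class name is hypothetical, and Map.entry (Java 9+) stands in for the pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical class name; a Hadoop-free illustration of the mapper's contract.
class WordCountMapSketch {
    // Splits the text on whitespace and emits { token, 1 } for every token.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }
}
```

Calling map("a b a foo") returns four pairs in input order; the duplicate "a" is not merged at this stage, since counting is left to Reduce.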

Page 29:

Actual Java Map Function

// word (a Text) and one (an IntWritable holding 1) are assumed to be fields of the
// enclosing Mapper class, as in the standard Hadoop WordCount example
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the input on whitespace and emit { token, 1 } for each token
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

Page 30:

Step 3. Shuffle

• Shuffle collects all data from Map, organizes by key, and redistributes data to the nodes for Reduce

Page 31:

Step 3. Shuffle – example

• Mapper input:
“the Bruins are the best in the NHL”

• Mapper output:
{ the, 1 } { the, 1 } { the, 1 }
{ Bruins, 1 } { best, 1 } { NHL, 1 }
{ are, 1 } { in, 1 }

• Shuffle transforms this into Reducer input:
{ are, [ 1 ] } { in, [ 1 ] } { Bruins, [ 1 ] }
{ best, [ 1 ] } { the, [ 1, 1, 1 ] } { NHL, [ 1 ] }
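The grouping part of Shuffle can be sketched in a single Java process as below. A real cluster also moves the grouped data between nodes, which this sketch leaves out; the class name is hypothetical, and the choice of a sorted TreeMap is an assumption for readable output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; single-process illustration of grouping by key.
class ShuffleSketch {
    // Groups mapper output by key: { key, 1 } pairs become { key, [ 1, 1, ... ] }.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }
}
```

Fed the eight mapper pairs from the Bruins sentence, this produces six keys, with "the" mapped to [1, 1, 1] and every other word mapped to [1].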

Page 32:

Step 4. Reduce

• Output from Step 3. Shuffle has been distributed to datanodes
• Your “Reducer” is called on local data
  – Repeatedly, until all complete
  – Tasks run in parallel on nodes
• This is very simple for Word Frequency Counter!
  – Function signature: Reduce(key, values[])
  – Adds up all the values and returns { key, sum }
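The Reduce(key, values[]) signature can be sketched in plain Java, again without Hadoop types. The class name is hypothetical and Map.entry (Java 9+) stands in for the { key, sum } pair.

```java
import java.util.List;
import java.util.Map;

// Hypothetical class name; a Hadoop-free illustration of the reducer's contract.
class ReduceSketch {
    // Adds up all the values for one key and returns { key, sum }.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return Map.entry(key, sum);
    }
}
```

For the shuffled input { the, [ 1, 1, 1 ] } this returns { the, 3 }.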

Page 33:

Actual Java Reduce Function

// result (an IntWritable) is assumed to be a field of the enclosing Reducer class
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Add up all the counts that Shuffle grouped under this key
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}

Page 34:

Step 5. Celebrate!

• You are done – you have your output
• In a more complex scenario, might repeat
  – Hadoop tool ecosystem knows how to do this
• There are other projects in the Hadoop ecosystem for …
  – Multi-step jobs
  – Managing a data warehouse
  – Supporting ad hoc querying
  – And more!

Page 35:

www.hadooponazure.com

demo

Page 36:

There’s a LOT MORE to the Hadoops…

• Hadoop streaming interface allows other languages
  – C#
• HIVE (HiveQL)
• Pig (Pig Latin language)
• Cascading.org
• Commercial companies dedicated:
  – HortonWorks

Page 37:

Questions?
Comments?

More information?

?