Ruby on Hadoop



DESCRIPTION

An introduction to Hadoop, along with a brief overview of the Wukong and wukong-hadoop gems.

TRANSCRIPT

Page 1: Ruby on hadoop

Ruby on Hadoop
Tuesday, January 8, 2013

Page 2: Ruby on hadoop

Introduction

Hi. I’m Ted O’Meara

...and I just quit my job last week.

@tomeara
tedomeara.com


Page 3: Ruby on hadoop

MapReduce

Page 4: Ruby on hadoop

History of MapReduce

•First implemented by Google

•Used in CouchDB, Hadoop, etc.

•Helps to “distill” data into a concentrated result set


Page 5: Ruby on hadoop

What is MapReduce?


Page 6: Ruby on hadoop

What is MapReduce?

input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

input.map! { |x| [x, 1] }

sum = 0
input.each do |x|
  sum += x[1]
end
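The snippet above maps every word to a [word, 1] pair and then sums all the counts into a single total. A slightly fuller word-count sketch (mine, not from the slides) also groups the pairs by key before reducing, which is the step Hadoop's shuffle/sort phase performs between map and reduce:

input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: emit a [word, 1] pair for every word
pairs = input.map { |word| [word, 1] }

# Shuffle: group the pairs by their key (the word)
grouped = pairs.group_by { |word, _count| word }

# Reduce: sum the counts for each key
counts = grouped.map { |word, ones| [word, ones.map(&:last).reduce(:+)] }

# counts => [["deer", 2], ["bear", 2], ["river", 2], ["car", 3]]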


Page 7: Ruby on hadoop

Hadoop Breakdown

Page 8: Ruby on hadoop

History of Hadoop

•Doug Cutting @ Yahoo!
•It is a toy elephant
•It is also a framework for distributed computing
•It is a distributed filesystem


Page 9: Ruby on hadoop

Network Topology


Page 10: Ruby on hadoop

Hadoop Cluster

[Diagram: a Hadoop cluster of TaskTracker/DataNode machines grouped into racks 555.555.1.*, 555.555.2.*, and 444.444.1.*, plus a JobTracker and a NameNode]

Cluster
•Commodity hardware
•Partition tolerant
•Network-aware (rack-aware)


Page 11: Ruby on hadoop

Hadoop Cluster

[Cluster diagram repeated from Page 10]

NameNode
•Keeps track of the DataNodes
•Uses “heartbeat” to determine a node’s health
•The most resources should be spent here


Page 12: Ruby on hadoop

Hadoop Cluster

[Cluster diagram repeated from Page 10]

DataNode
•Stores filesystem blocks
•Can be scaled; spun up/down
•Replicates based on a set replication factor


Page 13: Ruby on hadoop

Hadoop Cluster

[Cluster diagram repeated from Page 10]

JobTracker
•Delegates which TaskTrackers should handle a MapReduce job
•Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source exists


Page 14: Ruby on hadoop

Hadoop Cluster

[Cluster diagram repeated from Page 10]

TaskTracker
•Worker for MapReduce jobs
•The closer to the DataNode with the data, the better


Page 15: Ruby on hadoop

HDFS


Page 16: Ruby on hadoop

HDFS

[Cluster diagram repeated from Page 10]

hadoop fs -put localfile /user/hadoop/hadoopfile



Page 17: Ruby on hadoop

Hadoop Streaming


Page 18: Ruby on hadoop

Hadoop Streaming

[Cluster diagram repeated from Page 10]

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input "/user/me/samples/cachefile/input.txt" \
  -mapper "xargs cat" \
  -reducer "cat" \
  -output "/user/me/samples/cachefile/out" \
  -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
  -jobconf mapred.map.tasks=3 \
  -jobconf mapred.reduce.tasks=3 \
  -jobconf mapred.job.name="Experiment"
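Streaming runs any executable that reads records on STDIN and writes key/value lines to STDOUT, which is exactly how Ruby scripts plug in. Below is a minimal word-count pair, a sketch of my own rather than something from the deck (the file names mapper.rb and reducer.rb are illustrative); the two scripts could be handed to the streaming jar through the -mapper and -reducer options above:

# mapper.rb - emit "word <TAB> 1" for every whitespace-separated word
STDIN.each_line do |line|
  line.split(/\s+/).each { |word| puts "#{word}\t1" unless word.empty? }
end

# reducer.rb - input arrives sorted by key, so sum runs of the same word
current, count = nil, 0
STDIN.each_line do |line|
  word, n = line.chomp.split("\t")
  if word == current
    count += n.to_i
  else
    puts "#{current}\t#{count}" if current
    current, count = word, n.to_i
  end
end
puts "#{current}\t#{count}" if current

Hadoop sorts the mapper output by key before it reaches each reducer, which is why the reducer only needs to track runs of identical words.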


Page 19: Ruby on hadoop

Hadoop Streaming

Hadoop Ecosystem

Pig: Pig Latin
Hive: SQL-ish
Wukong: Ruby!


Page 20: Ruby on hadoop

Wukong

•Infochimps
•Currently going through heavy development
•Use the 3.0.0.pre3 gem
 https://github.com/infochimps-labs/wukong/tree/3.0.0
•Model your jobs with wukong-hadoop
 https://github.com/infochimps-labs/wukong-hadoop


Page 21: Ruby on hadoop

Wukong

Wukong
•Write mappers and reducers using Ruby
•As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks (a minimal sketch follows after this list)

wukong-hadoop
•A CLI to use with Hadoop
•Created around building tasks with Wukong
•Better than piping in the shell (you can see this with --dry_run)
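To make the Processor idea concrete before the fuller examples on the next two slides, here is a minimal sketch of my own (the :upcaser name is illustrative, not from the deck):

# A tiny Wukong processor: yields each input record upcased.
Wukong.processor(:upcaser) do
  def process record
    yield record.to_s.upcase
  end
end

wu-hadoop then wires processors like this into a Hadoop Streaming invocation much like the one on the previous slide, which is what the --dry_run flag mentioned above lets you inspect.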


Page 22: Ruby on hadoop

Wukong Processors

•Fields are accessible through switches in the shell

•Local hand-off is made from STDOUT to STDIN

Wukong.processor(:mapper) do
  field :min_length, Integer,  :default => 1
  field :max_length, Integer,  :default => 256
  field :split_on,   Regexp,   :default => /\s+/
  field :remove,     Regexp,   :default => /[^a-zA-Z0-9\']+/
  field :fold_case,  :boolean, :default => false

  def process string
    tokenize(string).each do |token|
      yield token if acceptable?(token)
    end
  end

  private

  def tokenize string
    string.split(split_on).map do |token|
      stripped = token.gsub(remove, '')
      fold_case ? stripped.downcase : stripped
    end
  end

  def acceptable? token
    (min_length..max_length).include?(token.length)
  end
end


Page 23: Ruby on hadoop

Wukong Processors

Wukong.processor(:reducer, Wukong::Processor::Accumulator) do
  attr_accessor :count

  def start record
    self.count = 0
  end

  def accumulate record
    self.count += 1
  end

  def finalize
    yield [key, count].join("\t")
  end
end


Page 24: Ruby on hadoop

Wukong Processors

Simpsons - Ep 8 (sample of the word counts)

do 7
Doctor 1
Does 2
doesn't 1
dog 2
D'oh 1
doif 1
doing 2
done 1
doneYou 1
don't 10
Don't 1

wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
  --mode=local \
  --input=/home/hduser/simpsons/simpsonssubs/Simpsons\ [1.08].sub


Page 25: Ruby on hadoop

The End

Thank you!
@tomeara
[email protected]
