How to collect Big Data into Hadoop

Sadayuki Furuhashi / fluentd.org

Big Data processing to collect Big Data

TRANSCRIPT

Page 1: How to collect Big Data into Hadoop

Sadayuki Furuhashi / fluentd.org

How to collect Big Data into Hadoop

Big Data processing to collect Big Data

Page 2: How to collect Big Data into Hadoop

Self-introduction

> Sadayuki Furuhashi

> Treasure Data, Inc. - Founder & Software Architect

> Open source projects:
  MessagePack - efficient serializer (original author)
  Fluentd - event collector (original author)

Page 4: How to collect Big Data into Hadoop

Today’s topic

Page 5: How to collect Big Data into Hadoop

(Diagram: Big Data → Report & Monitor)

Page 6: How to collect Big Data into Hadoop

(Diagram: Big Data → Collect → Store → Process → Visualize → Report & Monitor)

Page 7: How to collect Big Data into Hadoop

Collect → Store → Process → Visualize

Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R

easier & shorter time

Page 8: How to collect Big Data into Hadoop

Collect → Store → Process → Visualize

Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R

Store, Process, and Visualize keep getting easier & shorter - how to shorten the Collect step?

Page 9: How to collect Big Data into Hadoop

Problems in collecting data

Page 10: How to collect Big Data into Hadoop

Poor man’s data collection

1. Copy files from servers using rsync

2. Create a RegExp to parse the files

3. Parse the files and generate a 10GB CSV file

4. Put it into HDFS
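For concreteness, here is a rough Ruby sketch of those four steps (the hostname, paths, and regexp are illustrative). Note the complete lack of error handling, which is exactly the problem described next:

require "csv"

# 1. copy files from servers using rsync
system("rsync -a web01:/var/log/apache2/ /tmp/logs/") or abort "rsync failed"

# 2. a RegExp to parse the files (rough Apache access-log shape)
PATTERN = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<request>[^"]*)" (?<code>\d+)/

# 3. parse the files and generate one big CSV file
CSV.open("/tmp/access.csv", "w") do |csv|
  Dir.glob("/tmp/logs/*.log") do |path|
    File.foreach(path) do |line|
      m = PATTERN.match(line) or next   # broken lines are silently dropped!
      csv << [m[:host], m[:time], m[:request], m[:code]]
    end
  end
end

# 4. put it into HDFS
system("hadoop fs -put /tmp/access.csv /data/access.csv")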

Page 11: How to collect Big Data into Hadoop

Problems in collecting "big data"

> Includes broken values → needs error handling & retrying

> Time-series data keep changing and formats are unclear → parse logs before storing

> Takes time to read/write → tools have to be optimized and parallelized

> Takes time for trial & error

> Causes network traffic spikes

Page 12: How to collect Big Data into Hadoop

Problem of poor man’s data collection

> Wastes time to implement error handling
> Wastes time to maintain a parser
> Wastes time to debug the tool
> Not reliable
> Not efficient

Page 13: How to collect Big Data into Hadoop

Basic theories to collect big data

Page 14: How to collect Big Data into Hadoop

Divide & Conquer

(Diagram: a big data transfer divided into small pieces, so an error affects only one piece)

Page 15: How to collect Big Data into Hadoop

Divide & Conquer & Retry

(Diagram: each piece that hits an error is retried individually until it succeeds)
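A minimal Ruby sketch of this idea (send_chunk and the chunk list are hypothetical stand-ins): split the payload into small pieces and retry each failed piece independently, waiting exponentially longer between attempts. This is what Fluentd's buffering does for you:

# hypothetical transfer of one small piece; replace with a real upload
def send_chunk(chunk)
  raise "transient failure" if rand < 0.2   # simulate occasional errors
  puts "sent #{chunk}"
end

def send_with_retry(chunk, max_retries: 5)
  wait = 1.0
  attempts = 0
  begin
    send_chunk(chunk)
  rescue => e
    attempts += 1
    raise e if attempts > max_retries
    sleep wait
    wait *= 2   # exponential retry wait
    retry
  end
end

chunks = (1..10).map { |i| "piece-#{i}" }   # the divided payload
chunks.each { |chunk| send_with_retry(chunk) }   # only failed pieces repeat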

Page 16: How to collect Big Data into Hadoop

Streaming

(Diagram: don't handle big files on the collection side; stream small chunks continuously and do the heavy processing in Hadoop)

Page 17: How to collect Big Data into Hadoop

Apache Flume and Fluentd

Page 18: How to collect Big Data into Hadoop

Apache Flume

Page 19: How to collect Big Data into Hadoop

Apache Flume

(Diagram: Agents on each server collect access logs, app logs, system logs, ... and forward them to Collectors)

Page 20: How to collect Big Data into Hadoop

Apache Flume - network topology

(Diagram)
Flume OG: Agents → (send) → Collectors, with acks routed through a central Master.
Flume NG: Agents → (send/ack) → Collectors directly.

Page 21: How to collect Big Data into Hadoop

Apache Flume - pipeline

Flume OG: Source → Sink
Flume NG: Source → Channel → Sink

(each stage is a plugin)

Page 22: How to collect Big Data into Hadoop

Apache Flume - configuration

(Diagram: Flume NG Agents and Collectors with a Master node)

The Master manages all configuration (optional).

Page 23: How to collect Big Data into Hadoop

Apache Flume - configuration

# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1

Page 24: How to collect Big Data into Hadoop

Fluentd

Page 25: How to collect Big Data into Hadoop

Fluentd - network topology

Fluentd: fluentd → (send/ack) → fluentd; every node runs the same fluentd daemon.
Flume NG: Agent → (send/ack) → Collector.

Page 26: How to collect Big Data into Hadoop

Fluentd - pipeline

Fluentd: Input → Buffer → Output
Flume NG: Source → Channel → Sink

(each stage is a plugin)

Page 27: How to collect Big Data into Hadoop

Fluentd - configuration

(Diagram: a mesh of fluentd nodes, with no central master)

No central node - keep things simple.
Use chef, puppet, etc. for configuration (they do things better).

Page 28: How to collect Big Data into Hadoop

Fluentd - configuration

<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>

Page 29: How to collect Big Data into Hadoop

Fluentd - configuration: Fluentd vs. Flume NG

Fluentd:

<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>

Flume NG:

# source

host1.sources = avro-source1

host1.sources.avro-source1.type = avro

host1.sources.avro-source1.bind = 0.0.0.0

host1.sources.avro-source1.port = 41414

host1.sources.avro-source1.channels = ch1

# channel

host1.channels = ch_avro_log

host1.channels.ch_avro_log.type = memory

# sink

host1.sinks = log-sink1

host1.sinks.log-sink1.type = logger

host1.sinks.log-sink1.channel = ch1

Page 30: How to collect Big Data into Hadoop

Fluentd - Users

Page 31: How to collect Big Data into Hadoop

Fluentd - plugin distribution platform

$ fluent-gem search -rd fluent-plugin

$ fluent-gem install fluent-plugin-mongo

Page 32: How to collect Big Data into Hadoop

Fluentd - plugin distribution platform

$ fluent-gem search -rd fluent-plugin

$ fluent-gem install fluent-plugin-mongo

94 plugins!

Page 33: How to collect Big Data into Hadoop

Concept of Fluentd

> Customization is essential → small core + many plugins

> Fluentd core helps to implement plugins → common features are already implemented

Page 34: How to collect Big Data into Hadoop

Fluentd core: Divide & Conquer, Retrying, Parallelize, Error handling, Message routing

Plugins: read / receive data, write / send data

Page 35: How to collect Big Data into Hadoop

Fluentd plugins

Page 36: How to collect Big Data into Hadoop

in_tail

apache → access.log → (in_tail) → fluentd

✓ read a log file
✓ custom regexp
✓ custom parser in Ruby
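A minimal in_tail configuration along these lines (path, pos_file, and tag are illustrative):

<source>
  type tail
  path /var/log/apache2/access.log
  pos_file /var/log/fluentd/access.log.pos
  tag apache.access
  format apache2
</source>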

Page 37: How to collect Big Data into Hadoop

out_mongo

apache → access.log → (in_tail) → fluentd (buffer) → (out_mongo)

Page 38: How to collect Big Data into Hadoop

out_mongo

apache → access.log → (in_tail) → fluentd (buffer) → (out_mongo)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
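A sketch of the matching fluent-plugin-mongo output (host, database, and collection are illustrative):

<match apache.access>
  type mongo
  host 127.0.0.1
  port 27017
  database apache
  collection access
  flush_interval 10s
</match>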

Page 39: How to collect Big Data into Hadoop

out_s3

apache → access.log → (in_tail) → fluentd (buffer) → (out_s3) → Amazon S3

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
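A fluent-plugin-s3 sketch that produces time-sliced objects like those above (credentials and bucket name are placeholders):

<match apache.access>
  type s3
  aws_key_id YOUR_AWS_KEY_ID
  aws_sec_key YOUR_AWS_SECRET_KEY
  s3_bucket my-log-archive
  buffer_path /var/log/fluentd/s3
  time_slice_format %Y-%m-%d/%H
  time_slice_wait 10m
</match>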

Page 40: How to collect Big Data into Hadoop

out_hdfs

apache → access.log → (in_tail) → fluentd (buffer) → (out_hdfs) → HDFS

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
✓ custom text formatter
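One common way to write to HDFS from Fluentd is fluent-plugin-webhdfs; a sketch (the namenode host and path are illustrative; the strftime placeholders in the path give the time-based slicing):

<match apache.access>
  type webhdfs
  host namenode.example.com
  port 50070
  path /log/access/%Y%m%d_%H/access.log
</match>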

Page 41: How to collect Big Data into Hadoop

out_hdfs

apache → access.log → (in_tail) → fluentd (buffer) → multiple downstream fluentd nodes → HDFS
(time-sliced files: 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, ...)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ automatic fail-over
✓ load balancing
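A sketch of out_forward with several downstream nodes (addresses are illustrative): listing multiple <server> entries gives load balancing, and a server marked standby is used only on fail-over.

<match apache.access>
  type forward
  <server>
    host 192.168.0.11
    port 24224
  </server>
  <server>
    host 192.168.0.12
    port 24224
  </server>
  <server>
    host 192.168.0.13
    port 24224
    standby
  </server>
</match>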

Page 42: How to collect Big Data into Hadoop

Fluentd examples

Page 43: How to collect Big Data into Hadoop

Fluentd at Treasure Data - REST API logs

(Diagram)
API servers: each Rails app logs via fluent-logger-ruby + in_forward to a local fluentd,
which forwards with out_forward to the fluentd on the watch server.
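A minimal fluent-logger-ruby sketch of what the application side looks like (the tag and record fields are illustrative); the local fluentd receives these events through in_forward on port 24224:

require 'fluent-logger'

log = Fluent::Logger::FluentLogger.new('myapp',
                                       :host => 'localhost', :port => 24224)

# tagged as myapp.api.access and routed by fluentd's <match> rules
log.post('api.access', 'method' => 'GET', 'path' => '/v1/status')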

Page 44: How to collect Big Data into Hadoop

Fluentd at Treasure Data - backend logs

(Diagram)
API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd):
apps log via fluent-logger-ruby + in_forward → local fluentd → (out_forward) → fluentd on the watch server.

Page 45: How to collect Big Data into Hadoop

Fluentd at Treasure Data - monitoring

(Diagram)
API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd) log via
fluent-logger-ruby + in_forward; a queue (PerfectQueue) connects API servers to workers.
On the watch server, a monitoring script is run through in_exec and its results are forwarded with out_forward.

Page 46: How to collect Big Data into Hadoop

Fluentd at Treasure Data - Hadoop logs

(Diagram)
watch server: fluentd runs a script via in_exec; the script polls the Hadoop JobTracker with thrift API calls.

✓ resource consumption statistics for each user
✓ capacity monitoring
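An in_exec sketch for this kind of polling (the script path, tag, and interval are illustrative; the script is assumed to print one JSON object per run):

<source>
  type exec
  command ruby /opt/monitor/jobtracker_stats.rb
  format json
  tag hadoop.jobtracker
  run_interval 60s
</source>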

Page 47: How to collect Big Data into Hadoop

Fluentd at Treasure Data - store & analyze

(Diagram)
watch server fluentd → (out_metricsense) → Librato Metrics for realtime analysis (✓ streaming aggregation)
watch server fluentd → (out_tdlog) → Treasure Data for historical analysis

Page 48: How to collect Big Data into Hadoop
Page 49: How to collect Big Data into Hadoop
Page 50: How to collect Big Data into Hadoop

Plugin development

Page 51: How to collect Big Data into Hadoop

class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)

  config_param :tag, :string

  def start
    Thread.new do
      while true
        time = Engine.now
        record = {"user" => 1, "size" => 1}
        Engine.emit(@tag, time, record)
      end
    end
  end

  def shutdown
    ...
  end
end

<source>
  type myin
  tag myapp.api.heartbeat
</source>

Page 52: How to collect Big Data into Hadoop

class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def write(chunk)
    puts chunk.read
  end
end

<match **>
  type myout
  myparam foobar
</match>

Page 53: How to collect Big Data into Hadoop

class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user" => array[0], "item" => array[1]}
    return time, record
  end
end

<source>
  type mytail
</source>

Page 54: How to collect Big Data into Hadoop
Page 55: How to collect Big Data into Hadoop

Fluentd v11

> Error stream

> Streaming processing

> Better DSL

> Multiprocess