Flume @ Austin HUG 2/17/11


TRANSCRIPT

Jonathan Hsieh, Henry Robinson, Patrick Hunt

Cloudera, Inc.

Hadoop World 2010, 10/12/2010

Flume: Reliable Distributed Streaming Log Collection

Flume: 4 Months After Hadoop World 2010

Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener

Cloudera, Inc.

Austin Hadoop Users Group, 2/17/2011

Who Am I?

• Cloudera:

– Software Engineer on the Platform Team

– Flume Project Lead / Designer / Architect

• U of Washington:

– “On Leave” from PhD program

– Research in Systems and Programming Languages

• Previously:

– Computer Security, Embedded Systems.


The basic scenario

• You have a bunch of servers generating log files.

• You figured out that your logs are valuable and you want to keep them and analyze them.

• Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.

• … and you’ve got some ad-hoc, hacked together scripts that copy data from servers to HDFS.


It’s log, log… everyone wants a log!


Ad-hockery gets complicated

• Reliability – Will your data still get there if your scripts fail? If your hardware fails? If HDFS goes down? If EC2 flakes out?

• Scale – As you add servers, will your scripts keep up with 100GBs per day? Will you have tons of small files? Tons of connections? Are you willing to suffer more latency to mitigate these problems?

• Manageability – How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?

• Extensibility – Can you service custom logs? Send data to different places like HBase, Hive, or incremental search indexes? Can you do near-realtime?

• Blackbox – What happens when the guy who wrote it leaves?

Cloudera Flume

Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.

Project Principles:

• Scalability

• Reliability

• Extensibility

• Manageability

• Openness


Flume: The Standard Use Case

[Diagram, built up across four slides: racks of servers each run a Flume Agent; the agent tier sends events to a collector tier of Collector nodes, which write to HDFS. The later builds add a Master that controls the nodes in both tiers.]

Flume’s Key Abstractions

• Data path and control path

• Nodes are in the data path.

– Nodes have a source and a sink.

– They can take different roles: a typical topology has agent nodes and collector nodes, and optionally processor nodes.

• Masters are in the control path.

– Centralized point of configuration.

– Specify sources and sinks.

– Can control flows of data between nodes.

– Use one master, or use many with a ZooKeeper-backed quorum.

[Diagram: a Master configures two nodes, each a source feeding a sink; one acts as the Agent, the other as the Collector.]


Can I has the codez?

node001: tail("/var/log/app/log") | autoE2ESink;

node002: tail("/var/log/app/log") | autoE2ESink;

node100: tail("/var/log/app/log") | autoE2ESink;

collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")

collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")

collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")

Outline

• What is Flume?

• Scalability – horizontal scalability of all nodes and masters

• Reliability – fault tolerance and high availability

• Extensibility – Unix principle; all kinds of data, all kinds of sources, all kinds of sinks

• Manageability – centralized management supporting dynamic reconfiguration

• Openness – Apache v2.0 license and an active and growing community

SCALABILITY


Flume: The Standard Use Case

[Diagram repeated: server racks → agent tier → collector tier → HDFS.]

Data path is horizontally scalable

• Add collectors to increase availability and to handle more data.

– Assumes a single agent will not dominate a collector.

– Fewer connections to HDFS.

– Larger, more efficient writes to HDFS.

• Agents have mechanisms for machine-resource tradeoffs (sketched below).

– Write the log locally to avoid collector disk-I/O bottlenecks and catastrophic failures.

– Compression and batching (trade CPU for network).

– Push computation into the event-collection pipeline (balance I/O, memory, and CPU resource bottlenecks).

[Diagram: servers → agents → collector → HDFS.]
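For instance, batching and compression can be layered onto an agent’s sink as decorators. A minimal sketch in the Flume configuration language, assuming the batch/gzip (and matching unbatch/gunzip) decorators described in the Flume OG user guide; exact names and parameters may vary by version, and the auto chains may reverse the transformations for you:

node001: tail("/var/log/app/log") | { batch(100) => { gzip => autoE2ESink } };

collector1: autoCollectorSource | { gunzip => { unbatch => collectorSink("hdfs://logs/app/", "applogs") } };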

RELIABILITY



Tunable failure recovery modes

• Best effort

– Fire and forget.

• Store on failure + retry

– Local acks; local errors are detectable.

– Failover when faults are detected.

• End-to-end reliability

– End-to-end acks.

– Data survives compound failures, and may be retried multiple times.

[Diagram: an agent-to-collector-to-HDFS pipeline for each recovery mode; a sketch of each mode follows.]
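In the configuration language, each reliability level corresponds to an agent sink variant. A minimal sketch, assuming the agentBESink / agentDFOSink / agentE2ESink names from the Flume OG user guide (the collector hostname is illustrative):

node001: tail("/var/log/app/log") | agentBESink("collector01");

node002: tail("/var/log/app/log") | agentDFOSink("collector01");

node003: tail("/var/log/app/log") | agentE2ESink("collector01");

The first is best effort, the second stores events to local disk and retries on failure, and the third holds data until an end-to-end ack comes back.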

Load balancing and collector failover

• Agents are logically partitioned and send to different collectors.

• Use randomization to pre-specify failovers when many collectors exist (see the sketch below).

– Spread load if a collector goes down.

– Spread load if new collectors are added to the system.

[Diagram, two builds: agents spread their flows across several collectors and fail over when one dies.]

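One way to write an explicit failover chain, assuming the < primary ? backup > failover syntax from the Flume OG user guide (the auto* sinks generate randomized chains like this from the master’s list of live collectors):

node001: tail("/var/log/app/log") | < agentE2ESink("collectorA") ? agentE2ESink("collectorB") >;

If collectorA becomes unreachable, events flow to collectorB instead.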

Control plane is horizontally scalable

• A master controls dynamic configurations of nodes.

– Uses a consensus protocol to keep state consistent.

– Scales well for configuration reads.

– Allows for adaptive repartitioning in the future.

• Nodes can talk to any master.

• Masters can talk to an existing ZK ensemble.

[Diagram, built over three slides: nodes talk to any of three Masters, whose shared state lives in a ZooKeeper ensemble (ZK1, ZK2, ZK3).]



MANAGEABILITY

Wheeeeee!

Centralized Dataflow Management Interfaces

• One place to specify node sources, sinks, and data flows.

• Basic web interface.

• Flume shell (example session below)

– Command-line interface.

– Scriptable.

• Cloudera Enterprise

– Flume Monitor App.

– Graphical web interface.
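As a sketch of the scriptable shell, a session like the following connects to a master and pushes a node configuration. Command names follow the Flume OG shell; the master hostname and admin port are assumptions:

$ flume shell

[flume] connect master01:35873

[flume] exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'

[flume] getconfigs

Because every change goes through the master, the same commands work for one node or for hundreds, and they can be scripted for provisioning.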

Configuring Flume

node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;

• A concise and precise configuration language for specifying dataflows in a node.

• Dynamic updates of configurations

– Allows for live failover changes.

– Allows for handling newly provisioned machines.

– Allows for changing analytics.

[Diagram: tail → filter → fanout, which feeds console directly and hdfs via roll.]

Output bucketing

• Automatic output file management.

– Writes HDFS files into time-based buckets using escapes in the output path.

node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")

Resulting files in HDFS:

/logs/web/2010/0715/1200/data-xxx.txt

/logs/web/2010/0715/1200/data-xxy.txt

/logs/web/2010/0715/1300/data-xxx.txt

/logs/web/2010/0715/1300/data-xxy.txt

/logs/web/2010/0715/1400/data-xxx.txt
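Escapes can also pull event metadata into the path. For example, to additionally bucket output by the event’s source host, a sketch assuming the %{host} escape from the Flume OG user guide:

node : collectorSource | collectorSink("hdfs://namenode/logs/web/%{host}/%Y/%m%d/%H00", "data")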

EXTENSIBILITY


Flume is easy to extend

• Simple source and sink APIs

– An event-streaming design.

– Many simple operations compose into complex behavior.

• Plug-in architecture, so you can add your own sources, sinks, and decorators (composition sketched below).

[Diagram: sources, decorators, fanouts, and sinks chained in various compositions.]
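Composition is visible directly in the configuration language: decorators wrap sinks, and fanouts duplicate a stream to several of them. A hedged sketch, with intervalSampler taken from the Flume OG user guide as an example decorator (treat the name and argument as illustrative):

node : tail("/var/log/app/log") | [ autoE2ESink, { intervalSampler(100) => console } ];

Here every event goes to the reliable chain, while every 100th event is also echoed to the console.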

Variety of Connectors

• Sources produce data

– Console, Exec, Syslog, Scribe, IRC, Twitter

– In the works: JMS, AMQP, pubsubhubbub/RSS/Atom

• Sinks consume data

– Console, local files, HDFS, S3

– Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search

– In the works: JMS, AMQP

• Decorators modify data sent to sinks

– Wire batching, compression, sampling, projection, extraction, throughput throttling

– Custom near-real-time processing (Meebo)

– JRuby event modifiers (InfoChimps)

– Cryptographic extensions (Rearden)
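As one concrete source/sink pairing, a node can ingest syslog traffic directly into the standard collection chain. A sketch assuming the syslogUdp source from the Flume OG user guide (the port number is illustrative):

node : syslogUdp(5140) | autoE2ESink ;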

Flume: Multi Datacenter

[Diagram, two builds: in each datacenter, API servers and processor servers run agents feeding a collector tier; the second build inserts a Relay between the remote collector tier and the central HDFS.]

Flume: Near Realtime Aggregator

[Diagram: ad servers run agents feeding a collector and a tracker; the tracker produces quick reports into a DB, which a Hive job over HDFS later verifies against full reports.]

Flume: An enterprise story

[Diagram: Windows and Linux API servers run agents feeding a collector tier that writes to a Kerberized HDFS, with identities managed through Active Directory / LDAP.]

An emerging community story

[Diagram: servers run agents feeding a collector that fans out to hdfs (Hive and Pig queries), hbase (key lookups and range queries), and an incremental search index (search and faceted queries).]


OPENNESS AND COMMUNITY


Flume is Open Source

• Apache v2.0 open source license

– Independent from the Apache Software Foundation.

• GitHub source code repository

– http://github.com/cloudera/flume

– Regular tarball update versions every 2-3 months.

– Regular CDH packaging updates every 3-4 months.

• Review Board for code review.

• New external committers wanted!

– Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer

– Independent folks: Bruce Mitchener

Growing user and developer community

• History:

– Initial Open Source Release, June 2010

• Growth:

– Pre-Hadoop Summit (Late June 2010):

• 4 followers, 4 forks (original authors)

– Pre-Hadoop World (October 2010):

• 174 followers, 34 forks

– Pre-CDH3B4 Release (February 2011):

• 288 followers, 51 forks


Support

• Community-based mailing lists for support

– “an answer in a few days”

– User: https://groups.google.com/a/cloudera.org/group/flume-user

– Dev: https://groups.google.com/a/cloudera.org/group/flume-dev

• Community-based IRC chat room

– “quick questions, quick answers”

– #flume in irc.freenode.net

• Commercial support with a Cloudera Enterprise subscription

– Contact [email protected]


CONCLUSIONS


Summary

• Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.

– It is centrally managed, which allows for automated and adaptive configurations.

– This design allows for near-real time processing.

– Apache v2.0 License with active and growing community

• Part of Cloudera’s Distribution for Hadoop, about to be refreshed in CDH3B4.


Questions? (and shameless plugs)

• Contact info:

– [email protected]

– Twitter: @jmhsieh

• Cloudera Training in Dallas

– Hadoop Training for Developers: March 14-16

– Hadoop Training for Administrators: March 17-18

– Sign up at http://cloudera.eventbrite.com

– 10% discount code for classes: "hug"

• Cloudera is Hiring!