Apache Flume

Arinto Murdopo Josep Subirats Group 4 EEDC 2012

Upload: arinto-murdopo

Post on 22-Nov-2014

DESCRIPTION

Brief description of Apache Flume 0.9.x for EEDC assignment

TRANSCRIPT

Page 1: Apache Flume

Arinto Murdopo, Josep Subirats

Group 4, EEDC 2012

Page 2: Apache Flume

Outline

● Current problem

● What is Apache Flume?

● The Flume Model

○ Flows and Nodes

○ Agent, Processor and Collector Nodes

○ Data and Control Path

● Flume goals

○ Reliability

○ Scalability

○ Extensibility

○ Manageability

● Use case: Near Realtime Aggregator

Page 3: Apache Flume

Current Problem

● Situation: You have hundreds of services running on different servers, each producing large logs that should be analyzed together. You have Hadoop to process them.

● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!

Page 4: Apache Flume

What is Apache Flume?

● It is a distributed data collection service that collects flows of data (such as logs) at their sources and aggregates them at the place where they will be processed.

● Goals: reliability, scalability, extensibility, manageability.

Exactly what I needed!

Page 5: Apache Flume

The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server logs, machine monitoring metrics...).

● Flows are composed of nodes chained together (see slide 7).

Page 6: Apache Flume

The Flume Model: Flows and Nodes

● In a node, data comes in through a source, is optionally processed by one or more decorators, and is then transmitted out via a sink.

○ Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...

○ Decorator examples: wire batching, compression, sampling, projection, extraction...

○ Sink examples: Console, local files, HDFS, S3, other nodes...
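The source, decorator and sink chain inside a node can be sketched in a few lines of Python. This is only an illustration of the model on this slide, not Flume's actual API: the names `run_node`, `sample_every` and `project_first_field` are hypothetical, and an event is assumed to be a plain line of text.

```python
from typing import Callable, Iterable, Iterator, List

Event = str  # assumption: an event is modeled as one line of text
Decorator = Callable[[Iterator[Event]], Iterator[Event]]
Sink = Callable[[Event], None]


def sample_every(n: int) -> Decorator:
    """Sampling decorator: pass on every n-th event, starting with the first."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for index, event in enumerate(events):
            if index % n == 0:
                yield event
    return decorate


def project_first_field(events: Iterator[Event]) -> Iterator[Event]:
    """Projection decorator: keep only the first whitespace-separated field."""
    for event in events:
        fields = event.split()
        if fields:  # skip empty events instead of failing on them
            yield fields[0]


def run_node(source: Iterable[Event],
             decorators: List[Decorator],
             sink: Sink) -> None:
    """Pull events from the source, apply each decorator in order, feed the sink."""
    stream: Iterator[Event] = iter(source)
    for decorate in decorators:
        stream = decorate(stream)
    for event in stream:
        sink(event)


# Usage: a list stands in for the source, list.append stands in for the sink.
collected: List[Event] = []
run_node(
    ["web GET /", "db SELECT", "cache HIT", "db UPDATE"],
    [sample_every(2), project_first_field],
    collected.append,
)
print(collected)  # ['web', 'cache']
```

Because each decorator only wraps an iterator, decorators compose in any order and a node with an empty decorator list simply forwards events from source to sink.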

Page 7: Apache Flume

The Flume Model: Agent, Processor and Collector Nodes

● Agent: receives data from an application.

 

● Processor (optional): intermediate processing.

 

● Collector: writes data to permanent storage.

Page 8: Apache Flume

The Flume Model: Data and Control Path (1/2)

Nodes are in the data path.

Page 9: Apache Flume

The Flume Model: Data and Control Path (2/2)

Masters are in the control path.

● Centralized point of configuration. Multiple masters coordinate through ZooKeeper (ZK).

● Specify sources and sinks, and control data flows.

Page 10: Apache Flume

Flume Goals: Reliability

Tunable Failure Recovery Modes 

● Best Effort

 ● Store on Failure and Retry

 ● End to End Reliability
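To make the difference between the first two modes concrete, here is a minimal Python sketch. It is not Flume code: `best_effort_send` and `StoreOnFailureSink` are hypothetical names, an in-memory list stands in for local disk storage, and a failed delivery is assumed to surface as a `ConnectionError`. End-to-end reliability, which additionally needs acknowledgements from the final destination, is not shown.

```python
from typing import Callable, List

Downstream = Callable[[str], None]  # assumption: raises ConnectionError on failure


def best_effort_send(downstream: Downstream, event: str) -> bool:
    """Best effort: try once; if delivery fails, the event is dropped."""
    try:
        downstream(event)
        return True
    except ConnectionError:
        return False  # no retry, the event is lost


class StoreOnFailureSink:
    """Store on failure and retry: keep undelivered events and resend them later."""

    def __init__(self, downstream: Downstream) -> None:
        self.downstream = downstream
        self.pending: List[str] = []  # stand-in for a local on-disk store

    def append(self, event: str) -> None:
        """Queue the event, then try to deliver everything that is queued."""
        self.pending.append(event)
        self.flush()

    def flush(self) -> bool:
        """Deliver queued events in order; stop at the first failure.

        Returns True when nothing is left to deliver.
        """
        while self.pending:
            try:
                self.downstream(self.pending[0])
            except ConnectionError:
                return False  # keep the event for the next retry
            self.pending.pop(0)
        return True
```

An event is removed from the queue only after the downstream call returns, so a node that is temporarily unreachable delays events instead of losing them. The caller decides when to call `flush()` again, for example on a timer.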

Page 11: Apache Flume

Flume Goals: Scalability

Horizontally Scalable Data Path

Load Balancing

Page 12: Apache Flume

Flume Goals: Scalability

Horizontally Scalable Control Path

Page 13: Apache Flume

Flume Goals: Extensibility

● Simple Source and Sink API

○ Event streaming and composition of simple operations

● Plug-in Architecture

○ Add your own sources, sinks, decorators

  

Page 14: Apache Flume

Flume Goals: Manageability

Centralized Data Flow Management Interface 

Page 15: Apache Flume

Flume Goals: Manageability

Configuring Flume  

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;

Output Bucketing  

/logs/web/2010/0715/1200/data-xxx.txt

/logs/web/2010/0715/1200/data-xxy.txt

/logs/web/2010/0715/1300/data-xxx.txt

/logs/web/2010/0715/1300/data-xxy.txt

/logs/web/2010/0715/1400/data-xxx.txt
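The paths above follow a year / month-day / hour layout. A small Python sketch of how such a bucket path could be derived from an event timestamp is given here. The function `bucket_path` is hypothetical and only mirrors the layout shown on the slide; it does not reproduce Flume's own configuration syntax for bucketing.

```python
from datetime import datetime


def bucket_path(root: str, timestamp: datetime, file_name: str) -> str:
    """Build an hourly bucket path of the form root/YYYY/MMDD/HH00/file_name."""
    return f"{root}/{timestamp:%Y}/{timestamp:%m%d}/{timestamp:%H}00/{file_name}"


# Usage: an event logged at 12:34 on 15 July 2010 lands in the 1200 bucket.
print(bucket_path("/logs/web", datetime(2010, 7, 15, 12, 34), "data-xxx.txt"))
# /logs/web/2010/0715/1200/data-xxx.txt
```

Bucketing by time keeps each directory small and lets a downstream Hadoop job read just the hours it needs.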

Page 16: Apache Flume

Use Case: Near Realtime Aggregator

Page 17: Apache Flume

Conclusion

Flume is

● A distributed data collection service

● Suitable for enterprise settings

● Designed for large amounts of log data to process

Page 18: Apache Flume

Questions to be unveiled?  

Q&A