flume and hbase
Buzzwords Berlin HBase Hackathon, June 2012
Apache Flume and HBase
Alexander Alten-Lorenz | Customer Operations Engineer
©2012 Cloudera, Inc. All Rights Reserved.
About Me
• COPS Engineer @ Cloudera
• Apache Flume Contributor
• Working with Hadoop since 2009
• Blogger (mapredit.blogspot.com)
• Speaker at Conferences / Meetups / Tooling Events
Flume 1.x
• Mass event collector
• Stream data (events, not files) from clients to sinks
• Clients: files, syslog, avro, seq, exec
• Sinks: HDFS files, HBase, …
• Configurable routing / topology
Architecture
Component   Function
Agent       The JVM running Flume. One per machine. Runs many sources and sinks.
Client      Produces data in the form of events. Runs in a separate thread.
Sink        Receives events from a channel. Runs in a separate thread.
Channel     Connects sources to sinks (like a queue). Implements the reliability semantics.
Event       A single datum: a log record, an avro object, etc. Normally around 4 KB.
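The roles in the table can be sketched in plain Java. This is a toy model only and does not use the Flume API: the class name PipelineSketch, the Event class, and the queue capacity of 16 are assumptions made for illustration. A client thread puts events on a bounded queue (the channel) and a sink thread takes them off.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model of the client -> channel -> sink flow; NOT the Flume API.
class PipelineSketch {

    // An event is a single datum; here it only carries a byte payload.
    static final class Event {
        final byte[] body;

        Event(String text) {
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }

        String text() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }

    // Runs one client thread and one sink thread connected by a bounded
    // queue, and returns what the sink received, in arrival order.
    static List<String> run(List<String> lines) {
        final BlockingQueue<Event> channel = new ArrayBlockingQueue<>(16); // assumed capacity
        final List<String> delivered = new ArrayList<>();
        final int expected = lines.size();

        Thread client = new Thread(() -> {
            try {
                for (String line : lines) {
                    channel.put(new Event(line)); // blocks while the channel is full
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        Thread sink = new Thread(() -> {
            try {
                while (delivered.size() < expected) {
                    Event event = channel.poll(1, TimeUnit.SECONDS);
                    if (event == null) {
                        break; // nothing arrived within the timeout, give up
                    }
                    delivered.add(event.text());
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });

        client.start();
        sink.start();
        try {
            client.join();
            sink.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ex);
        }
        return delivered;
    }
}
```

Calling PipelineSketch.run(Arrays.asList("login", "click", "logout")) hands the three events from the client thread to the sink thread through the queue. The queue decouples the two threads, which is the role the channel plays in the table.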
Agent
• Runs many clients and sinks
• Java properties-based configuration
• Low overhead (-Xmx20m)
  – Adding RAM increases performance
  – Setting -Xms up front avoids heap allocation at run time
  – Batching increases performance dramatically
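One way to see why batching helps is to count how many commits a sink performs for the same number of events. The sketch below is plain Java, not Flume code: BatchingSketch, drain, and events are hypothetical names, and a "commit" is only a counter standing in for a flush or transaction commit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Counts commits needed to drain a queue at a given batch size.
// Illustration only; names and the commit model are assumptions.
class BatchingSketch {

    // Drains the channel in batches of up to batchSize events and
    // returns how many commits (one per batch) were needed.
    static int drain(Queue<String> channel, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int commits = 0;
        List<String> batch = new ArrayList<>(batchSize);
        while (!channel.isEmpty()) {
            batch.clear();
            while (batch.size() < batchSize && !channel.isEmpty()) {
                batch.add(channel.poll());
            }
            commits++; // one flush / commit per batch, not per event
        }
        return commits;
    }

    // Builds a queue holding n dummy events.
    static Queue<String> events(int n) {
        Queue<String> queue = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            queue.add("event-" + i);
        }
        return queue;
    }
}
```

With 1000 events, a batch size of 1 needs 1000 commits while a batch size of 100 needs 10. The fixed cost of each commit is paid far less often, which is the effect the slide refers to.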
Sources
• Plugin interface
• Managed by a SourceRunner that controls threading and execution model (e.g. polling vs. event-based)
• Included: exec, avro, syslog, seq
HBase sink
ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/
HBaseSink.java
HbaseEventSerializer.java
SimpleHbaseEventSerializer.java
SimpleRowKeyGenerator.java
HBaseSink.java
• Controls flush()
• Uses the serializer
• Controls the transaction
• Controls rollbacks (in case events couldn't be written)
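The transaction and rollback behaviour listed above can be sketched as a take / write / commit cycle. This is a simplified model in plain Java, not the code in HBaseSink.java: TxChannel, Writer, and process are hypothetical names, and the Writer stands in for the write to HBase.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of a sink cycle with commit and rollback.
// Illustration only; this is NOT the HBaseSink implementation.
class TransactionSketch {

    // Hypothetical stand-in for the HBase write; signals failure by
    // throwing a RuntimeException.
    interface Writer {
        void write(List<String> batch);
    }

    // Minimal transactional channel: taken events are held until commit
    // and put back at the head of the queue on rollback.
    static final class TxChannel {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Deque<String> taken = new ArrayDeque<>();

        void put(String event) {
            queue.addLast(event);
        }

        String take() {
            String event = queue.pollFirst();
            if (event != null) {
                taken.addLast(event);
            }
            return event;
        }

        void commit() {
            taken.clear(); // events are gone for good
        }

        void rollback() {
            // Push taken events back in reverse so the original order is restored.
            while (!taken.isEmpty()) {
                queue.addFirst(taken.pollLast());
            }
        }

        int size() {
            return queue.size();
        }
    }

    // One sink cycle: take up to batchSize events, write them, commit.
    // If the write fails, roll back so no event is lost. Returns true on commit.
    static boolean process(TxChannel channel, Writer writer, int batchSize) {
        List<String> batch = new ArrayList<>();
        try {
            String event;
            while (batch.size() < batchSize && (event = channel.take()) != null) {
                batch.add(event);
            }
            writer.write(batch);
            channel.commit();
            return true;
        } catch (RuntimeException ex) {
            channel.rollback();
            return false;
        }
    }
}
```

If the writer throws, the events taken in that cycle return to the channel in their original order and can be retried in the next cycle. This models the reliability semantics that the channel is described as implementing.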
Configuration
• Source Seq interface
• Listening on a defined port @localhost
• Serializer needs some parameters
• Column family and column must be known
• Valid hbase-site.xml in $CLASSPATH
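To illustrate why the column family and column have to be known up front, here is a minimal sketch of what a serializer does: it turns an event body into a description of one cell write. The names SerializerSketch, Cell, EventSerializer, and SimpleSerializer are hypothetical, and the "event-N" row key scheme is an assumption. This is not the interface defined in HbaseEventSerializer.java.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the serializer idea: event body in, cell description out.
// All names here are hypothetical; this is NOT the Flume or HBase API.
class SerializerSketch {

    // Plain description of a single cell write.
    static final class Cell {
        final String rowKey;
        final String family;
        final String column;
        final byte[] value;

        Cell(String rowKey, String family, String column, byte[] value) {
            this.rowKey = rowKey;
            this.family = family;
            this.column = column;
            this.value = value;
        }
    }

    interface EventSerializer {
        Cell toCell(byte[] eventBody);
    }

    // Writes every event body into one fixed column of one fixed family.
    static final class SimpleSerializer implements EventSerializer {
        private final String family;
        private final String column;
        private long counter = 0;

        SimpleSerializer(String family, String column) {
            // Without both names the serializer cannot address a cell.
            if (family == null || family.isEmpty() || column == null || column.isEmpty()) {
                throw new IllegalArgumentException("column family and column must be set");
            }
            this.family = family;
            this.column = column;
        }

        @Override
        public Cell toCell(byte[] eventBody) {
            String rowKey = "event-" + (counter++); // assumed key scheme, for illustration
            return new Cell(rowKey, family, column, eventBody);
        }
    }

    public static void main(String[] args) {
        EventSerializer serializer = new SimpleSerializer("testing", "pcol");
        Cell cell = serializer.toCell("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(cell.rowKey + " " + cell.family + ":" + cell.column);
    }
}
```

The constructor rejects a missing family or column because every cell the serializer produces depends on both.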
Configuration Example
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol
host1.channels.ch1.type = memory
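Because the configuration is a plain Java properties file, it can be sanity-checked before an agent is started. The sketch below uses only java.util.Properties. The class name ConfigCheckSketch, the method missingSinkKeys, and the choice of which keys to treat as required (type, channel, table, columnFamily, serializer) are assumptions based on the example above, not a list taken from Flume.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Checks that a sink definition carries the keys used in the example.
// Which keys count as "required" is an assumption for illustration.
class ConfigCheckSketch {

    private static final String[] ASSUMED_REQUIRED = {
        "type", "channel", "table", "columnFamily", "serializer"
    };

    // Returns the assumed-required keys that are missing or blank for
    // the given agent and sink, e.g. agent "host1" and sink "sink1".
    static List<String> missingSinkKeys(String config, String agent, String sink) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        String prefix = agent + ".sinks." + sink + ".";
        List<String> missing = new ArrayList<>();
        for (String key : ASSUMED_REQUIRED) {
            String value = props.getProperty(prefix + key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

For the configuration shown above, missingSinkKeys(text, "host1", "sink1") returns an empty list. Removing the columnFamily line would make the list contain "columnFamily".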
Take Away
• Flume collects events
• Source - Channel - Sink concept
• HBase sink needs a serializer interface
• Column family and column must be known
Thank You
• Web: https://cwiki.apache.org/FLUME/getting-started.html
• ML: [email protected]
• Mail: [email protected]
• Blog: mapredit.blogspot.com
• Twitter: @mapredit