streaming kafka search utility for mozilla's bagheera

Streaming Kafka Search Utility for Bagheera

Varunkumar Manohar

Metrics Engineering Intern - Summer 2013

San Francisco Commons - 20th August

Apache Kafka

Why use Kafka?

Mozilla’s Bagheera System

Search Utility

Practical Usage

Other Projects

Apache Kafka

A high throughput distributed messaging system.

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Centralized data pipeline

[Diagram: Producer1, Producer2 and Producer3 write to a centralised persistent data pipeline (Apache Kafka); Consumer1, Consumer2 and Consumer3 read from it.]

Since it is persistent, consumers can lag behind.

Producers and consumers do not know each other.

Consumer maintenance is easy.

High Throughput

Partitioning of data allows production, consumption and brokering to be handled by clusters of machines. Scaling horizontally is easy.

Batch the messages and send large chunks at once.

Use of the filesystem page cache: the flush to disk is delayed.

Shrinkage of data
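The batching point can be illustrated with a small sketch. This is not Kafka's producer code; `BatchingSketch`, its `batchSize` parameter and the `sink` callback are hypothetical names used only to show the idea of buffering messages and handing them over in chunks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batching buffer: collects messages and hands them to a sink in chunks. */
final class BatchingSketch {
    private final int batchSize;
    private final Consumer<List<String>> sink; // stand-in for "send this chunk over the network"
    private List<String> pending = new ArrayList<>();

    BatchingSketch(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one message and flushes once the batch is full. */
    void send(String message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Hands whatever is buffered to the sink as one chunk; does nothing if the buffer is empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }

    public static void main(String[] args) {
        BatchingSketch producer = new BatchingSketch(2, batch -> System.out.println("sending " + batch));
        for (int i = 0; i < 5; i++) {
            producer.send("message_" + i);
        }
        producer.flush(); // push out the last, partially filled batch
    }
}
```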

[Diagram: the Metrics topic spread across KafkaNode 1 to KafkaNode 4, each node holding Ptn 1 to Ptn 4.]

Kafka Commit log

Real-time data flow for the Metrics topic (in production)

Each partition = commit log

At offset 0 we have message_37.

That can be a JSON document, for example.

Underlying principle

Use a persistent log as a messaging system.

This parallels the concept of a commit log.

An append-only commit log keeps track of the incoming messages.
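The principle can be sketched with a toy in-memory log. `CommitLogSketch` is a hypothetical stand-in for a single partition, not Kafka code, and it uses simple sequence numbers as offsets, which may differ from the offset scheme of a real broker. The point it shows: messages are only ever appended, and each consumer just remembers the offset it has reached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory stand-in for one partition: an append-only list indexed by offset. */
final class CommitLogSketch {
    private final List<String> entries = new ArrayList<>();

    /** Appends a message and returns the offset it was stored at. */
    long append(String message) {
        entries.add(message);
        return entries.size() - 1;
    }

    /** Returns up to maxMessages starting at the given offset; the log itself is never modified. */
    List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset >= entries.size() || maxMessages <= 0) {
            return Collections.emptyList();
        }
        int from = (int) offset;
        int to = Math.min(entries.size(), from + maxMessages);
        return new ArrayList<>(entries.subList(from, to));
    }

    public static void main(String[] args) {
        CommitLogSketch log = new CommitLogSketch();
        log.append("message_37");
        log.append("message_38");
        log.append("message_39");

        // Each consumer tracks its own position, so a slow consumer can lag behind a fast one.
        long slowConsumerOffset = 1;
        System.out.println(log.read(slowConsumerOffset, 10));
    }
}
```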

Mozilla’s Bagheera System

Some real-time numbers (per minute)

[Chart: messages per minute; axis ticks at 0.9k, 1.7k, 2.6k and 3.5k.]

8.7K messages per minute in week 31

Some questions !

Can we be more granular in finding out the counts?

Can I get the count of messages that were pushed 3 days back?

Can I get the count of messages between Sunday and Tuesday?

Can I get the total number of messages that came in 3 days back and belong to updatechannel=‘release’?

Can I get the count of messages that came in from the UK two days ago?

We can go to Hadoop or HBase, for that matter, and scan the data.

But Hadoop/HBase is in reality a massive data store. Mind-blowing!

Crunching through that much data is not at all efficient.

Can we search the Kafka queue, which has a fair amount of data retained as per the retention policy?

Yup! You can query only the data retained in the Kafka logs. Typically our queries fall within those bounds.

Yes! We can do this more efficiently by using the Kafka offsets and the data associated with an offset.

The data we store has a timestamp (the time of insertion into the queue). Check the timestamp to know whether the message fits our filter conditions.

We can selectively export the data we have retrieved
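As a rough illustration of the timestamp check, the sketch below counts messages whose insertion time falls inside a window, optionally restricted to one update channel. The `Message` class and its fields are assumptions made for this example; the real Bagheera message layout is not shown in these slides.

```java
import java.util.Arrays;
import java.util.List;

final class TimeWindowCount {
    /** Hypothetical message shape: insertion timestamp (epoch millis) plus one payload field. */
    static final class Message {
        final long insertedAtMillis;
        final String updateChannel;

        Message(long insertedAtMillis, String updateChannel) {
            this.insertedAtMillis = insertedAtMillis;
            this.updateChannel = updateChannel;
        }
    }

    /**
     * Counts messages inserted in [fromMillis, toMillis).
     * A null channel means "any channel"; otherwise the channel must match exactly.
     */
    static long count(List<Message> messages, long fromMillis, long toMillis, String channel) {
        long matches = 0;
        for (Message m : messages) {
            boolean inWindow = m.insertedAtMillis >= fromMillis && m.insertedAtMillis < toMillis;
            boolean channelOk = channel == null || channel.equals(m.updateChannel);
            if (inWindow && channelOk) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Message> messages = Arrays.asList(
                new Message(100L, "release"),
                new Message(200L, "beta"),
                new Message(300L, "release"));
        System.out.println(count(messages, 150L, 350L, null));
        System.out.println(count(messages, 0L, 1000L, "release"));
    }
}
```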

Concurrent execution across partitions

for (int i = 0; i < totalPartitions; i++) {
    /* Create a callable object and submit it to the executor to trigger an execution. */
    Callable<Long> callable = new KafkaQueueOps(brokerNodes.get(brokerIndex), topics.get(topicIndex), i, noDays);
    final ListenableFuture<Long> future = pool.submit(callable);
    computationResults.add(future);
}

/* Collect the results of all the futures that completed successfully. */
ListenableFuture<List<Long>> successfulResults = Futures.successfulAsList(computationResults);
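The same fan-out pattern can be tried without Guava. The sketch below swaps in the JDK's `ExecutorService` and `Future` for `ListenableFuture`, and skips failed partitions where `Futures.successfulAsList` would report `null`. `PartitionFanOut` and `sumCounts` are hypothetical names, and the tasks are plain stand-ins for `KafkaQueueOps`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionFanOut {
    /** Runs one task per partition in parallel and sums the counts of the tasks that succeed. */
    static long sumCounts(List<Callable<Long>> perPartitionTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, perPartitionTasks.size()));
        try {
            long total = 0;
            // invokeAll blocks until every task has finished, successfully or not.
            for (Future<Long> future : pool.invokeAll(perPartitionTasks)) {
                try {
                    total += future.get();
                } catch (ExecutionException e) {
                    // A failed partition is skipped, similar in spirit to successfulAsList.
                }
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partitions", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int partition = 0; partition < 4; partition++) {
            final long fakeCount = 10L * (partition + 1); // stand-in for a real per-partition count
            tasks.add(() -> fakeCount);
        }
        System.out.println(sumCounts(tasks));
    }
}
```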

long[] sparseLst = consumer.getOffsetsBefore(topicName, partitionNumber, -1, Integer.MAX_VALUE);

/* sparseLst is a sparse list of offsets only (the names of the log files only) */

for (int i = 1; i < sparseLst.length; i++) {
    /* Fetch the message at sparseLst[i] */
    /* De-serialize the data using Google Protocol Buffers */
    /* Pseudocode: compare the timestamp of the fetched message with the time range */
    if (timestamp of the message at sparseLst[i] <= timeRange) {
        checkpoint = sparseLst[i];
        break;
    }
}

/* Start fetching the data from checkpoint, skipping through every offset until the precise offset value is obtained */
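The two-phase search above can be sketched in isolation. This is a simplified model, not the utility's code: offsets are treated as sequence numbers, the sparse list is assumed to be ordered newest first, and `timestampAt` is a hypothetical lookup standing in for "fetch the message at this offset and de-serialize its timestamp".

```java
import java.util.function.LongUnaryOperator;

final class SparseOffsetSearch {
    /** Phase 1: pick the newest sparse offset whose message is not newer than the target time. */
    static long findCheckpoint(long[] sparseNewestFirst, LongUnaryOperator timestampAt, long targetMillis) {
        if (sparseNewestFirst.length == 0) {
            throw new IllegalArgumentException("no offsets available");
        }
        for (long start : sparseNewestFirst) {
            if (timestampAt.applyAsLong(start) <= targetMillis) {
                return start;
            }
        }
        // Everything retained is newer than the target: start from the oldest retained offset.
        return sparseNewestFirst[sparseNewestFirst.length - 1];
    }

    /** Phase 2: walk forward from the checkpoint to the first message at or after the target time. */
    static long findPreciseOffset(long checkpoint, long endOffsetExclusive,
                                  LongUnaryOperator timestampAt, long targetMillis) {
        long offset = checkpoint;
        while (offset < endOffsetExclusive && timestampAt.applyAsLong(offset) < targetMillis) {
            offset++;
        }
        return offset;
    }

    public static void main(String[] args) {
        final long[] timestamps = {10, 20, 30, 40, 50, 60, 70, 80}; // index = offset
        LongUnaryOperator timestampAt = offset -> timestamps[(int) offset];
        long checkpoint = findCheckpoint(new long[] {6, 3, 0}, timestampAt, 45);
        long precise = findPreciseOffset(checkpoint, timestamps.length, timestampAt, 45);
        System.out.println("checkpoint=" + checkpoint + " precise=" + precise);
    }
}
```

Only the sparse offsets are probed in the first phase, so the linear scan in the second phase is bounded by the size of one log segment rather than the whole partition.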

State of consumers in Zookeeper

[Diagram: three Kafka broker nodes and ZooKeeper, which stores the consumer offsets.]

/consumers/group1/offsets/topic1/0-2:119914

/consumers/group1/offsets/topic1/0-1:127994

/consumers/group1/offsets/topic1/0-0:130760

Consumers read the state of their consumption from ZooKeeper.

What if we could change the offset values to whatever we want them to be?

We can go back in time and gracefully make the consumer start reading from that point.

We are setting the seek cursor on a distributed log so that the consumers can read from that point.
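The idea of moving the cursor can be shown with an in-memory stand-in. The sketch below does not talk to ZooKeeper; a plain `Map` plays the role of the znode tree, and the path layout is taken from the example paths in this deck, assuming the last segment means brokerId-partition. `OffsetRewindSketch`, `offsetPath` and `rewind` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

final class OffsetRewindSketch {
    /** Builds a path in the layout shown in the deck: /consumers/<group>/offsets/<topic>/<brokerId>-<partition>. */
    static String offsetPath(String group, String topic, int brokerId, int partition) {
        return "/consumers/" + group + "/offsets/" + topic + "/" + brokerId + "-" + partition;
    }

    /** Overwrites the stored offset at the given path and returns the value it replaced. */
    static long rewind(Map<String, Long> store, String path, long newOffset) {
        Long previous = store.get(path);
        if (previous == null) {
            throw new IllegalArgumentException("no offset stored at " + path);
        }
        store.put(path, newOffset);
        return previous;
    }

    public static void main(String[] args) {
        Map<String, Long> store = new TreeMap<>(); // stand-in for the ZooKeeper znodes
        String path = offsetPath("group1", "topic1", 0, 2);
        store.put(path, 119914L);

        long previous = rewind(store, path, 100000L); // 100000 is an arbitrary earlier offset
        System.out.println(path + ": " + previous + " -> " + store.get(path));
    }
}
```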

Do Not Track Dashboard

Hive data processing for DNT Dashboards

[Diagram: Hive architecture with JDBC Application, Thrift Service, Driver, Compiler, Executor and Metastore.]

Threads execute several Hive queries, which in turn start MapReduce jobs.

The processed data is converted into JSON.

All the older JSON records and the newly processed JSON records are merged suitably.

The JSON data is used by web APIs for data binding.

2013-04-01 AR 0.11265908876536693 0.12200304892132859

2013-04-01 AS 0.159090909090 0. 90910.5

[Diagram: JSON Conversion and Existing JSON data feed into Merge & Sort.]
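One possible reading of the merge-and-sort step is sketched below. The slides only say the records are "merged suitably", so the rules here are assumptions: records are keyed by a hypothetical "date|country" string, a newly processed record replaces an older one with the same key, and the result is sorted by key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MergeAndSortSketch {
    /**
     * Merges existing and newly processed records.
     * Assumption: on a key clash the fresh record wins; TreeMap keeps the result sorted by key.
     */
    static Map<String, String> merge(Map<String, String> existing, Map<String, String> fresh) {
        Map<String, String> merged = new TreeMap<>(existing);
        merged.putAll(fresh);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("2013-04-02|AR", "{\"source\":\"old\"}");
        existing.put("2013-04-01|AR", "{\"source\":\"old\"}");

        Map<String, String> fresh = new HashMap<>();
        fresh.put("2013-04-02|AR", "{\"source\":\"new\"}");
        fresh.put("2013-04-03|AR", "{\"source\":\"new\"}");

        for (Map.Entry<String, String> entry : merge(existing, fresh).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```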

Thank you!

Daniel Einspanjer

Anurag

Harsha

Mark Reid
