Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline

Upload: michael-kehoe

Post on 21-Aug-2015


TRANSCRIPT

Page 1

Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline

Page 2

Agenda

• Define Problem Domain: Justin Michaels | Solution Architect, Couchbase

• Use Case at LinkedIn: Michael Kehoe | Site Reliability Engineer, LinkedIn

• Supporting Technology Overview and Demo: Matt Ingenthron | Senior Director, Couchbase

• Q&A

Page 3

Page 4

Lambda Architecture

[Diagram: the Lambda Architecture as five numbered stages: (1) DATA flows into both (2) a BATCH layer and (3) a SPEED layer, which feed (4) a SERVE layer that answers (5) QUERY traffic.]

Page 5

Lambda Architecture: Interactive and Real Time Applications

[Diagram: the same five-stage Lambda Architecture with technologies overlaid: HADOOP on the BATCH layer, STORM on the SPEED layer, and COUCHBASE on the SERVE layer behind QUERY. Kafka Producers publish into a Broker Cluster as Ordered Subscriptions, and a Spout for each Topic feeds STORM; COUCHBASE appears as both data source and serving store.]

Page 6

• Hadoop … an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware

• Kafka … an append-only write-ahead log that records messages to a persistent store and allows subscribers to read and apply these changes to their own stores in an appropriate time frame (a minimal producer sketch follows this list)

• Storm … a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data

• Couchbase … an open-source, distributed NoSQL document-oriented database, optimized for interactive applications, with an integrated data cache and incremental map-reduce facility
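To make the Kafka bullet concrete, here is a minimal producer sketch using the Apache Kafka Java client. The broker address, topic name, key, and payload are illustrative assumptions, not values from the talk; each send() appends one message to the topic's log, which subscribers then read at their own pace.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MinimalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<>(props);
            // Appends one message to the end of the "example-topic" log.
            producer.send(new ProducerRecord<>("example-topic", "key-1",
                    "{\"event\":\"example\"}"));
            producer.close();
        }
    }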

Page 7

Page 8

[Diagram: a big-data pipeline split into a BATCH TRACK and a REAL-TIME TRACK. Labeled components: COMPLEX EVENT PROCESSING, real-time REPOSITORY, PERPETUAL STORE, ANALYTICAL DB, BUSINESS INTELLIGENCE, MONITORING, CHAT/VOICE SYSTEM, and DASHBOARD.]

Page 9

[Diagram: a TRACKING AND COLLECTION stage and an ANALYSIS AND VISUALIZATION stage, with REST, FILTER, and METRICS labels.]

Page 10

Use Case at LinkedIn

Page 11

Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia

Page 12

Kafka @ LinkedIn

• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log (a minimal consumer sketch follows this list)
• Processes 500+ TB/day (~500 billion messages) at LinkedIn
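As an illustration of the commit-log model, here is a minimal consumer sketch using the newer Apache Kafka Java consumer client, not LinkedIn's internal tooling. Each consumer group tracks its own offset in the log, so independent subscribers can lag or replay without affecting producers; the broker address, group id, and topic are assumptions.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MinimalConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "example-group");           // assumed consumer group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));
                while (true) {
                    // poll() returns records after this group's committed offset.
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }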

Page 13

LinkedIn’s uses of Kafka

• Monitoring
  • InGraphs
• Traditional messaging (pub-sub)
• Analytics
  • Who Viewed My Profile
  • Experiment reports
  • Executive reports
• Building block for distributed (log-based) applications
  • Pinot
  • Espresso

Page 14

Use Case: Kafka to Hadoop (Analytics)

• LinkedIn tracks data to better understand how members use our products

• Information such as which page got viewed and which content got clicked on is sent into a Kafka cluster in each data center

• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation (a toy sketch of this hand-off follows this list)
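LinkedIn's real pipeline for this used purpose-built collection tooling; purely to illustrate the hand-off, the toy sketch below drains a batch of events from a Kafka topic and appends them to a file on HDFS with the Hadoop FileSystem API. The topic name, output path, and addresses are assumptions, and a production loader would roll files and partition them by time.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TrackingToHdfs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed
            props.put("group.id", "hadoop-loader");           // assumed
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-view-events")); // assumed topic
                FileSystem fs = FileSystem.get(new Configuration());
                Path out = new Path("/data/tracking/page-view-events/part-000"); // assumed path
                try (FSDataOutputStream stream = fs.create(out)) {
                    // Drain one batch of tracking events, one JSON line per event.
                    ConsumerRecords<String, String> records = consumer.poll(10000);
                    for (ConsumerRecord<String, String> record : records) {
                        stream.writeBytes(record.value() + "\n");
                    }
                }
            }
        }
    }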

Page 15

Couchbase @ LinkedIn

• About 25 separate services with one or more clusters in multiple data centers

• Up to 100 servers in a cluster

• Single-tenant and multi-tenant clusters

Page 16

Use Case: Jobs Cluster

• Read scaling: Couchbase serves ~80k QPS from 24-server cluster(s)
• Hadoop pre-builds the data by partition
• Couchbase 99th-percentile latencies (a read-path sketch follows this list)
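On a serving path like this, reads are simple key-value gets. Below is a minimal sketch using the Couchbase Java SDK 2.x; the node address, bucket name, document key, and timeout are illustrative assumptions, not LinkedIn's configuration.

    import java.util.concurrent.TimeUnit;
    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;

    public class JobsReadPath {
        public static void main(String[] args) {
            Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // assumed node
            Bucket bucket = cluster.openBucket("jobs");             // assumed bucket

            // A tight per-request timeout keeps tail (99th-percentile) latency
            // bounded; get() returns null when the document does not exist.
            JsonDocument doc = bucket.get("job::12345", 50, TimeUnit.MILLISECONDS);
            if (doc != null) {
                System.out.println(doc.content());
            }
            cluster.disconnect();
        }
    }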

Page 17

Hadoop to Couchbase

• Our primary use case for Hadoop with Couchbase is building (warming) and recovering Couchbase buckets

• LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc. (a generic bulk-load sketch follows this list)
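LinkedIn's loader is in-house and not described in the talk; as a generic illustration of bucket warming, the sketch below upserts records exported by an ETL job into a bucket with the Couchbase Java SDK 2.x. The node, bucket, file name, and the key<TAB>json input format are all assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;
    import com.couchbase.client.java.document.json.JsonObject;

    public class BucketWarmer {
        public static void main(String[] args) throws Exception {
            Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // assumed node
            Bucket bucket = cluster.openBucket("jobs");             // assumed bucket

            // Assumed input: one "key<TAB>json" record per line from the ETL job.
            try (BufferedReader in = new BufferedReader(new FileReader("etl-output.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    // Upsert overwrites any stale copy, which is what a
                    // warm/recover pass wants.
                    bucket.upsert(JsonDocument.create(parts[0],
                            JsonObject.fromJson(parts[1])));
                }
            }
            cluster.disconnect();
        }
    }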