Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline

Upload: michael-kehoe

Post on 21-Aug-2015


TRANSCRIPT

Page 1

Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline

Page 2

Agenda

• Define Problem Domain: Justin Michaels | Solution Architect, Couchbase

• Use Case at LinkedIn: Michael Kehoe | Site Reliability Engineer, LinkedIn

• Supporting Technology Overview and Demo: Matt Ingenthron | Senior Director, Couchbase

• Q&A

Page 3

Page 4

Lambda Architecture

[Diagram: the Lambda Architecture as five numbered stages: (1) DATA flows into both (2) a BATCH layer and (3) a SPEED layer, which feed (4) a SERVE layer that answers (5) QUERY traffic.]

Page 5

Lambda Architecture: Interactive and Real Time Applications

[Diagram: the same five-stage Lambda Architecture with technologies overlaid: HADOOP on the BATCH layer, STORM on the SPEED layer, and COUCHBASE on the SERVE layer behind QUERY. Kafka Producers publish into a Broker Cluster as Ordered Subscriptions, and a Spout for each Topic feeds STORM; COUCHBASE appears as both data source and serving store.]

Page 6

• Hadoop … an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware

• Kafka … an append-only write-ahead log that records messages to a persistent store and allows subscribers to read and apply these changes to their own stores in an appropriate time frame (a minimal producer sketch follows this list)

• Storm … a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data

• Couchbase … an open-source, distributed NoSQL document-oriented database, optimized for interactive applications, with an integrated data cache and incremental map-reduce facility
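To make the Kafka bullet concrete, here is a minimal producer sketch using the Apache Kafka Java client. The broker address, topic name, key, and payload are illustrative assumptions, not values from the talk; each send() appends one message to the topic's log, which subscribers then read at their own pace.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MinimalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<>(props);
            // Appends one message to the end of the "example-topic" log.
            producer.send(new ProducerRecord<>("example-topic", "key-1",
                    "{\"event\":\"example\"}"));
            producer.close();
        }
    }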

Page 7

Page 8

[Diagram: a big-data pipeline split into a BATCH TRACK and a REAL-TIME TRACK. Labeled components: COMPLEX EVENT PROCESSING, real-time REPOSITORY, PERPETUAL STORE, ANALYTICAL DB, BUSINESS INTELLIGENCE, MONITORING, CHAT/VOICE SYSTEM, and DASHBOARD.]

Page 9

[Diagram: a TRACKING AND COLLECTION stage and an ANALYSIS AND VISUALIZATION stage, with REST, FILTER, and METRICS labels.]

Page 10

Use Case at LinkedIn

Page 11

Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia

Page 12

Kafka @ LinkedIn

• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log (a minimal consumer sketch follows this list)
• Processes 500+ TB/day (~500 billion messages) at LinkedIn
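As an illustration of the commit-log model, here is a minimal consumer sketch using the newer Apache Kafka Java consumer client, not LinkedIn's internal tooling. Each consumer group tracks its own offset in the log, so independent subscribers can lag or replay without affecting producers; the broker address, group id, and topic are assumptions.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MinimalConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "example-group");           // assumed consumer group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));
                while (true) {
                    // poll() returns records after this group's committed offset.
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }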

Page 13

LinkedIn’s uses of Kafka

• Monitoring
  • InGraphs
• Traditional messaging (pub-sub)
• Analytics
  • Who Viewed My Profile
  • Experiment reports
  • Executive reports
• Building block for distributed (log-based) applications
  • Pinot
  • Espresso

Page 14

Use Case: Kafka to Hadoop (Analytics)

• LinkedIn tracks data to better understand how members use our products

• Information such as which page got viewed and which content got clicked on is sent into a Kafka cluster in each data center

• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation (a toy sketch of this hand-off follows this list)
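LinkedIn's real pipeline for this used purpose-built collection tooling; purely to illustrate the hand-off, the toy sketch below drains a batch of events from a Kafka topic and appends them to a file on HDFS with the Hadoop FileSystem API. The topic name, output path, and addresses are assumptions, and a production loader would roll files and partition them by time.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TrackingToHdfs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed
            props.put("group.id", "hadoop-loader");           // assumed
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-view-events")); // assumed topic
                FileSystem fs = FileSystem.get(new Configuration());
                Path out = new Path("/data/tracking/page-view-events/part-000"); // assumed path
                try (FSDataOutputStream stream = fs.create(out)) {
                    // Drain one batch of tracking events, one JSON line per event.
                    ConsumerRecords<String, String> records = consumer.poll(10000);
                    for (ConsumerRecord<String, String> record : records) {
                        stream.writeBytes(record.value() + "\n");
                    }
                }
            }
        }
    }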

Page 15

Couchbase @ LinkedIn

• About 25 separate services with one or more clusters in multiple data centers

• Up to 100 servers in a cluster

• Single-tenant and multi-tenant clusters

Page 16

Use Case: Jobs Cluster

• Read scaling: Couchbase serves ~80k QPS from 24-server cluster(s)
• Hadoop pre-builds the data by partition
• Couchbase 99th-percentile latencies (a read-path sketch follows this list)
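On a serving path like this, reads are simple key-value gets. Below is a minimal sketch using the Couchbase Java SDK 2.x; the node address, bucket name, document key, and timeout are illustrative assumptions, not LinkedIn's configuration.

    import java.util.concurrent.TimeUnit;
    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;

    public class JobsReadPath {
        public static void main(String[] args) {
            Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // assumed node
            Bucket bucket = cluster.openBucket("jobs");             // assumed bucket

            // A tight per-request timeout keeps tail (99th-percentile) latency
            // bounded; get() returns null when the document does not exist.
            JsonDocument doc = bucket.get("job::12345", 50, TimeUnit.MILLISECONDS);
            if (doc != null) {
                System.out.println(doc.content());
            }
            cluster.disconnect();
        }
    }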

Page 17

Hadoop to Couchbase

• Our primary use case for Hadoop with Couchbase is building (warming) and recovering Couchbase buckets

• LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc. (a generic bulk-load sketch follows this list)
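LinkedIn's loader is in-house and not described in the talk; as a generic illustration of bucket warming, the sketch below upserts records exported by an ETL job into a bucket with the Couchbase Java SDK 2.x. The node, bucket, file name, and the key<TAB>json input format are all assumptions.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;
    import com.couchbase.client.java.document.json.JsonObject;

    public class BucketWarmer {
        public static void main(String[] args) throws Exception {
            Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // assumed node
            Bucket bucket = cluster.openBucket("jobs");             // assumed bucket

            // Assumed input: one "key<TAB>json" record per line from the ETL job.
            try (BufferedReader in = new BufferedReader(new FileReader("etl-output.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    // Upsert overwrites any stale copy, which is what a
                    // warm/recover pass wants.
                    bucket.upsert(JsonDocument.create(parts[0],
                            JsonObject.fromJson(parts[1])));
                }
            }
            cluster.disconnect();
        }
    }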