
Stream Processing @Scale in LinkedIn

Yi Pan, Data Infrastructure

Samza Team @LinkedIn


Overview

• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features

Stream Processing

• What is stream processing?
  – Input: an unbounded sequence of events
    • E.g. web server logs, user activity tracking events, database changelogs, etc.
  – Latency: near real-time
    • From milliseconds to minutes, instead of hours to days
  – Output: an unbounded sequence of changes to the derived dataset
    • The derived dataset is usually the final or partial analytic result, and can live either in another stream or in a serving data store

Stream Processing

[Diagram: the response-latency spectrum, from synchronous RPC at 0 ms, through stream processing at milliseconds to minutes, to results that arrive later, possibly much later]

Stream Processing

• What are the application requirements?
  – Scalable, fast, stateful stream processing
  – What scale should we operate at?
    • Traffic volume: 1.4 trillion events/day
    • Intermediate state size: multiple TB per colo (*)
  – Why is it expensive to run stream processing at scale?
    • Intermediate data sets need to be stored to allow low-latency processing
    • Large volumes of data need to be pulled and pushed via the network

Overview

• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features

What's Samza

• Samza is a distributed Turing machine
  – A single-task Samza job is a stateful Turing machine

[Diagram: a Samza task consumes an input stream, produces an output stream, and keeps local state backed by a changelog stream and a checkpoint]

What's Samza

• Scaling a Samza job: partition the streams

[Diagram: input streams A and B are each split into partitions 0 through n, and each partition is consumed by a Samza task with its own local state]

What's Samza

• Scaling a Samza job: replicating the state machine

[Diagram: replicated job instances act as copies of the same state machine and coordinate through a shared checkpoint]

What's Samza

• Samza execution in YARN

[Diagram: deploying a Samza job to a YARN cluster; an Application Master coordinates the job while Samza containers run on Host 1, Host 2, and Host 3]

What's Samza

• States in Samza
  – Checkpoints
    • Offsets per input stream partition
  – State stores
    • In-memory or on-disk (RocksDB) derived data sets

[Diagram: a Samza task on Host 1 produces output stream partitions, a state changelog partition, and a checkpoint]
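To make the state-store API concrete, here is a minimal sketch (not from the talk) of a task that obtains a local key-value store from its TaskContext in init() and updates it in process(). The store name "page-view-counts" is an assumed example; a real job declares the store, and the changelog stream that backs it, under the stores.* keys of its configuration.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCountStoreTask implements StreamTask, InitableTask {
  // Local, changelog-backed store; "page-view-counts" is an assumed store name.
  private KeyValueStore<String, Integer> store;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    store = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Reads and writes hit the local store (in-memory or RocksDB); the changelog
    // stream keeps a durable copy so the store can be rebuilt after a failure.
    String pageKey = (String) envelope.getKey();
    Integer current = store.get(pageKey);
    store.put(pageKey, current == null ? 1 : current + 1);
  }
}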

What's Samza

• States in Samza
  – Checkpoints and local state stores are backed by distributed logs

[Diagram: the checkpoint and state changelog partitions for the task on Host 1 live in distributed logs; when the task is moved to Host 2, it restores its offsets and state from those logs]

What's Samza

• Multiple jobs in a dataflow

[Diagram: a dataflow in which input streams A, B, and C feed Job 1 and Job 2, whose output streams D and E feed Job 3, which produces stream F]

Overview

• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features

Samza Programming API

• Example task (shown consuming Partition 0): counting page views per page key

class PageKeyViewsCounterTask implements StreamTask {
  // pageKeyViews (a map of per-page counters) and countStream (the output
  // stream) are task fields that the slide elides.
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Decode the incoming Avro record and pull out the page key.
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();

    // Bump the running count for this page and emit the new value downstream.
    int newCount = pageKeyViews.get(pageKey).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}
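Jobs are composed into larger dataflows (as on the "Multiple Jobs in a Dataflow" slide above) simply by letting one job's output stream be another job's input. As a hedged illustration, not taken from the talk, a downstream job might consume the counter's output and flag pages whose count crosses a threshold; the stream name and threshold below are made up.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class HotPageDetectorTask implements StreamTask {
  // Hypothetical alert stream and threshold, for illustration only.
  private static final SystemStream ALERTS = new SystemStream("kafka", "hot-page-alerts");
  private static final int THRESHOLD = 10000;

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Input messages are the (pageKey, newCount) pairs emitted by the counter job.
    String pageKey = (String) envelope.getKey();
    int count = (Integer) envelope.getMessage();
    if (count >= THRESHOLD) {
      collector.send(new OutgoingMessageEnvelope(ALERTS, pageKey, count));
    }
  }
}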

Overview

• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features

Stream Processing @ LinkedIn

[Diagram: LinkedIn's streaming ecosystem; web servers and monitoring servers publish tracking events and metrics to Kafka, Oracle and Espresso changelogs flow through Databus, and Samza jobs consume these streams (including bootstrap streams) and write derived data to serving stores such as Voldemort]

Stream Processing @ LinkedIn

• Tracking aggregate/analysis (ACG)
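As a sketch of how such an aggregation job can be structured (an illustration, not LinkedIn's actual ACG code), Samza's WindowableTask lets a task accumulate counts in process() and flush them on a timer configured with task.window.ms. The output stream name below is assumed.

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class TrackingEventAggregatorTask implements StreamTask, WindowableTask {
  // Hypothetical output stream for the per-window aggregates.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "tracking-event-counts");

  private final Map<String, Integer> counts = new HashMap<>();

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Accumulate a count per event key within the current window.
    String eventKey = (String) envelope.getKey();
    counts.merge(eventKey, 1, Integer::sum);
  }

  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    // Called every task.window.ms: emit the aggregates and reset for the next window.
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, entry.getKey(), entry.getValue()));
    }
    counts.clear();
  }
}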

Stream Processing @ LinkedIn

• Content standardization with an adjunct data set

[Diagram: a bootstrap job reads the member profile DB changelog via Databus; the content standardization job loads that adjunct data into local state, joins incoming Kafka events against it, and writes standardized results back to Kafka]
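A minimal sketch of what such a stream-to-adjunct-data join can look like (names are illustrative; this is not the actual LinkedIn job): the profile changelog is consumed as a bootstrap stream into a local store, and content events are enriched against it.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class ContentStandardizationTask implements StreamTask, InitableTask {
  // Hypothetical stream and store names; a real job takes these from config.
  private static final String PROFILE_CHANGELOG = "MemberProfileChangelog";
  private static final SystemStream OUTPUT = new SystemStream("kafka", "StandardizedContent");

  private KeyValueStore<String, String> profiles;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    profiles = (KeyValueStore<String, String>) context.getStore("member-profiles");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    if (PROFILE_CHANGELOG.equals(stream)) {
      // Adjunct data: keep the latest profile for each member in the local store.
      profiles.put((String) envelope.getKey(), (String) envelope.getMessage());
    } else {
      // Content event: look up the co-partitioned profile and emit the enriched record.
      String memberId = (String) envelope.getKey();
      String profile = profiles.get(memberId);
      collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId,
          standardize((String) envelope.getMessage(), profile)));
    }
  }

  private String standardize(String content, String profile) {
    // Placeholder for the real standardization logic.
    return profile == null ? content : content + " | " + profile;
  }
}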

Stream Processing @ LinkedIn

• Kafka deployment
  – 1.1 trillion messages / day
• Databus deployment
  – 300 billion messages / day
• Samza deployment
  – Multiple colos
  – 10+ YARN clusters
  – 200+ nodes
  – 100+ jobs in production

Overview

• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features

Upcoming Features

• New features
  – Local state store improvements
    • RocksDB TTL support
    • Fast recovery
  – Dynamic configuration
  – Easier deployment with standalone jobs
  – High-level query language for faster development

Contact Us / Get Involved

• Open Source
  – Documentation: samza.apache.org
  – Mailing list: [email protected]
  – JIRA: https://issues.apache.org/jira/browse/SAMZA