apache samza - new features in the upcoming samza release 0.10.0

35
Apache Samza 0.10.0 What’s coming up in the next Samza release LinkedIn Navina R Committer @ Apache Samza

Upload: navina-ramesh

Post on 14-Apr-2017

2.286 views

Category:

Engineering


1 download

TRANSCRIPT

Page 1: Apache Samza - New features in the upcoming Samza release 0.10.0

Apache Samza 0.10.0What’s coming up in the next Samza release

LinkedInNavina RCommitter @ Apache Samza

Page 2: Apache Samza - New features in the upcoming Samza release 0.10.0

New Features in Samza 0.10.0Dynamic Configuration & Control

◦Coordinator Stream ◦Broadcast Stream

Host affinity in SamzaNew Consumer: KinesisNew Producers: Kinesis, HDFS,

ElasticSearchUpgraded RocksDB

Page 3: Apache Samza - New features in the upcoming Samza release 0.10.0

Dynamic Configuration & Control1. Coordinator Stream2. Broadcast Stream

Page 4: Apache Samza - New features in the upcoming Samza release 0.10.0

How does Config work today?

Job Confi

gRM

AM

C0 C1 C2

Submit Job

Cfg via cmd line

Cfg via cmd line

Job deployment in Yarn: Job is localized to the

Resource Manager (RM) RM allocates a container

for the Application Master (AM) and passes the config parameters as command-line arguments to the run-am script

Similarly, AM passes config to the containers on allocation

Checkpoint Stream

Page 5: Apache Samza - New features in the upcoming Samza release 0.10.0

ProblemsJob

Config

RM

AM

C0 C1 C2

Submit Job

Cfg via cmd line

Cfg via cmd line

Escaping / Unescaping quotes is cumbersome (SAMZA-700)

Limits the number of arguments that can be set through shell command line (SAMZA-337, SAMZA-333)

Dynamic config change not possible. Every config change requires a job re-submission (restart) (SAMZA-348)

Handle system config like checkpoints differently than user-defined config (SAMZA-348)

Checkpoint Stream

Page 6: Apache Samza - New features in the upcoming Samza release 0.10.0

Solution: Coordinator Stream

RM

AM

C0 C1 C2

Submit Job

JC

Coordinator Stream

Config requested via HTTP

Coordinator Stream (CS) Single partition Log-compacted Each job has its own

CSJob Coordinator (JC) Exposes HTTP end-

point for containers to query for Job Model

Bootstraps from CS and then, continues consumption from CSSamza job deployment using Job

Coordinator & Coorindator Stream

Bootstraps config from stream

Page 7: Apache Samza - New features in the upcoming Samza release 0.10.0

Data in Coordinator StreamCoordinator Stream (CS) contains:

◦Checkpoints for the input streams Containers periodically write to checkpoints to CS,

instead of a separate checkpoint topic◦Task-to-changelog partition mapping◦Container Locality Info (required for Host

Affinity) Containers write their location (machine-name) to

CS◦User-defined configuration

Entire configuration is written to the CS when the job is started

◦Migration related messages

Page 8: Apache Samza - New features in the upcoming Samza release 0.10.0

* Work In Progress

Coordinator Stream: Benefits

RM

AM

C0 C1 C2

Submit Job

JC

Coordinator Stream

Config requested via HTTP

Config can be easily serialized / deserialized

Checkpoints & user-defined configs are stored similarly

Config change can be made by writing to the CS*

JC can be used to coordinate job execution*

Samza job deployment using Job Coordinator & Coorindator Stream

Bootstraps config from stream

Page 9: Apache Samza - New features in the upcoming Samza release 0.10.0

Coordinator Stream: Tools / MigrationTools:Command-line tool to write

config changes to coordinator stream

Migration:JobRunner in 0.10.0

automatically migrates checkpoints and changelog mappings in 0.9.1 to Coordinator Stream in 0.10.0

Page 10: Apache Samza - New features in the upcoming Samza release 0.10.0

Broadcast StreamStream consumed by all Tasks in the job

Page 11: Apache Samza - New features in the upcoming Samza release 0.10.0

MotivationDynamically configure job

behavior Acts a custom control channel for

an application

Page 12: Apache Samza - New features in the upcoming Samza release 0.10.0

A typical input stream

Task-0

Task-1

Task-2

Task-3

MyInputStreamPartition-0

Partition-1

Partition-2

Partition-3

task.inputs = $system-name.$stream-name

task.inputs = kafka.MyInputStream

One stream partition consumed

only by one task

Page 13: Apache Samza - New features in the upcoming Samza release 0.10.0

Broadcast Stream

Task-0

Task-1

Task-2

Task-3

MyInputStreamPartition-0

Partition-1

Partition-2

Partition-3

MyBroadcastStreamPartition-0

task.inputs = $system-name.$stream-nametask.global.inputs = $system-name.$stream-

name#$partition-numbertask.inputs = kafka.MyInputStream

task.global.inputs = kafka.MyBroadcastStream#0

One stream partition consumed

only by ALL partitions

Page 14: Apache Samza - New features in the upcoming Samza release 0.10.0

Broadcast Stream

Task-0

Task-1

Task-2

Task-3

MyInputStreamPartition-0

Partition-1

Partition-2

Partition-3

MyBroadcastStreamPartition-0 Partition-1 Partition-2

task.inputs = $system-name.$stream-nametask.global.inputs = $system-name.$stream-

name#[$partition-range]task.inputs = kafka.MyInputStream

task.global.inputs = kafka.MyBroadcastStream#[0-1]

Page 15: Apache Samza - New features in the upcoming Samza release 0.10.0

Host-AffinityMaking Samza aware of container locality

Page 16: Apache Samza - New features in the upcoming Samza release 0.10.0

A Stateful JobP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-A Host-B Host-C

Changelog Stream

Stable State

Page 17: Apache Samza - New features in the upcoming Samza release 0.10.0

Fault Tolerance in a Stateful JobP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-A Host-B Host-C

Changelog Stream

Task-0 & Task-1 running on the

container in Host-A fail

Page 18: Apache Samza - New features in the upcoming Samza release 0.10.0

Fault Tolerance in a Stateful Job

P0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

Changelog Stream

Yarn allocates the tasks to a container on a different host!

Page 19: Apache Samza - New features in the upcoming Samza release 0.10.0

Fault Tolerance in a Stateful Job

P0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2

P3

Host-E Host-B Host-C

::

01

159

::

01

82

Local state restored by consuming the

changelog from the earliest offset!

Page 20: Apache Samza - New features in the upcoming Samza release 0.10.0

Fault Tolerance in a Stateful Job

P0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

After restored, job continues with

input processing – Back to Stable

State!

Page 21: Apache Samza - New features in the upcoming Samza release 0.10.0

ProblemsP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

State stores are not persisted if the container fails◦ Tasks need to restore the state

stores from the change-log before continuing with input processing

Samza AppMaster is not aware of host locality for a container◦ Container gets relocated to a

new host Excessive start-up times

when a job is restarted

Page 22: Apache Samza - New features in the upcoming Samza release 0.10.0

MotivationDuring upgrades and job failures,

◦Local state built in the task is lost◦Samza is not aware of the container

locality◦Job start-up time is large (hours)

Job is no longer “near-realtime”Multiple stateful jobs starting up

at the same time will DDoS kafka – saturating the Kafka clusters

Page 23: Apache Samza - New features in the upcoming Samza release 0.10.0

Solution: Host Affinity in SamzaHost Affinity – ability of Samza to

allocate a container to the same machine across job restarts/deployments

Host affinity is best-effort◦Cluster load may vary◦Machine may be non-responsive◦Container should shutdown cleanly

Page 24: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

Coordinator Stream

Container-0 -> Host-EContainer-1 -> Host-BContainer-2 -> Host-C

Persist container locality in

Coordinator Stream

Page 25: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

Coordinator Stream

Container-0 -> Host-EContainer-1 -> Host-BContainer-2 -> Host-C

Task-0 & Task-1 running on the container in Host-

E fail

Page 26: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

Coordinator Stream

Container-0 -> Host-EContainer-1 -> Host-BContainer-2 -> Host-C

AM JC

Tasks failed, but local state stores remain!

Page 27: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-C

Coordinator Stream

Container-0 -> Host-EContainer-1 -> Host-BContainer-2 -> Host-C

AM JC

RM

Ask: Host-E Allocate: Host-E

Job Coordinator is aware of container locality!

Page 28: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2

P3

Host-E Host-B Host-C

::

01

159

::

01

82

Coordinator Stream

Container-0 -> Host-EContainer-1 -> Host-BContainer-2 -> Host-CContainer-0 -> Host-E

State store does not have to be restored from the

earliest offset!

Page 29: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaP0

P1

P2P3

Task-0 Task-1 Task-2 Task-3

P0

P1

P2P3

Host-E Host-B Host-CJob back to Stable

state pretty quickly!

Page 30: Apache Samza - New features in the upcoming Samza release 0.10.0

Host Affinity in SamzaEnable host-affinity

◦yarn.samza.host-affinity.enabled=true

Enable continuous scheduling in Yarn

Useful for stateful jobsDoes not affect stateless jobs

Page 31: Apache Samza - New features in the upcoming Samza release 0.10.0

Upgraded RocksDB

Page 32: Apache Samza - New features in the upcoming Samza release 0.10.0

Upgraded RocksDBNew RocksDb JNI 3.13.1+ version

supports TTL Impact:

◦Removes the need to write customized code to delete expired records

Page 33: Apache Samza - New features in the upcoming Samza release 0.10.0

New Features in Samza 0.10.0Dynamic Configuration & Control

◦Coordinator Stream ◦Broadcast Stream

Host affinity in SamzaNew Consumer: KinesisNew Producers: Kinesis, HDFS,

ElasticSearchUpgraded RocksDB

Page 34: Apache Samza - New features in the upcoming Samza release 0.10.0

Thanks!Expected release date – Nov

2015Thanks to all the contributors! Contact Us:

◦Mailing List – [email protected]

◦Twitter - #samza, @samzastream

Page 35: Apache Samza - New features in the upcoming Samza release 0.10.0

Questions?