Apache Kafka - Martin Podval

Apache Kafka @MartinPodval, hpsv.cz

Upload: martin-podval

Post on 18-Jul-2015

Category:

Software


TRANSCRIPT

Page 1: Apache Kafka - Martin Podval

Apache Kafka

@MartinPodval, hpsv.cz

Page 2: Apache Kafka - Martin Podval

What is Apache Kafka?

Messaging System
Distributed
Persistent and Replicable
Very fast - low latency - and scalable
Simple but highly configurable
By LinkedIn, open sourced under apache.org

Page 3: Apache Kafka - Martin Podval

Data Streaming

New kind of data ...
● User or application data (events) streams
● Monitoring - App, System
● App Logging
● High volume

Page 4: Apache Kafka - Martin Podval

Data Streaming Cont’d

… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch

Page 5: Apache Kafka - Martin Podval

HP Service Virtualization Use Cases

Processing of client message streams

Real-time performance modeling

Log aggregation

Page 6: Apache Kafka - Martin Podval

How To Solve It?

Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic

Page 7: Apache Kafka - Martin Podval

Kafka Cluster

Brokers
● = Instances, Nodes
● Topics
● Partitions
● Replicas

ZooKeeper (ZK)
● Coordination

Page 8: Apache Kafka - Martin Podval

Kafka Topics

Commit Log
● Immutable
● Ordered
● Sequential Offset
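A minimal sketch of the commit-log idea on this slide, assuming a single partition held in memory. The class and method names (`CommitLogSketch`, `append`, `readFrom`) are hypothetical and are not Kafka's API; the point is only that records are appended at the tail, never modified, and addressed by a sequential offset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of one partition's commit log; not Kafka's code.
class CommitLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends at the tail only and returns the offset assigned to the record.
    long append(String message) {
        records.add(message);
        return records.size() - 1;
    }

    // Returns a read-only copy of every record from the given offset onward, in append order.
    List<String> readFrom(long offset) {
        if (offset < 0 || offset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return Collections.unmodifiableList(
                new ArrayList<>(records.subList((int) offset, records.size())));
    }

    public static void main(String[] args) {
        CommitLogSketch log = new CommitLogSketch();
        log.append("a");
        log.append("b");
        long third = log.append("c");
        System.out.println(third);           // prints 2
        System.out.println(log.readFrom(1)); // prints [b, c]
    }
}
```

Because a reader only holds an offset, re-reading older records is just a matter of asking again with a smaller offset.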

Page 9: Apache Kafka - Martin Podval

Kafka Topics Cont’d

Partitioned
Independently:
● Stored
● Produced
● Consumed

⇒ Scalable

Replicated
● On a per-partition basis
● On different brokers

⇒ Fault Tolerant
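The two ideas on this slide can be sketched as follows. This is an illustration under stated assumptions, not Kafka's actual partitioner or replica placement algorithm: the key is hashed onto a partition, and each partition's replicas are spread round-robin so that they land on different brokers. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of partitioning and replica placement; not Kafka's algorithm.
class PartitionSketch {
    // Maps a key to a partition; Math.floorMod keeps the result non-negative.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Picks replicationFactor distinct brokers for a partition, round-robin (assumed scheme).
    static List<Integer> replicaBrokers(int partition, int brokerCount, int replicationFactor) {
        if (replicationFactor > brokerCount) {
            throw new IllegalArgumentException("not enough brokers for distinct replicas");
        }
        List<Integer> brokers = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            brokers.add((partition + i) % brokerCount);
        }
        return brokers;
    }

    public static void main(String[] args) {
        int partition = partitionFor("user-42", 4);
        System.out.println("partition = " + partition);
        System.out.println("replicas  = " + replicaBrokers(partition, 3, 2));
    }
}
```

The same key always maps to the same partition, which is what preserves ordering per key while the topic as a whole scales across brokers.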

Page 10: Apache Kafka - Martin Podval

What Can I Do?

producer.write(topic_id, message);

consumer.read(topic_id, offset);

Page 11: Apache Kafka - Martin Podval

I Want To Produce

● java/scala client
● address of one or more brokers
● choose a topic where to produce
● highly configurable and tunable:
  ○ partitioner
  ○ number of acks (async=0, master=1, replicas=1+?)
  ○ batching, buffer size, timeouts, retries, ...
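As a rough illustration of the tunables listed above, the sketch below only builds a `java.util.Properties` object and does not contact a broker. The property names are those of the Java producer client as far as I know them; the older Scala producer uses different names, so check them against the client version in use. The broker addresses and all values are made-up examples.

```java
import java.util.Properties;

// Builds an example producer configuration; values and host names are assumptions.
class ProducerConfigSketch {
    static Properties producerConfig(String brokers, String acks) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers); // one or more brokers to start from
        props.setProperty("acks", acks);                 // "0", "1" or "all"
        props.setProperty("retries", "3");               // resend attempts on failure
        props.setProperty("batch.size", "16384");        // bytes collected per partition batch
        props.setProperty("linger.ms", "5");             // wait time to fill a batch
        props.setProperty("buffer.memory", "33554432");  // total bytes the producer may buffer
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses.
        Properties props = producerConfig("broker1:9092,broker2:9092", "all");
        props.list(System.out);
    }
}
```

Lower `acks` and larger batches favor throughput; higher `acks` and more retries favor durability.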

Page 12: Apache Kafka - Martin Podval

I Want To Consume

High Level API
● Groups abstraction
  ○ To All, To One
  ○ To Some
● Stream API
● Stores positions to support fault tolerance
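One reading of the groups abstraction, sketched under assumptions: every group receives the whole topic, while inside a group each partition is read by exactly one consumer. Consumers that each sit in their own group therefore all see every message ("to all"), and consumers sharing a group split the work ("to one"). The round-robin assignment and all names below are hypothetical and do not reproduce Kafka's own assignment strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical partition assignment inside ONE consumer group.
class ConsumerGroupSketch {
    // Spreads partitions 0..partitionCount-1 over the group's consumers, round-robin.
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share one group: each partition goes to exactly one of them.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4)); // {c1=[0, 2], c2=[1, 3]}
        // A group with a single consumer reads every partition.
        System.out.println(assign(Arrays.asList("solo"), 4));     // {solo=[0, 1, 2, 3]}
    }
}
```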

Page 13: Apache Kafka - Martin Podval

I Want To Consume Cont’d

Low Level
● Java/scala client
● Find the leader for a topic partition
● Calculate an offset
● Fetch messages
  ○ Re-consume if needed

Page 14: Apache Kafka - Martin Podval

I Want To Consume Cont’d

Delivery Semantics:
● At most once
● At least once
● Exactly once
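The difference between the first two semantics comes down to when the consumer stores its position relative to processing a message. The simulation below is a hypothetical sketch, not client code: the consumer crashes once while handling one record and then restarts from its last committed offset. Committing before processing loses that record (at most once); committing after processing repeats it (at least once). Exactly once would additionally need idempotent processing or storing the offset together with the output, which is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of commit ordering; not Kafka client code.
class DeliverySemanticsSketch {
    // Simulates one crash while handling the record at crashAt, followed by a
    // restart from the last committed offset. Returns the records processed.
    static List<String> consume(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // next offset to read after a restart
        boolean crashed = false;
        int offset = 0;
        while (offset < log.size()) {
            if (commitBeforeProcessing) {
                committed = offset + 1;             // at most once: commit first
            }
            if (!crashed && offset == crashAt) {
                crashed = true;
                if (!commitBeforeProcessing) {
                    processed.add(log.get(offset)); // work done, but the commit is lost
                }
                offset = committed;                 // restart from the committed position
                continue;
            }
            processed.add(log.get(offset));
            if (!commitBeforeProcessing) {
                committed = offset + 1;             // at least once: commit afterwards
            }
            offset++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println(consume(log, 1, true));  // prints [a, c]
        System.out.println(consume(log, 1, false)); // prints [a, b, b, c]
    }
}
```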

Page 15: Apache Kafka - Martin Podval

Kafka Internals - Disks

Avoid:
● GC
● Random disk access

Page 16: Apache Kafka - Martin Podval

Kafka Internals - Disks Cont’d

Disks are fast ...

… when properly used
● sequential access - read ahead, write behind
● rely on operating system
  ○ avoid heap, materialization and GC
● it’s more like file copy over network

It’s easy … with immutable topics
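The "file copy over network" idea can be approximated in Java with `FileChannel.transferTo`, which hands the copy to the operating system when the target channel supports it (a socket, for example), so the bytes need not be materialized as objects on the heap. The sketch below is an assumption about how such a transfer might look, not Kafka's source; the segment file content and the in-memory target are stand-ins chosen so the example runs without a network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of serving a log segment from a given byte position.
class ZeroCopySketch {
    // Sends bytes [position, end of file) of a segment file to the target channel.
    static long send(Path segment, long position, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(segment, StandardOpenOption.READ)) {
            long remaining = source.size() - position;
            long sent = 0;
            while (sent < remaining) {
                // transferTo may move fewer bytes than requested, so loop.
                sent += source.transferTo(position + sent, remaining - sent, target);
            }
            return sent;
        }
    }

    // Demo with a temp file and an in-memory target; a broker would use a socket channel.
    static String demo() {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            try {
                Files.write(segment, "m0m1m2".getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (WritableByteChannel target = Channels.newChannel(out)) {
                    send(segment, 2, target);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(segment);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints m1m2
    }
}
```

Since a segment is never modified after it is written, a consumer's position can be served as a plain byte range of a file, which is what makes this approach easy.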

Page 17: Apache Kafka - Martin Podval

Kafka Internals - Replication

“In Sync” Replicas
● Replication factor on partition basis
● One leader + 0..n replicas
● Replicas are consumers
  ○ “In Sync” if they are not “too far” behind a leader
  ○ Batch sync
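A minimal sketch of the "not too far behind" rule, assuming lag is measured in messages. The threshold `maxLagMessages`, the replica names, and the method are hypothetical; real brokers expose their own lag settings, which may be time-based instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-sync check based on message lag; not Kafka's implementation.
class InSyncReplicaSketch {
    // Returns the replicas whose fetched offset is within maxLagMessages of the leader.
    static List<String> inSync(long leaderEndOffset,
                               Map<String, Long> replicaOffsets,
                               long maxLagMessages) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : replicaOffsets.entrySet()) {
            if (leaderEndOffset - entry.getValue() <= maxLagMessages) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        offsets.put("replica-1", 990L); // 10 messages behind
        offsets.put("replica-2", 400L); // 600 messages behind
        System.out.println(inSync(1000L, offsets, 100L)); // prints [replica-1]
    }
}
```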

Page 18: Apache Kafka - Martin Podval

Kafka Internals - Replication Cont’d

Tunable Trade-Offs
● Producer’s write method:
  ○ Not blocked, async
  ○ Waits for master ACK
  ○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas
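The trade-offs on this slide can be modeled roughly as below. This is a sketch under assumptions, with hypothetical names throughout: the three producer modes differ in how many acknowledgements they wait for, the minimum in-sync replica setting only rejects writes that asked for all in-sync replicas, and a message counts as committed once every in-sync replica has it, so consumers read only below the smallest in-sync log end offset. The in-sync set is taken to include the leader.

```java
// Hypothetical model of acknowledgement modes and committed offsets.
class AckSketch {
    enum Acks { NONE, LEADER, ALL_IN_SYNC }

    // Number of broker acknowledgements the producer waits for.
    static int acksToWaitFor(Acks mode, int inSyncReplicaCount) {
        switch (mode) {
            case NONE:   return 0;                  // not blocked, async
            case LEADER: return 1;                  // waits for the leader only
            default:     return inSyncReplicaCount; // waits for all in-sync replicas
        }
    }

    // A write asking for all in-sync replicas is accepted only if enough are in sync.
    static boolean accepts(Acks mode, int inSyncReplicaCount, int minInSyncReplicas) {
        return mode != Acks.ALL_IN_SYNC || inSyncReplicaCount >= minInSyncReplicas;
    }

    // Smallest log end offset among in-sync replicas; consumers read offsets below it.
    static long highWatermark(long[] inSyncLogEndOffsets) {
        if (inSyncLogEndOffsets.length == 0) {
            throw new IllegalArgumentException("at least the leader must be in sync");
        }
        long min = Long.MAX_VALUE;
        for (long end : inSyncLogEndOffsets) {
            min = Math.min(min, end);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(acksToWaitFor(Acks.ALL_IN_SYNC, 3));       // prints 3
        System.out.println(accepts(Acks.ALL_IN_SYNC, 1, 2));          // prints false
        System.out.println(highWatermark(new long[] {120, 118, 95})); // prints 95
    }
}
```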

Page 19: Apache Kafka - Martin Podval

Performance

“Incredible”

Scales with:
● clients count, message size
● number of replicas, partitions or topics

Depends on network and disk throughput

Page 20: Apache Kafka - Martin Podval

Performance Cont’d

Our testing
● 3 nodes, master + 2 replicas
● 500 000 msg/s (100-byte messages)
● 400 Mbit/s - 1.2 Gbit/s network throughput
● end-to-end latency 2-3 ms

@see http://bit.ly/1FsIR9a
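As a sanity check, the message rate and the network figures in this test are mutually consistent if protocol overhead is ignored. The last line is only one possible reading of the upper bound, namely that the same payload appears three times on the network (the master plus 2 replicas); the slide does not say what the range refers to.

```latex
\begin{align*}
500\,000\ \tfrac{\text{msg}}{\text{s}} \times 100\ \tfrac{\text{B}}{\text{msg}} &= 50\ \text{MB/s} \\
50\ \text{MB/s} \times 8\ \tfrac{\text{bit}}{\text{B}} &= 400\ \text{Mbit/s} \\
3 \times 400\ \text{Mbit/s} &= 1.2\ \text{Gbit/s}
\end{align*}
```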

Page 21: Apache Kafka - Martin Podval

Ease of Use

● No installation, just run a java/scala program
● Streams in files & dirs
● Transparent ZooKeeper
● Ecosystem

Page 22: Apache Kafka - Martin Podval

Cons

● Beta version
● Dependency on ZooKeeper
● The way it is written in Scala
● No easy way to remove messages

Page 23: Apache Kafka - Martin Podval

Questions?