Kafka Meetup JP #3 - Engineering Apache Kafka at LINE
TRANSCRIPT
Engineering Apache Kafka at LINE
Yuto Kawamura
Speaker introduction
- Name: Yuto Kawamura (kawamuray)
- Software engineer @ LINE
- Designing and implementing inter-service data passing infrastructure
- ❤ Apache Kafka
- Apache Kafka contributor
- KAFKA-4614 Improved broker’s response time
- KAFKA-4024 Removed unnecessary blocking behavior of producer
- Past publications
- Applying Kafka Streams for internal message delivery pipeline
- https://engineering.linecorp.com/en/blog/detail/80
- Monitoring Apache Kafka w/ Prometheus
- https://www.slideshare.net/kawamuray/monitoring-kafka-w-prometheus
Apache Kafka - a new layer for service isolation
Application server
Another app server
Stats system
Abuser detection system
Produce once: unified data format, general info
A failed destination can resume consuming data after it recovers.
Apache Kafka
- Publishers can stay agnostic about:
- who uses the data
- remote failures
- Subscribers can:
- stay isolated from unexpectedly high workload on the publisher side
- stop consuming temporarily and continue after recovering from a failure
- access all data from different services in the same way
All data in one path
Data and usage
- What kinds of data are stored?
- Structured events
- Application server’s request logs (!= access logs)
- Mutation logs of HBase datastore
- Task processing requests
- How are those data used?
- Data replication
- Protect service storage from unexpected access
- Abuser detection
- User action based processing
- Metrics generation
- Statistics
- Asynchronous task processing
Facts
- Kafka version: 0.10.0.1 + in-house patches and backports
- 140+ billion messages /day
- 38+ TB incoming data /day
- 3.5+ million messages /sec at peak
- tens of broker servers
- Single, multi-tenant cluster (+ secondary cluster for backup)
- Target latency (produce response time):
- < 1 ms for 50th %ile
- < 10 ms for 99th %ile
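Percentile targets like these can be checked directly against sampled response times. A minimal sketch of the nearest-rank method; the latency samples below are hypothetical, not LINE's real numbers:

```python
# Hypothetical check of produce-latency targets against sampled response times.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    # Nearest-rank: the pct-th percentile is the value at ceil-like rank pct% of n.
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative samples only; a real check would use broker-reported latencies.
latencies_ms = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.2, 2.5, 9.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50 < 1.0, p99 < 10.0)
```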
Monitoring
- Prometheus + Grafana
- ref: https://www.slideshare.net/kawamuray/monitoring-kafka-w-prometheus
- jmx_exporter and node_exporter
- for Kafka broker metrics
- simpleclient_java
- for Producer/Consumer metrics
- kafka-consumer-group-exporter
- for general consumer monitoring
- https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
- Templated Grafana dashboard for generalized consumer monitoring
- Consumer owners can start monitoring their consumers’ health immediately after deployment
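Wiring the exporters above into Prometheus is just a scrape-config entry per target. A minimal sketch; the job names, hostnames, and ports here are placeholders, not LINE's actual setup:

```yaml
scrape_configs:
  - job_name: kafka-broker          # jmx_exporter javaagent on each broker (port is a placeholder)
    static_configs:
      - targets: ['broker1:7071']
  - job_name: node                  # node_exporter for OS-level metrics (9100 is its default port)
    static_configs:
      - targets: ['broker1:9100']
  - job_name: kafka-consumer-group  # kafka-consumer-group-exporter (host/port are placeholders)
    static_configs:
      - targets: ['monitor-host:9208']
```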
Hello, experts :D
Advanced observations
- Apache Kafka performs really well. However...
- It relies heavily on various system primitives:
- page cache
- https://kafka.apache.org/documentation/#design_filesystem
- sendfile(2) (linux)
- https://kafka.apache.org/documentation/#maximizingefficiency
- FileRecords#writeTo: bytesTransferred = channel.transferTo(position, count, destChannel);
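The `channel.transferTo` call above is what ends up as sendfile(2): bytes move from the file's page cache straight to the socket without a round trip through user space. A sketch of the same zero-copy path at the OS level (not Kafka's code), using Python's `os.sendfile` and assuming Linux:

```python
import os
import socket
import tempfile

# Sketch of the zero-copy path Kafka relies on: sendfile(2) moves bytes from a
# file straight to a socket without copying through user-space buffers.
payload = b"log-segment-bytes" * 64

with tempfile.NamedTemporaryFile() as segment:
    segment.write(payload)
    segment.flush()
    # A connected socket pair stands in for a consumer connection.
    src, dst = socket.socketpair()
    sent = os.sendfile(src.fileno(), segment.fileno(), 0, len(payload))
    src.close()  # close writer so the reader sees EOF
    received = b""
    while True:
        chunk = dst.recv(65536)
        if not chunk:
            break
        received += chunk
    dst.close()
    print(sent == len(payload), received == payload)
```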
Observing Kafka’s performance in detail
- Sometimes built-in metrics and JVM tools (jstack, jmap, jvisualvm) aren’t enough to learn everything
- e.g., reading from disk is unusual in our cluster
- it happens only when some consumer falls behind and starts requesting old offsets
- Dive into OS metrics
- node_disk_bytes_read (from node_exporter)
The problem is...
- Slow responses for FetchConsumer requests caused by disk reads can delay other responses sent by the same network processor thread
- The disk read can happen “transparently” inside the sendfile(2) syscall
- => not observable at the application (Kafka) level
- Some consumer implementations, like old storm-kafka, don’t checkpoint consumption offsets to the broker
- => the currently consumed offset can’t be observed from the broker side
Dive into the kernel - SystemTap
- SystemTap
- A tool for dynamic tracing on the Linux operating system
- Can hook arbitrary execution points inside the Linux kernel
- Much lower overhead compared to other traditional tools
- Similar tools:
- DTrace
- eBPF
Observing syscall duration
global s
global records
global sampled

probe begin {
    printf("Start observing syscall %s duration... Press C-c to exit\n", @1)
}

probe syscall.$1 {
    # Only trace the target process given via stap -x PID
    if (pid() != target()) next
    s[tid()] = gettimeofday_us()
}

probe syscall.$1.return {
    # Skip returns whose entry we never recorded (probe attached mid-syscall)
    if (!(tid() in s)) next
    elapsed = gettimeofday_us() - s[tid()]
    delete s[tid()]
    records <<< elapsed
    sampled++
}

probe timer.s(1) {
    printf("Sampled %d syscalls...\n", sampled)
}

probe end {
    printf("syscall %s duration(us):\n", @1)
    print(@hist_log(records))
}
- syscall-duration.stp
$ stap -x PID syscall-duration.stp sendfile
Start observing syscall sendfile duration... Press C-c to exit
...
Sampled 307489 syscalls...
^Csyscall sendfile duration(us):
value |-------------------------------------------------- count
0 | 0
1 | 0
2 |@@@@@@@ 36801
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 230072
8 |@@@@@@@@ 40797
16 |@ 8550
32 | 535
64 | 407
128 | 35
256 | 7
512 | 0
1024 | 0
2048 | 3
4096 | 1
8192 | 3
16384 | 0
32768 | 1
65536 | 1
131072 | 0
262144 | 0
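The @hist_log output above groups samples into power-of-two buckets. To make it easier to read, here is a rough re-implementation of that bucketing in Python (not part of the talk; the durations are hypothetical):

```python
from collections import Counter

# Sketch of SystemTap's @hist_log bucketing: each sample is counted under the
# largest power of two that is <= the value.
def hist_log(samples):
    buckets = Counter()
    for value in samples:
        bucket = 1
        while bucket * 2 <= value:
            bucket *= 2
        buckets[bucket if value >= 1 else 0] += 1
    return buckets

# Hypothetical sendfile durations in microseconds.
durations_us = [3, 4, 5, 6, 7, 9, 17, 2500, 40000]
for bound, count in sorted(hist_log(durations_us).items()):
    print(f"{bound:>6} | {count}")
```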
Observing slow sendfile(2)
global s

probe syscall.sendfile {
    # Only trace the target process given via stap -x PID
    if (pid() != target()) next
    s[tid()] = gettimeofday_us()
}

probe syscall.sendfile.return {
    # Skip returns whose entry we never recorded
    if (!(tid() in s)) next
    elapsed = gettimeofday_us() - s[tid()]
    delete s[tid()]
    if (elapsed >= $1) {
        printf("sendfile took %d us: dest=%d src=%d offset=%d size=%d\n",
               elapsed, $out_fd, $in_fd, $offset, $count)
    }
}
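The same "report only the slow calls" filtering can be done from user space when you control the call site, though unlike SystemTap it cannot see inside an uninstrumented process. A rough Python sketch, timing write(2) against a hypothetical threshold:

```python
import os
import time
import tempfile

# User-space analogue of the stap filter above: time each write(2) and report
# only calls slower than a threshold. The threshold here is illustrative.
THRESHOLD_US = 1000

def timed_write(fd, data):
    start = time.monotonic()
    written = os.write(fd, data)
    elapsed_us = (time.monotonic() - start) * 1_000_000
    if elapsed_us >= THRESHOLD_US:
        print(f"write took {elapsed_us:.0f} us: fd={fd} size={len(data)}")
    return written

with tempfile.TemporaryFile() as f:
    n = timed_write(f.fileno(), b"x" * 4096)
    print(n)
```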
- Now we know most sendfile calls complete within a few microseconds
- Then what is the slow sendfile doing?
- sendfile-trace.stp
$ stap -x PID sendfile-trace.stp 1000
sendfile took 2517 us: dest=11388 src=4427 offset=139647745139744 size=724183
sendfile took 11784 us: dest=11388 src=4415 offset=139647745139744 size=183748
sendfile took 5961 us: dest=11388 src=4427 offset=139647745139744 size=847618
Finding who is reading disk
$ ls -l /proc/PID/fd/{4415,11388}
lrwx------ 1 user user 64 May 19 14:08 /proc/PID/fd/4415 -> /path/to/kafka-data/topic-foo-bar-9/00000000077927876918.log
lrwx------ 1 user user 64 May 19 14:01 /proc/PID/fd/11388 -> socket:[311161013]
$ netstat -npe | grep 311161013
tcp 0 0 LOCAL_IP:12345 REMOTE_IP:60258 ESTABLISHED 123 311161013 PID/java
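The `/proc/PID/fd` lookup above can be scripted: each entry is a symlink to the file or socket behind that descriptor. A minimal sketch, assuming Linux, demonstrated on the script's own process:

```python
import os
import tempfile

# Each /proc/PID/fd entry is a symlink to the file or socket behind the fd;
# resolving it is what `ls -l /proc/PID/fd/N` does.
def resolve_fd(pid, fd):
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# Demonstrate on our own process: open a file and resolve its descriptor.
with tempfile.NamedTemporaryFile() as f:
    target = resolve_fd(os.getpid(), f.fileno())
    print(target == f.name)
```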
- Gotcha!
- This technique was leveraged when I worked on KAFKA-4614
- https://issues.apache.org/jira/browse/KAFKA-4614
Conclusion
- However…
- Kafka is incredibly stable and performant under sane traffic
- Even with the default config!
- Kafka serves 140+ billion daily messages with only tens of servers
- Having a single data hub makes:
- inter-service collaboration easier
- workloads isolated
- the data format destination-agnostic
- data more reusable
- resource utilization more efficient
- When going into deeper analysis, an OS-level tool like SystemTap helps you a lot
Small notice...
- How many of you were at Kafka Summit NY…?
- Kafka Summit SF will be held on 8/28 in San Francisco
- https://kafka-summit.org/events/kafka-summit-sf/
- https://kafka-summit.org/sessions/single-data-hub-services-feed-100-billion-messages-per-day/
- I’m going to give a presentation about our Kafka engineering chronicle
- Come and join us!
End of presentation. Questions?