Kafka Meetup JP #3 - Engineering Apache Kafka at LINE
TRANSCRIPT
Engineering Apache Kafka at LINE
Yuto Kawamura
Speaker introduction
- Name: Yuto Kawamura (kawamuray)
- Software engineer @ LINE
- Designing and implementing inter-service data passing infrastructure
- ❤ Apache Kafka
- Apache Kafka contributor
- KAFKA-4614 Improved broker’s response time
- KAFKA-4024 Removed unnecessary blocking behavior of producer
- Past publications
- Applying Kafka Streams for internal message delivery pipeline
- https://engineering.linecorp.com/en/blog/detail/80
- Monitoring Apache Kafka w/ Prometheus
- https://www.slideshare.net/kawamuray/monitoring-kafka-w-prometheus
Apache Kafka - a new layer for service isolation
Application server
Another app server
Stats system
Abuser detection system
Produce once: unified data format, general info
A failed destination can resume consuming data after it recovers.
Apache Kafka
- Publishers can stay agnostic about:
- who uses the data
- remote failures
- Subscribers can:
- stay isolated from unexpectedly high workload on the publisher side
- stop consuming temporarily and continue after recovering from a failure
- access all data from different services in the same way
All data in one path
Data and usage
- What kinds of data are stored?
- Structured events
- Application server’s request logs (!= access logs)
- Mutation logs of HBase datastore
- Task processing requests
- How are those data used?
- Data replication
- Protect service storage from unexpected access
- Abuser detection
- User action based processing
- Metrics generation
- Statistics
- Asynchronous task processing
Facts
- Kafka version: 0.10.0.1 + in-house patches and backports
- 140+ billion messages /day
- 38+ TB incoming data /day
- 3.5+ million messages /sec at peak
- tens of broker servers
- Single, multi-tenant cluster (+ secondary cluster for backup)
- Target latency (produce response time):
- < 1 ms for 50th %ile
- < 10 ms for 99th %ile
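Percentile targets like these can be checked directly against sampled response times. A minimal sketch of the nearest-rank method; the latency samples below are hypothetical, not LINE's real numbers:

```python
# Hypothetical check of produce-latency targets against sampled response times.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    # Nearest-rank: the pct-th percentile is the value at ceil-like rank pct% of n.
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative samples only; a real check would use broker-reported latencies.
latencies_ms = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.2, 2.5, 9.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50 < 1.0, p99 < 10.0)
```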
Monitoring
- Prometheus + Grafana
- ref: https://www.slideshare.net/kawamuray/monitoring-kafka-w-prometheus
- jmx_exporter and node_exporter
- for Kafka broker metrics
- simpleclient_java
- for Producer/Consumer metrics
- kafka-consumer-group-exporter
- for general consumer monitoring
- https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
- Templated Grafana dashboard for generalized consumer monitoring
- Consumer owners can start monitoring their consumers’ health immediately after deployment
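Wiring the exporters above into Prometheus is just a scrape-config entry per target. A minimal sketch; the job names, hostnames, and ports here are placeholders, not LINE's actual setup:

```yaml
scrape_configs:
  - job_name: kafka-broker          # jmx_exporter javaagent on each broker (port is a placeholder)
    static_configs:
      - targets: ['broker1:7071']
  - job_name: node                  # node_exporter for OS-level metrics (9100 is its default port)
    static_configs:
      - targets: ['broker1:9100']
  - job_name: kafka-consumer-group  # kafka-consumer-group-exporter (host/port are placeholders)
    static_configs:
      - targets: ['monitor-host:9208']
```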
Hello, experts :D
Advanced observations
- Apache Kafka performs really well. However...
- It relies heavily on various system primitives:
- page cache
- https://kafka.apache.org/documentation/#design_filesystem
- sendfile(2) (linux)
- https://kafka.apache.org/documentation/#maximizingefficiency
- FileRecords#writeTo: bytesTransferred = channel.transferTo(position, count, destChannel);
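The `channel.transferTo` call above is what ends up as sendfile(2): bytes move from the file's page cache straight to the socket without a round trip through user space. A sketch of the same zero-copy path at the OS level (not Kafka's code), using Python's `os.sendfile` and assuming Linux:

```python
import os
import socket
import tempfile

# Sketch of the zero-copy path Kafka relies on: sendfile(2) moves bytes from a
# file straight to a socket without copying through user-space buffers.
payload = b"log-segment-bytes" * 64

with tempfile.NamedTemporaryFile() as segment:
    segment.write(payload)
    segment.flush()
    # A connected socket pair stands in for a consumer connection.
    src, dst = socket.socketpair()
    sent = os.sendfile(src.fileno(), segment.fileno(), 0, len(payload))
    src.close()  # close writer so the reader sees EOF
    received = b""
    while True:
        chunk = dst.recv(65536)
        if not chunk:
            break
        received += chunk
    dst.close()
    print(sent == len(payload), received == payload)
```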
Observing Kafka’s performance in detail
- Sometimes built-in metrics and JVM tools (jstack, jmap, jvisualvm) aren’t enough to learn everything
- e.g., reading from disk is unusual in our cluster
- it happens only when some consumer falls behind and starts requesting old offsets
- Dive into OS metrics
- node_disk_bytes_read (from node_exporter)
The problem is...
- Slow responses for FetchConsumer requests caused by disk reads can delay other responses sent by the same network processor thread
- The disk read can happen “transparently” inside the sendfile(2) syscall
- => not observable at the application (Kafka) level
- Some consumer implementations, like old storm-kafka, don’t checkpoint consumption offsets to the broker
- => the currently consumed offset can’t be observed from the broker side
Dive into the kernel - SystemTap
- SystemTap
- A tool for dynamic tracing on the Linux operating system
- Can hook arbitrary execution points inside the Linux kernel
- Much lower overhead compared to other traditional tools
- Similar tools:
- DTrace
- eBPF
Observing syscall duration
global s
global records
global sampled

probe begin {
    printf("Start observing syscall %s duration... Press C-c to exit\n", @1)
}

probe syscall.$1 {
    # Only trace the target process given via stap -x PID
    if (pid() != target()) next
    s[tid()] = gettimeofday_us()
}

probe syscall.$1.return {
    # Skip returns whose entry we never recorded (probe attached mid-syscall)
    if (!(tid() in s)) next
    elapsed = gettimeofday_us() - s[tid()]
    delete s[tid()]
    records <<< elapsed
    sampled++
}

probe timer.s(1) {
    printf("Sampled %d syscalls...\n", sampled)
}

probe end {
    printf("syscall %s duration(us):\n", @1)
    print(@hist_log(records))
}
- syscall-duration.stp
$ stap -x PID syscall-duration.stp sendfile
Start observing syscall sendfile duration... Press C-c to exit
...
Sampled 307489 syscalls...
^Csyscall sendfile duration(us):
value |-------------------------------------------------- count
0 | 0
1 | 0
2 |@@@@@@@ 36801
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 230072
8 |@@@@@@@@ 40797
16 |@ 8550
32 | 535
64 | 407
128 | 35
256 | 7
512 | 0
1024 | 0
2048 | 3
4096 | 1
8192 | 3
16384 | 0
32768 | 1
65536 | 1
131072 | 0
262144 | 0
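The @hist_log output above groups samples into power-of-two buckets. To make it easier to read, here is a rough re-implementation of that bucketing in Python (not part of the talk; the durations are hypothetical):

```python
from collections import Counter

# Sketch of SystemTap's @hist_log bucketing: each sample is counted under the
# largest power of two that is <= the value.
def hist_log(samples):
    buckets = Counter()
    for value in samples:
        bucket = 1
        while bucket * 2 <= value:
            bucket *= 2
        buckets[bucket if value >= 1 else 0] += 1
    return buckets

# Hypothetical sendfile durations in microseconds.
durations_us = [3, 4, 5, 6, 7, 9, 17, 2500, 40000]
for bound, count in sorted(hist_log(durations_us).items()):
    print(f"{bound:>6} | {count}")
```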
Observing slow sendfile(2)
global s

probe syscall.sendfile {
    # Only trace the target process given via stap -x PID
    if (pid() != target()) next
    s[tid()] = gettimeofday_us()
}

probe syscall.sendfile.return {
    # Skip returns whose entry we never recorded
    if (!(tid() in s)) next
    elapsed = gettimeofday_us() - s[tid()]
    delete s[tid()]
    if (elapsed >= $1) {
        printf("sendfile took %d us: dest=%d src=%d offset=%d size=%d\n",
               elapsed, $out_fd, $in_fd, $offset, $count)
    }
}
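The same "report only the slow calls" filtering can be done from user space when you control the call site, though unlike SystemTap it cannot see inside an uninstrumented process. A rough Python sketch, timing write(2) against a hypothetical threshold:

```python
import os
import time
import tempfile

# User-space analogue of the stap filter above: time each write(2) and report
# only calls slower than a threshold. The threshold here is illustrative.
THRESHOLD_US = 1000

def timed_write(fd, data):
    start = time.monotonic()
    written = os.write(fd, data)
    elapsed_us = (time.monotonic() - start) * 1_000_000
    if elapsed_us >= THRESHOLD_US:
        print(f"write took {elapsed_us:.0f} us: fd={fd} size={len(data)}")
    return written

with tempfile.TemporaryFile() as f:
    n = timed_write(f.fileno(), b"x" * 4096)
    print(n)
```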
- Now we know most sendfile calls complete within a few microseconds
- Then what is the slow sendfile doing?
- sendfile-trace.stp
$ stap -x PID sendfile-trace.stp 1000
sendfile took 2517 us: dest=11388 src=4427 offset=139647745139744 size=724183
sendfile took 11784 us: dest=11388 src=4415 offset=139647745139744 size=183748
sendfile took 5961 us: dest=11388 src=4427 offset=139647745139744 size=847618
Finding who is reading disk
$ ls -l /proc/PID/fd/{4415,11388}
lrwx------ 1 user user 64 May 19 14:08 /proc/PID/fd/4415 -> /path/to/kafka-data/topic-foo-bar-9/00000000077927876918.log
lrwx------ 1 user user 64 May 19 14:01 /proc/PID/fd/11388 -> socket:[311161013]
$ netstat -npe | grep 311161013
tcp 0 0 LOCAL_IP:12345 REMOTE_IP:60258 ESTABLISHED 123 311161013 PID/java
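The `/proc/PID/fd` lookup above can be scripted: each entry is a symlink to the file or socket behind that descriptor. A minimal sketch, assuming Linux, demonstrated on the script's own process:

```python
import os
import tempfile

# Each /proc/PID/fd entry is a symlink to the file or socket behind the fd;
# resolving it is what `ls -l /proc/PID/fd/N` does.
def resolve_fd(pid, fd):
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# Demonstrate on our own process: open a file and resolve its descriptor.
with tempfile.NamedTemporaryFile() as f:
    target = resolve_fd(os.getpid(), f.fileno())
    print(target == f.name)
```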
- Gotcha!
- This technique was leveraged when I worked on KAFKA-4614
- https://issues.apache.org/jira/browse/KAFKA-4614
Conclusion
- However…
- Kafka is incredibly stable and performant under sane traffic
- Even with the default config!
- Kafka serves 140+ billion daily messages with only tens of servers
- Having a single data hub makes:
- inter-service collaboration easier
- workloads isolated
- the data format destination-agnostic
- data more reusable
- resource utilization more efficient
- When going into deeper analysis, an OS-level tool like SystemTap helps you a lot
Small notice...
- How many of you were at Kafka Summit NY…?
- Kafka Summit SF will be held on 8/28 in San Francisco
- https://kafka-summit.org/events/kafka-summit-sf/
- https://kafka-summit.org/sessions/single-data-hub-services-feed-100-billion-messages-per-day/
- I’m going to give a presentation about our Kafka engineering chronicle
- Come and join us!
End of presentation. Questions?