Lessons Learned from Building an Apache Kafka Managed Service

Ben Slater, Chief Product Officer



Introduction

• Over 30 million node-hours of experience managing Cassandra, Spark, Elasticsearch and Kafka

• Our platform provides automated provisioning, monitoring and management

• Available on AWS, GCP, Azure and IBM Cloud

• Managed Apache Kafka released in May this year

• Elasticsearch Early Access Program currently underway

Agenda

• Context - our offering and development process

• Hardware choice and benchmarking

• Topic and user management

• Broker security configuration

• Monitoring

• Backup and Restore


Instaclustr Managed Kafka - Key Features

• Currently available:
  o Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
  o SOC2 compliant
  o Broker level, topic level and synthetic transaction monitoring
  o Instaclustr monitoring and provisioning API support
  o Private network clusters (AWS only)
  o Run in your cloud provider account or ours
  o Topic management via a custom CLI tool
  o User & credential management
  o Kafka 1.1, 2.0, or 2.1


Instaclustr Managed Kafka - Features in Dev

• In dev:

o Managed Kafka REST API

o Managed Service Registry

Expected Next Year

o Managed Kafka Connect


Instaclustr Managed Kafka - Development Process

• First customer requests 2016

• Internal infrastructure deployment and usage of Kafka mid 2017

• Managed service platform development commenced November 2017

• Early access program with 4 customers commenced December 2017

• Public preview release 21 May 2018

• GA 25 June 2018

• Currently running 21 clusters and ~100 nodes


Hardware Choice and Benchmarking - GP2 vs ST1

• Disk Type
  o AWS benchmark - r4.large with 500GB disks
    § 1 x 500GB ST1 volume
    § 10 x 50GB GP2 volumes in RAID0 configuration
  o Avg 10% improved throughput with ST1 vs GP2 EBS
  o ST1 is 45% of the cost of GP2

Type   Writes (msg/s)   Reads (msg/s)   Mixed (msg/s)
ST1    223,851          149,506         W: 171,305 / R: 49,898
GP2    203,409          127,127         W: 162,966 / R: 44,869
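The comparison in the table can be reproduced with a few lines of arithmetic. The sketch below takes the throughput figures and the 45% relative cost from the slide. The "throughput per cost" ratio is an illustrative derived measure, not something reported in the talk, and the unit is assumed to be messages per second.

```python
# Throughput figures (assumed to be messages per second) copied from the table.
RESULTS = {
    "ST1": {"writes": 223_851, "reads": 149_506,
            "mixed_writes": 171_305, "mixed_reads": 49_898},
    "GP2": {"writes": 203_409, "reads": 127_127,
            "mixed_writes": 162_966, "mixed_reads": 44_869},
}

# From the slide: ST1 is 45% of the cost of GP2.
ST1_RELATIVE_COST = 0.45


def percent_gain(st1: float, gp2: float) -> float:
    """Return how much higher the ST1 figure is than the GP2 figure, in percent."""
    return (st1 / gp2 - 1.0) * 100.0


gains = {
    workload: percent_gain(RESULTS["ST1"][workload], RESULTS["GP2"][workload])
    for workload in RESULTS["ST1"]
}

# Illustrative derived measure: throughput per unit of disk spend, with GP2 = 1.0.
throughput_per_cost = {
    workload: (1.0 + gain / 100.0) / ST1_RELATIVE_COST
    for workload, gain in gains.items()
}

if __name__ == "__main__":
    for workload, gain in gains.items():
        print(f"{workload:13s} ST1 vs GP2: {gain:+5.1f}%  "
              f"throughput per cost: {throughput_per_cost[workload]:.2f}x")
```

Every workload in the table comes out ahead on ST1, and the cost ratio widens the gap further when throughput is normalised by disk spend.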

[Figure: benchmark charts for ST1 and GP2]

Hardware Choice and Benchmarking - SSL vs non-SSL

• Encryption enabled on broker-to-broker and client-to-broker
  o AWS benchmark - r4.large with 1500GB ST1 disk
  o 512 byte messages
  o ~30% decrease in throughput with Broker and Client SSL enabled

• Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
  o 50% increased throughput in writes
  o 80% increased throughput in reads


Hardware Choice and Benchmarking – Number of Topics

• Possible urban myth that increasing topics reduces performance

• However, more topics = more partitions

• Significantly slows recovery time from node failure
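To make the "more topics = more partitions" point concrete, here is a rough sketch of how many partition replicas each broker ends up hosting as the topic count grows. The topic counts are the ones benchmarked in this section. Partitions per topic, replication factor and broker count are illustrative assumptions, and replicas are assumed to be spread evenly.

```python
def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas per broker, assuming an even spread."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return total_replicas / brokers


TOPIC_COUNTS = [10, 100, 500, 1000]  # topic counts benchmarked in this section
PARTITIONS_PER_TOPIC = 3             # assumption, not from the talk
REPLICATION_FACTOR = 3               # assumption, not from the talk
BROKERS = 6                          # assumption, not from the talk

per_broker = {
    topics: replicas_per_broker(topics, PARTITIONS_PER_TOPIC,
                                REPLICATION_FACTOR, BROKERS)
    for topics in TOPIC_COUNTS
}

if __name__ == "__main__":
    for topics, replicas in per_broker.items():
        print(f"{topics:5d} topics -> {replicas:7.0f} replicas per broker")
```

Each replica on a failed node has to be re-elected or re-synced, which is one plausible reading of why recovery slows as the count rises.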

[Figure: benchmark charts for 10, 100, 500 and 1000 topics]


Hardware Choice and Benchmarking – Colocated Zookeeper

• Often recommended to host Zookeeper separately to Kafka

• However, recent changes have significantly reduced load on Zookeeper from Kafka

o Consumer offsets are no longer stored in Zookeeper

• Our benchmarking showed no measurable difference in performance, at least for smaller clusters

• Still some benefits in failure scenarios - we will likely offer separate Zookeeper at some stage next year


Hardware Choice and Benchmarking – Colocated Zookeeper

• 6 node cluster with broker restart
  o Similar results with dedicated Zookeeper disk vs. shared

[Figure: Consumer Rate - Separate; Consumer Rate - Colocated]


Topic and User Configuration Management

• Kafka utilities require direct access to Zookeeper

• Zookeeper does not have a robust external security model

• Felt that providing access to Zookeeper was a risk

• Solutions
  o Developed command line tool to use Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
    § Future: Console UI support?
    § Value topic configuration versioning and management
  o Adding user management to Instaclustr Console
    § Additional authentication required


Broker Security Configuration

• Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
  o Used for client->broker
  o Broker->broker uses SASL plaintext

• Using SASL plaintext authentication
  o Used for broker->broker
  o Were planning on integrating SCRAM authentication, but dynamic configuration still requires broker restart
  o Instead planning on short-lived signed broker keys as dynamic configuration does not require restart
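For readers unfamiliar with SCRAM, the sketch below shows the key derivation and proof check defined in RFC 5802, using SHA-256 and the Python standard library. It is a simplified illustration of the mechanism rather than Kafka's or Instaclustr's implementation: password normalisation (SASLprep) is omitted, and the `auth_message` value is a placeholder for the concatenated handshake messages.

```python
import hashlib
import hmac
import os


def derive_scram_keys(password: str, salt: bytes, iterations: int):
    """Derive (client_key, stored_key, server_key) as in RFC 5802, with SHA-256."""
    # SaltedPassword = Hi(password, salt, i), which is PBKDF2 with HMAC as the PRF.
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return client_key, stored_key, server_key


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def client_proof(client_key: bytes, stored_key: bytes, auth_message: bytes) -> bytes:
    """ClientProof = ClientKey XOR HMAC(StoredKey, AuthMessage)."""
    signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    return xor_bytes(client_key, signature)


def server_verifies(stored_key: bytes, auth_message: bytes, proof: bytes) -> bool:
    """Recover ClientKey from the proof and check that H(ClientKey) == StoredKey."""
    signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    recovered_client_key = xor_bytes(proof, signature)
    return hmac.compare_digest(
        hashlib.sha256(recovered_client_key).digest(), stored_key
    )


if __name__ == "__main__":
    salt = os.urandom(16)
    client_key, stored_key, _ = derive_scram_keys("example-password", salt, 4096)
    auth_message = b"placeholder-for-handshake-messages"  # not a real exchange
    proof = client_proof(client_key, stored_key, auth_message)
    print("proof accepted:", server_verifies(stored_key, auth_message, proof))
```

The server stores only the salt, iteration count, StoredKey and ServerKey, so the password itself is never sent or stored.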


Broker Security Configuration

• Access to managed clusters
  o Public IPs and whitelisting in firewall (security group or equivalent)
  o Private IPs with VPC Peering (or equivalent in other cloud providers)
  o Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
  o Don’t expose Zookeeper through firewall due to weak security model
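The allowlisting rule above can be expressed as a small check. This is only a sketch of the policy, not how any cloud firewall is configured. The networks are hypothetical documentation-only ranges, and the ports are the Kafka and Zookeeper defaults, which a real cluster may change.

```python
import ipaddress

KAFKA_PORT = 9092      # Kafka's default listener port; a real cluster may differ
ZOOKEEPER_PORT = 2181  # Zookeeper's default client port

# Hypothetical allowlist of client networks (documentation-only address ranges).
ALLOWED_CLIENT_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(source_ip: str, port: int) -> bool:
    """Allow Kafka traffic from allowlisted networks only; never expose Zookeeper."""
    if port == ZOOKEEPER_PORT:
        return False  # Zookeeper is never reachable from outside
    if port != KAFKA_PORT:
        return False  # anything other than the Kafka listener stays closed
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_CLIENT_NETWORKS)


if __name__ == "__main__":
    for ip, port in [("203.0.113.5", KAFKA_PORT),
                     ("203.0.113.5", ZOOKEEPER_PORT),
                     ("192.0.2.1", KAFKA_PORT)]:
        print(ip, port, "allowed" if is_allowed(ip, port) else "denied")
```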


Monitoring

• Metrics exposed via JMX
  o Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana

• Exposing broker-level and per-topic metrics

• Alerting
  o Basics: service state, disk usage/free space, server still exists
  o Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
    § Active controller very sensitive, are re-assessing alert thresholds
  o Synthetic transactions: publish and consume message to controlled topic, measure success and latency
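The Kafka metric alerts listed above could be evaluated along these lines. In Kafka these values are exposed as the JMX gauges `kafka.controller:type=KafkaController,name=ActiveControllerCount`, `kafka.controller:type=KafkaController,name=OfflinePartitionsCount` and `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`. The dictionary keys and the function below are hypothetical names for illustration, not part of the Instaclustr platform.

```python
from typing import Dict, List


def evaluate_alerts(broker_metrics: Dict[str, Dict[str, int]]) -> List[str]:
    """Return alert messages for a cluster, given per-broker metric readings.

    `broker_metrics` maps a broker name to its readings; the keys are
    hypothetical short names for the JMX gauges named in the text.
    """
    alerts = []

    # Summed across brokers, exactly one should report itself as controller.
    active_controllers = sum(
        metrics["active_controller_count"] for metrics in broker_metrics.values()
    )
    if active_controllers != 1:
        alerts.append(f"active controllers = {active_controllers}, expected 1")

    for broker, metrics in sorted(broker_metrics.items()):
        if metrics["offline_partitions_count"] > 0:
            alerts.append(
                f"{broker}: {metrics['offline_partitions_count']} offline partitions"
            )
        if metrics["under_replicated_partitions"] > 0:
            alerts.append(
                f"{broker}: {metrics['under_replicated_partitions']} "
                "under-replicated partitions"
            )
    return alerts


if __name__ == "__main__":
    sample = {
        "broker-1": {"active_controller_count": 1,
                     "offline_partitions_count": 0,
                     "under_replicated_partitions": 0},
        "broker-2": {"active_controller_count": 0,
                     "offline_partitions_count": 0,
                     "under_replicated_partitions": 4},
    }
    for alert in evaluate_alerts(sample):
        print(alert)
```

Because the controller count can briefly read 0 or 2 during a controller election, a production rule would probably require the condition to persist before firing, which matches the note about re-assessing thresholds.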


Backup & Restore

• Internet wisdom = Kafka backups are not a thing

• Rely on replication within cluster or MirrorMaker replication to another cluster

• Cassandra experience says backups are valuable
  o Hardware failure is not an issue but corruption due to app bugs or user error can occur and be spread by replication
  o Simultaneous hardware failure can happen

• Current Solution
  o Regular automated backup and restore of topic and security configuration
  o Consider using Kafka Connect to write important messages to offline backup
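A configuration backup of the kind described could look roughly like the sketch below. The snapshot format, function names and sample topic are all hypothetical. The talk does not describe the actual format, and fetching or applying configuration against a live cluster is left out.

```python
import json
from typing import Dict

# Hypothetical snapshot shape: topic name -> partitions, replication factor, overrides.
TopicSnapshot = Dict[str, dict]


def backup_topics(topics: TopicSnapshot) -> str:
    """Serialise topic definitions to a versioned JSON document."""
    return json.dumps({"version": 1, "topics": topics}, sort_keys=True, indent=2)


def restore_topics(document: str) -> TopicSnapshot:
    """Parse a snapshot produced by backup_topics, rejecting unknown versions."""
    payload = json.loads(document)
    if payload.get("version") != 1:
        raise ValueError("unsupported snapshot version")
    return payload["topics"]


def topics_to_recreate(snapshot: TopicSnapshot, live: TopicSnapshot) -> TopicSnapshot:
    """Topics in the snapshot that are missing from, or differ in, the live cluster."""
    return {name: spec for name, spec in snapshot.items() if live.get(name) != spec}


if __name__ == "__main__":
    snapshot = {
        "orders": {"partitions": 6, "replication_factor": 3,
                   "config": {"retention.ms": "604800000"}},
    }
    document = backup_topics(snapshot)
    restored = restore_topics(document)
    print(topics_to_recreate(restored, live={}))
```

This covers only metadata. Message data would still rely on replication or an offline copy such as the Kafka Connect option mentioned above.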

Give it a try!

14-day free trial option (no CC needed) - www.instaclustr.com