Big Data Logging Pipeline with Apache Spark and Kafka

Shipping YaaS Logs with Apache Spark and Kafka

Dogukan Sonmez
Senior Software Engineer @ hybris Software
@dogukansonmez

Agenda

- Introduction to YaaS
- Architecture of the logging pipeline
- Technology behind the logging pipeline
- Challenges
- Recap
- Q&A

What is YaaS

SAP hybris as a Service (YaaS)

A microservice-based business PaaS

Integrated with hybris and SAP solutions

Build. Publish. Fast.

yaas.io

Architecture of the logging pipeline

Technology behind the logging pipeline: Apache Kafka

- High-throughput messaging
- Distributed brokers
- Scalable
- Fault tolerant
- Topics and partitions
- Replicated
- Offsets
Technology behind the logging pipeline: Apache Spark

- Micro-batching with RDDs
- Streaming
- DAG execution
- Reliable
- Scalable
- Fast
- ML and graph libraries
Big Data pipeline challenges

Reliability of Kafka

BEFORE:
- 3 brokers
- 3 ZooKeeper instances
- default.replication.factor=2
- Mostly default configuration otherwise

AFTER:
- 5 brokers
- 5 ZooKeeper instances
- unclean.leader.election.enable=false
- min.insync.replicas=2
- default.replication.factor=3
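A sketch of how the "after" settings can be applied. The replication factor and the min.insync.replicas / unclean.leader.election.enable defaults are broker-side settings in server.properties, but the same guarantees can be set per topic with the Kafka AdminClient; the topic name yaas-logs, the 12 partitions, and the broker list below are assumptions for illustration:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.jdk.CollectionConverters._

object ReliableTopicSetup {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
      "broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092")
    val admin = AdminClient.create(props)

    // topic-level overrides mirroring the broker-side settings from the slide
    val topicConfig = Map(
      "min.insync.replicas"            -> "2",     // a write needs at least 2 in-sync replicas
      "unclean.leader.election.enable" -> "false"  // never elect an out-of-sync replica as leader
    ).asJava

    // 3 replicas per partition, matching default.replication.factor=3
    val topic = new NewTopic("yaas-logs", 12, 3.toShort).configs(topicConfig)
    admin.createTopics(List(topic).asJava).all().get()
    admin.close()
  }
}
```

With a replication factor of 3 and min.insync.replicas=2, a producer using acks=all can tolerate the loss of one replica without either stalling or silently accepting under-replicated writes.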

Big Data pipeline challenges

Spark Streaming checkpointing

BEFORE:
- Spark checkpointing
- All RDDs serialized and stored in HDFS

AFTER:
- Custom Kafka checkpointing (only the latest offsets stored in Kafka)
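A minimal sketch of the "after" approach: instead of checkpointing serialized RDDs to HDFS, the job tracks the offset ranges it has processed and commits only the latest offsets back to Kafka. This follows the standard spark-streaming-kafka-0-10 pattern; the topic, consumer group, batch interval, and the placeholder for the Elasticsearch write are assumptions, and the talk's actual custom checkpointing may differ in detail:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("yaas-log-shipper"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-shipper",
      "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets are committed manually below
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("yaas-logs"), kafkaParams))

    stream.foreachRDD { rdd =>
      // capture the offset ranges of this micro-batch before any transformation
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here, e.g. index it into Elasticsearch ...
      // commit only the latest offsets back to Kafka instead of checkpointing RDDs to HDFS
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```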

Big Data pipeline challenges

Indexing big data in Elasticsearch

BEFORE:
- Default mapping
- index.refresh_interval=1s
- indices.memory.index_buffer_size=10%

AFTER:
- Custom mapping with norms disabled
- Mapping using the simple analyzer
- index.refresh_interval=30s
- indices.memory.index_buffer_size=30%
- spark.streaming.kafka.maxRatePerPartition=10000
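A sketch of the "after" tuning. The index settings and mapping (slower refresh, norms disabled, simple analyzer) would be applied to the log index or an index template via the Elasticsearch API, while the ingest rate is capped on the Spark side; the index body layout, field names, and the elasticsearch-hadoop es.* option are assumptions for illustration, and indices.memory.index_buffer_size: 30% is a node-level setting that belongs in elasticsearch.yml rather than here:

```scala
import org.apache.spark.SparkConf

object EsTuningSketch {
  // settings and mapping for the log index (hypothetical index layout and field names)
  val logIndexBody: String =
    """{
      |  "settings": { "index": { "refresh_interval": "30s" } },
      |  "mappings": {
      |    "properties": {
      |      "message":   { "type": "text", "analyzer": "simple", "norms": false },
      |      "service":   { "type": "keyword" },
      |      "timestamp": { "type": "date" }
      |    }
      |  }
      |}""".stripMargin

  // Spark side: cap how many records each Kafka partition contributes per micro-batch
  val conf: SparkConf = new SparkConf()
    .setAppName("yaas-log-indexer")
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")
    .set("es.nodes", "es1:9200")  // elasticsearch-hadoop connector target (assumed host)
}
```

Disabling norms drops the length-normalization data used for relevance scoring, which log search rarely needs, and the simple analyzer avoids heavier analysis chains, so each indexed document carries less overhead.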

Recap

Q&A

https://hackingat.hybris.com
