Big Data Logging Pipeline with Apache Spark and Kafka

Shipping YaaS Logs with Apache Spark and Kafka

Dogukan Sonmez
Senior Software Engineer @ hybris Software
@dogukansonmez

Agenda

- Introduction to YaaS
- Architecture of the logging pipeline
- Technology behind the logging pipeline
- Challenges
- Recap
- Q&A

What is YaaS

SAP hybris as a Service (YaaS)

A microservice-based business PaaS

Integrated with hybris and SAP solutions

Build. Publish. Fast.

yaas.io

Architecture of the logging pipeline

Technology behind the logging pipeline: Apache Kafka

- High-throughput messaging
- Distributed brokers
- Scalable
- Fault tolerant
- Topics and partitions
- Replicated
- Offsets
Technology behind the logging pipeline: Apache Spark

- Micro-batching with RDDs
- Streaming
- DAG execution
- Reliable
- Scalable
- Fast
- ML and graph libraries
Big Data pipeline challenges

Reliability of Kafka

BEFORE:
- 3 brokers
- 3 ZooKeeper instances
- default.replication.factor=2
- Mostly default configuration otherwise

AFTER:
- 5 brokers
- 5 ZooKeeper instances
- unclean.leader.election.enable=false
- min.insync.replicas=2
- default.replication.factor=3
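A sketch of how the "after" settings can be applied. The replication factor and the min.insync.replicas / unclean.leader.election.enable defaults are broker-side settings in server.properties, but the same guarantees can be set per topic with the Kafka AdminClient; the topic name yaas-logs, the 12 partitions, and the broker list below are assumptions for illustration:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.jdk.CollectionConverters._

object ReliableTopicSetup {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
      "broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092")
    val admin = AdminClient.create(props)

    // topic-level overrides mirroring the broker-side settings from the slide
    val topicConfig = Map(
      "min.insync.replicas"            -> "2",     // a write needs at least 2 in-sync replicas
      "unclean.leader.election.enable" -> "false"  // never elect an out-of-sync replica as leader
    ).asJava

    // 3 replicas per partition, matching default.replication.factor=3
    val topic = new NewTopic("yaas-logs", 12, 3.toShort).configs(topicConfig)
    admin.createTopics(List(topic).asJava).all().get()
    admin.close()
  }
}
```

With a replication factor of 3 and min.insync.replicas=2, a producer using acks=all can tolerate the loss of one replica without either stalling or silently accepting under-replicated writes.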

Big Data pipeline challenges

Spark Streaming checkpointing

BEFORE:
- Spark checkpointing
- All RDDs serialized and stored in HDFS

AFTER:
- Custom Kafka checkpointing (only the latest offsets stored in Kafka)
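A minimal sketch of the "after" approach: instead of checkpointing serialized RDDs to HDFS, the job tracks the offset ranges it has processed and commits only the latest offsets back to Kafka. This follows the standard spark-streaming-kafka-0-10 pattern; the topic, consumer group, batch interval, and the placeholder for the Elasticsearch write are assumptions, and the talk's actual custom checkpointing may differ in detail:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("yaas-log-shipper"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-shipper",
      "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets are committed manually below
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("yaas-logs"), kafkaParams))

    stream.foreachRDD { rdd =>
      // capture the offset ranges of this micro-batch before any transformation
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here, e.g. index it into Elasticsearch ...
      // commit only the latest offsets back to Kafka instead of checkpointing RDDs to HDFS
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```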

Big Data pipeline challenges

Indexing big data in Elasticsearch

BEFORE:
- Default mapping
- index.refresh_interval=1s
- indices.memory.index_buffer_size=10%

AFTER:
- Custom mapping with norms disabled
- Mapping using the simple analyzer
- index.refresh_interval=30s
- indices.memory.index_buffer_size=30%
- spark.streaming.kafka.maxRatePerPartition=10000
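A sketch of the "after" tuning. The index settings and mapping (slower refresh, norms disabled, simple analyzer) would be applied to the log index or an index template via the Elasticsearch API, while the ingest rate is capped on the Spark side; the index body layout, field names, and the elasticsearch-hadoop es.* option are assumptions for illustration, and indices.memory.index_buffer_size: 30% is a node-level setting that belongs in elasticsearch.yml rather than here:

```scala
import org.apache.spark.SparkConf

object EsTuningSketch {
  // settings and mapping for the log index (hypothetical index layout and field names)
  val logIndexBody: String =
    """{
      |  "settings": { "index": { "refresh_interval": "30s" } },
      |  "mappings": {
      |    "properties": {
      |      "message":   { "type": "text", "analyzer": "simple", "norms": false },
      |      "service":   { "type": "keyword" },
      |      "timestamp": { "type": "date" }
      |    }
      |  }
      |}""".stripMargin

  // Spark side: cap how many records each Kafka partition contributes per micro-batch
  val conf: SparkConf = new SparkConf()
    .setAppName("yaas-log-indexer")
    .set("spark.streaming.kafka.maxRatePerPartition", "10000")
    .set("es.nodes", "es1:9200")  // elasticsearch-hadoop connector target (assumed host)
}
```

Disabling norms drops the length-normalization data used for relevance scoring, which log search rarely needs, and the simple analyzer avoids heavier analysis chains, so each indexed document carries less overhead.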

Recap

Q&A

https://hackingat.hybris.com
