back-pressure in action: handling high-burst workloads with akka streams & kafka

Back-Pressure in Action

Handling High-Burst Workloads with Akka Streams & Kafka

Akara Sucharitakul, PayPal

Anil Gursel, PayPal

Intro & Agenda

Crawler Intro & Problem Statements

Crawler Architecture

Infrastructure: Akka Streams, Kafka, etc.

The Goodies

High-Level View

[Diagram components: Crawl Jobs, Job DB, Validate, URLCache, Download Process; edges labeled URLs and Timestamps]

Requirements

Ever-expanding # of URLs

Can’t crawl all URLs at once

Control over concurrent web GETs

Efficient resource usage

Resilient under high burst

Scales horizontally & vertically

Sizing the Crawl Job

Let:

i = Number of crawl URLs in a job
n = Average number of links per page
d = The crawl depth (how many layers to follow links)
u = The max number of URLs to process

Then:

$u = i \cdot n^{d}$
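As a rough worked example of the sizing relation above, the following Scala sketch evaluates u = i * n^d for a few inputs. The object and method names (CrawlSizing, maxUrls) are illustrative assumptions and are not part of the crawler's code.

```scala
object CrawlSizing {
  // u = i * n^d: the estimated max number of URLs to process.
  // BigInt avoids overflow for large depths or seed counts.
  def maxUrls(initialUrls: Long, avgOutLinks: Long, depth: Int): BigInt =
    BigInt(initialUrls) * BigInt(avgOutLinks).pow(depth)

  def main(args: Array[String]): Unit = {
    // One seed URL, 5 links per page: growth is exponential in depth.
    (0 to 10 by 2).foreach { d =>
      println(s"depth=$d -> ${maxUrls(1, 5, d)} URLs")
    }
    // 100 seed URLs, 5 links per page on average, depth 5.
    println(s"i=100, n=5, d=5 -> ${maxUrls(100, 5, 5)} URLs")
  }
}
```

With i = 100, n = 5, and d = 5 the estimate is 312500, which matches the totalCrawled figure quoted for the back-pressure run later in the deck.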

[Chart: totalURLs vs depth (initialURLs = 1, outLinks = 5), logarithmic totalURLs axis]

[Chart: totalURLs vs initialURLs (depth = 5, outLinks = 5), logarithmic axes]

The Reactive Manifesto

Responsive

Message Driven

Elastic

Resilient

Why Does it Matter?

Responsive: Responds in a deterministic, timely manner

Resilient: Stays responsive in the face of failure, even cascading failures

Elastic: Stays responsive under workload spikes

Message Driven: Basic building block for responsive, resilient, and elastic systems

The Right Ingredients

• Kafka
  • Huge persistent buffer for the bursts
  • Load distribution to very large number of processing nodes
  • Enable horizontal scalability
• Akka Streams
  • High performance, highly efficient processing pipeline
  • Resilient with end-to-end back-pressure
  • Fully asynchronous – utilizes mapAsyncUnordered with Async HTTP client
• Async HTTP client
  • Non-blocking and consumes no threads in waiting
  • Integrates with Akka Streams for a high parallelism, low resource solution

[Diagram labels: Efficient, Resilient, Scale; Akka Streams, Async HTTP, Reactive Kafka]

Adding Kafka & Akka Streams

[Diagram components: Crawl Jobs, Job DB, Validate, URLCache, Download Process, Akka Streams; edges labeled URLs and Timestamps]

Akka Streams, what???

High performance, pure async, stream processing

Conforms to reactive streams

Simple, yet powerful GraphDSL allows clear stream topology declaration

Central point to understand processing pipeline

Crawl Stream

Actual Stream Declaration in Code

prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
bCast ~> dataSinkFlow ~> kafkaDataSink
bCast ~> hdfsDataSink
bCast ~> graphFlow ~> merge ~> graphSink
bCast0 ~> maxPage ~> merge
bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
bCastRetry ~> errorSink
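The declaration above only lists the edges of the graph. As a much smaller, self-contained illustration of the same GraphDSL style, the sketch below broadcasts crawl results to two sinks. The Page type, the status-code split, and the object name are hypothetical and are not taken from the crawler. It assumes Akka 2.6 or later, where an implicit ActorSystem is enough to run a graph.

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}

import scala.concurrent.Future

object FanOutSketch {
  // Hypothetical, simplified crawl result: a URL and its HTTP status.
  final case class Page(url: String, status: Int)

  // Broadcasts every page to two branches; each branch keeps what it needs.
  def graph(pages: List[Page]): RunnableGraph[(Future[Seq[Page]], Future[Seq[Page]])] =
    RunnableGraph.fromGraph(
      GraphDSL.create(Sink.seq[Page], Sink.seq[Page])((_, _)) {
        implicit builder => (okSink, errorSink) =>
          import GraphDSL.Implicits._
          val bCast = builder.add(Broadcast[Page](2))

          Source(pages) ~> bCast
          bCast ~> Flow[Page].filter(_.status < 400)  ~> okSink
          bCast ~> Flow[Page].filter(_.status >= 400) ~> errorSink
          ClosedShape
      })

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("fan-out-sketch")
    import system.dispatcher
    val pages = List(Page("a", 200), Page("b", 404), Page("c", 500), Page("d", 301))
    val (ok, failed) = graph(pages).run()
    for (o <- ok; f <- failed) {
      println(s"ok: ${o.size}, failed: ${f.size}")
      system.terminate()
    }
  }
}
```

Because Broadcast only emits when every downstream branch signals demand, a slow sink on one branch back-pressures the whole graph rather than letting data pile up in memory.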

[Diagram: Crawl Stream topology. Stages: PrioritizedSource, Crawl, Result, MaxPageReached, Retry, OutLinks, Data, Graph, CheckFail, CheckErr. Sinks: OutLinksSink, Kafka DataSink, HDFS DataSink, GraphSink, ErrorSink]

Resulting Characteristics

Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High latency URLs do not block low latency URLs using mapAsyncUnordered
• Well-controlled download concurrency using mapAsyncUnordered
• Thread per concurrent crawl job

Resilient
• Processes only what can be processed – no resource overload
• Kafka as short-term, persistent queue

Scale
• Kafka feeds next batch of URLs to available node cluster
• Pull model – only processes that have capacity will get the load
• Kafka distributes work to large number of processing nodes in cluster
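The concurrency control described above can be sketched with mapAsyncUnordered. This is a minimal illustration, not the crawler's code: fetch is a hypothetical stand-in for a non-blocking HTTP GET that fakes latency with the scheduler, and the URLs are made up. It assumes Akka 2.6 or later, where an implicit ActorSystem is enough to run a stream.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object BoundedDownloads {
  // Hypothetical async "download": completes after a URL-dependent delay
  // without holding a thread while waiting.
  def fetch(url: String)(implicit system: ActorSystem): Future[(String, Int)] = {
    val delay = (url.length % 3 + 1) * 100
    akka.pattern.after(delay.millis, system.scheduler)(
      Future.successful(url -> 200))(system.dispatcher)
  }

  // At most `parallelism` fetches are in flight at once. Results are emitted
  // as they complete, so a slow URL does not hold back faster ones.
  def crawl(urls: List[String], parallelism: Int)(
      implicit system: ActorSystem): Future[Seq[(String, Int)]] =
    Source(urls)
      .mapAsyncUnordered(parallelism)(url => fetch(url))
      .runWith(Sink.seq)

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("bounded-downloads")
    val urls = (1 to 20).map(i => s"http://example.com/page/$i").toList
    val results = Await.result(crawl(urls, parallelism = 4), 30.seconds)
    println(s"fetched ${results.size} pages")
    system.terminate()
  }
}
```

The parallelism argument is the only knob for download concurrency here. When all slots are busy the stage stops pulling from upstream, which is how back-pressure reaches the source.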

Back-Pressure

[Chart: Queue Size vs Time (seconds)]

[Chart: URLs/sec vs Time (seconds)]

initialURLs: 100
parallelism: 1000
processTime: 1 – 5 s
outLinks: 0 - 10
depth: 5
totalCrawled: 312500

Challenges

Training
• Developers not used to E2E stream definitions
• More familiar with deeply nested function calls

Maturity of Infrastructure
• Kafka 0.9 uses fetch as heartbeat
• Slow nodes cause timeout & rebalance
• Solved in 0.10

What it would have been…

Bloated, ineffective concurrency control

Lack of well-thought-out and visible processing pipeline

Clumsy code, hard to manage & understand

Low training cost, high project TCO

Dev / Support / Maintenance

Bottom Line

Crawl Time Reduced to 1/10th (compared to thread-based architecture)

Standardized Reactive Platform

For Large Scale Internet Deployments

Efficiency & Resilience meets Standardization

• Monitoring
  • Need to collect metrics, consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.

Consistency in the face of Heterogeneity

squbs is not…

A framework on its own

A programming model – use Akka

Take all or none – Components/patterns can mostly be used independently

squbs

Akka for large scale deployments

Bootstrap

Lifecycle management

Loosely-coupled module system

Integration hooks for logging, monitoring, ops integration

squbs

Akka for large scale deployments

JSON console

HttpClient with pluggable resolver and monitoring/logging hooks

Test tools and interfaces

Goodies:
- Activators for Scala & Java
- Programming patterns and helpers for Akka and Akka Stream use cases…, and growing

PerpetualStream

• Provides a convenience trait to help write streams controlled by system lifecycle

• Minimal/no message losses

• Register PerpetualStream to make stream start/stop

• Provides customization hooks – especially for how to stop the stream

• Provides killSwitch (from Akka) to be embedded into stream

• Implementers - just provide your stream!

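A non-stop stream; starts and stops with the system

import akka.Done
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}
import org.squbs.stream.PerpetualStream
import scala.concurrent.Future

class MyStream extends PerpetualStream[Future[Done]] {

  // Endless counter that wraps back to 0 at Int.MaxValue.
  def generator = Iterator.iterate(0) { p => if (p == Int.MaxValue) 0 else p + 1 }
  val source = Source.fromIterator(generator _)
  // Sink.ignore takes no type parameter and materializes a Future[Done].
  val ignoreSink = Sink.ignore

  override def streamGraph = RunnableGraph.fromGraph(
    GraphDSL.create(ignoreSink) { implicit builder => sink =>
      import GraphDSL.Implicits._
      source ~> killSwitch.flow[Int] ~> sink
      ClosedShape
    })
}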

PersistentBuffer/BroadcastBuffer

A buffer of virtually unlimited size

• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer – very large buffers
• Restarts gracefully with no or minimal message loss
• Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer – a FanOutShape decouples each output port making each downstream independent
• Useful if downstream stage blocked or unavailable
• Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue

Summary

• Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts & High Efficiency
• Akka Streams
  • Clear view of stream topology
  • Back-pressure & Kafka allow buffering load bursts
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too, with goodies like
  • PerpetualStream
  • PersistentBuffer
  • BroadcastBuffer

Q&A – Feedback Appreciated

Join us on – link from https://github.com/paypal/squbs

@squbs, @S_Akara, @anilgursel
