Back-Pressure in Action
Handling High-Burst Workloads with Akka Streams & Kafka
Akara Sucharitakul, PayPal
Anil Gursel, PayPal
Intro & Agenda
Crawler Intro & Problem Statements
Crawler Architecture
Infrastructure: Akka Streams, Kafka, etc.
The Goodies
High-Level View
[Diagram: Crawl Jobs, Job DB, Validate, URLCache, Download Process; flows labeled URLs and Timestamps]
Requirements
Ever-expanding # of URLs
Can’t crawl all URLs at once
Control over concurrent web GETs
Efficient resource usage
Resilient under high burst
Scales horizontally & vertically
Sizing the Crawl Job
Let:
i = Number of crawl URLs in a job
n = Average number of links per page
d = The crawl depth (how many layers to follow links)
u = The max number of URLs to process
Then:
$u = i \cdot n^{d}$
[Chart: totalURLs vs depth (initialURLs = 1, outLinks = 5)]
[Chart: totalURLs vs initialURLs (depth = 5, outLinks = 5)]
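The sizing rule can be checked with a few lines of Scala. This is a minimal sketch using only the standard library; the object and method names (CrawlSizing, maxUrls) are made up for illustration and do not come from the talk. It evaluates the product of the initial URL count and the average links per page raised to the crawl depth, using BigInt so large depths do not overflow.

```scala
object CrawlSizing {
  /** Upper bound on URLs to process: initial URLs times (links per page) to the power of depth. */
  def maxUrls(initialUrls: Long, avgLinksPerPage: Int, depth: Int): BigInt =
    BigInt(initialUrls) * BigInt(avgLinksPerPage).pow(depth)
}
```

For example, CrawlSizing.maxUrls(100, 5, 5) gives 312500, and CrawlSizing.maxUrls(1, 5, 5) gives 3125. Growth is linear in the number of initial URLs but exponential in depth, which is why depth dominates job size.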
The Reactive Manifesto
Responsive
Message Driven
Elastic
Resilient
Why Does it Matter?
Responsive: Responds in a deterministic, timely manner
Resilient: Stays responsive in the face of failure – even cascading failures
Elastic: Stays responsive under workload spikes
Message Driven: Basic building block for responsive, resilient, and elastic systems
The Right Ingredients
• Kafka
  • Huge persistent buffer for the bursts
  • Load distribution to very large number of processing nodes
  • Enable horizontal scalability
• Akka Streams
  • High performance, highly efficient processing pipeline
  • Resilient with end-to-end back-pressure
  • Fully asynchronous – utilizes mapAsyncUnordered with Async HTTP client
• Async HTTP client
  • Non-blocking and consumes no threads in waiting
  • Integrates with Akka Streams for a high parallelism, low resource solution
[Diagram labels: Efficient, Resilient, Scale; Akka Stream, Async HTTP, Reactive Kafka]
Adding Kafka & Akka Streams
[Diagram: Crawl Jobs, Job DB, Validate, URLCache, Download Process, Akka Streams; flows labeled URLs and Timestamps]
Akka Streams, what???
High performance, pure async, stream processing
Conforms to reactive streams
Simple, yet powerful GraphDSL allows clear stream topology declaration
Central point to understand processing pipeline
Crawl Stream
Actual Stream Declaration in Code

    prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
                                                           bCast ~> dataSinkFlow ~> kafkaDataSink
                                                           bCast ~> hdfsDataSink
                                                           bCast ~> graphFlow ~> merge ~> graphSink
                                       bCast0 ~> maxPage ~> merge
                                       bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
                                                          bCastRetry ~> errorSink

[Diagram stages: PrioritizedSource, Crawl, Result, MaxPageReached, Retry, OutLinks, Data, Graph, CheckFail, CheckErr; sinks: OutLinksSink, Kafka DataSink, HDFS DataSink, GraphSink, ErrorSink]
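The declaration above is only the wiring. To show how such a fan-out/fan-in topology becomes a runnable graph, here is a much smaller sketch, assuming Akka 2.6 (where an implicit ActorSystem supplies the materializer). Everything in it is hypothetical and not from the crawler: the Page type, the "bad" URL convention that stands in for a failed download, and the stage names. It broadcasts crawl results into a data branch and a retry branch, then merges both into one sink.

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object CrawlTopologySketch {
  // Hypothetical crawl result; `ok` stands in for a successful download.
  final case class Page(url: String, ok: Boolean)

  def run(urls: List[String]): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("crawl-topology-sketch")

    // Stand-in for the real crawl stage: URLs containing "bad" are treated as failures.
    val crawl     = Flow[String].map(u => Page(u, ok = !u.contains("bad")))
    val dataFlow  = Flow[Page].filter(_.ok).map(p => s"data:${p.url}")
    val retryFlow = Flow[Page].filter(!_.ok).map(p => s"retry:${p.url}")

    val graph = RunnableGraph.fromGraph(
      GraphDSL.create(Sink.seq[String]) { implicit builder => out =>
        import GraphDSL.Implicits._
        val bCast = builder.add(Broadcast[Page](2)) // one input, two outputs
        val merge = builder.add(Merge[String](2))   // two inputs, one output

        Source(urls) ~> crawl ~> bCast ~> dataFlow  ~> merge ~> out
                                 bCast ~> retryFlow ~> merge
        ClosedShape
      })

    // Blocking here only to return a plain value from the sketch.
    try Await.result(graph.run(), 5.seconds) finally system.terminate()
  }
}
```

Calling CrawlTopologySketch.run(List("a", "bad-b", "c")) yields the three elements data:a, data:c and retry:bad-b; Merge does not guarantee the order in which the branches interleave. Back-pressure flows through every edge: if the sink slows down, the merge, both branches, the broadcast, and finally the source slow down with it.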
Resulting Characteristics
Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High latency URLs do not block low latency URLs using mapAsyncUnordered
• Well-controlled download concurrency using mapAsyncUnordered
• Thread per concurrent crawl job
Resilient
• Processes only what can be processed – no resource overload
• Kafka as short-term, persistent queue
Scale
• Kafka feeds next batch of URLs to available node cluster
• Pull model – only processes that have capacity will get the load
• Kafka distributes work to large number of processing nodes in cluster
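The two mapAsyncUnordered points can be illustrated with a small sketch, assuming Akka 2.6. The real crawler uses an async HTTP client; here a timer-completed Future (akka.pattern.after) stands in for the GET so the example needs no network. The URLs, the delays, and the names BoundedDownloads and run are hypothetical. The sketch counts how many simulated downloads are in flight and reports the peak.

```scala
import akka.actor.ActorSystem
import akka.pattern.after
import akka.stream.scaladsl.{Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object BoundedDownloads {
  /** Returns (pages downloaded, peak number of concurrent downloads). */
  def run(urlCount: Int, parallelism: Int): (Int, Int) = {
    implicit val system: ActorSystem = ActorSystem("bounded-downloads")
    import system.dispatcher

    val inFlight = new AtomicInteger(0)
    val peak     = new AtomicInteger(0)

    val done = Source(1 to urlCount)
      .map(i => s"http://example.invalid/page/$i") // hypothetical URLs, never fetched
      .mapAsyncUnordered(parallelism) { url =>
        val now = inFlight.incrementAndGet()
        peak.updateAndGet(m => math.max(m, now))
        // Every tenth URL is "slow"; it does not hold back the faster ones behind it.
        val delay = if (url.endsWith("0")) 200.millis else 20.millis
        // Timer-based completion: no thread is blocked while "waiting" for the response.
        after(delay, system.scheduler)(Future.successful(url)).map { page =>
          inFlight.decrementAndGet()
          page
        }
      }
      .runWith(Sink.seq)

    val pages = try Await.result(done, 30.seconds) finally system.terminate()
    (pages.size, peak.get)
  }
}
```

With BoundedDownloads.run(50, 8), all 50 pages complete and the peak never exceeds 8: the stage stops pulling from upstream once 8 futures are outstanding, which is the back-pressure that keeps download concurrency under control.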
Back-Pressure
[Chart: Queue Size vs Time (seconds)]
[Chart: URLs/sec vs Time (seconds)]
initialURLs: 100
parallelism: 1000
processTime: 1 – 5 s
outLinks: 0 – 10
depth: 5
totalCrawled: 312500
Challenges
Training
• Developers not used to E2E stream definitions
• More familiar with deeply nested function calls
Maturity of Infrastructure
• Kafka 0.9 uses fetch as heartbeat
• Slow nodes cause timeout & rebalance
• Solved in 0.10
What it would have been…
Bloated, ineffective concurrency control
Lack of well-thought-out and visible processing pipeline
Clumsy code, hard to manage & understand
Low training cost, high project TCO
Dev / Support / Maintenance
Bottom Line
Crawl Time Reduced to 1/10th (compared to thread-based architecture)
Standardized Reactive Platform
For Large Scale Internet Deployments
Efficiency & Resilience meets Standardization
• Monitoring
  • Need to collect metrics, consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.
Consistency in the face of Heterogeneity
squbs is not…
A framework on its own
A programming model – use Akka
Take all or none – Components/patterns can mostly be used independently
squbs
Akka for large scale deployments
Bootstrap
Lifecycle management
Loosely-coupled module system
Integration hooks for logging, monitoring, ops integration
squbs
Akka for large scale deployments
JSON console
HttpClient with pluggable resolver and monitoring/logging hooks
Test tools and interfaces
Goodies:
- Activators for Scala & Java
- Programming patterns and helpers for Akka and Akka Stream use cases…, and growing
PerpetualStream
• Provides a convenience trait to help write streams controlled by system lifecycle
  • Minimal/no message losses
• Register PerpetualStream to make stream start/stop
• Provides customization hooks – especially for how to stop the stream
• Provides killSwitch (from Akka) to be embedded into stream
• Implementers - just provide your stream!
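A non-stop stream; starts and stops with the system

    import akka.Done
    import akka.stream.ClosedShape
    import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}
    import org.squbs.stream.PerpetualStream
    import scala.concurrent.Future

    // Sink.ignore materializes a Future[Done], so that is the stream's materialized type.
    class MyStream extends PerpetualStream[Future[Done]] {
      // Endless counter that wraps around at Int.MaxValue.
      def generator = Iterator.iterate(0) { p => if (p == Int.MaxValue) 0 else p + 1 }
      val source = Source.fromIterator(generator _)
      val ignoreSink = Sink.ignore

      override def streamGraph = RunnableGraph.fromGraph(
        GraphDSL.create(ignoreSink) { implicit builder => sink =>
          import GraphDSL.Implicits._
          // killSwitch comes from PerpetualStream; it lets the system lifecycle stop the stream.
          source ~> killSwitch.flow[Int] ~> sink
          ClosedShape
        })
    }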
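To see what the embedded kill switch does on its own, here is a sketch in plain Akka Streams (assuming Akka 2.6), without squbs. It is not how PerpetualStream is implemented; it only shows the Akka mechanism that the slide says is embedded into the stream. The names KillSwitchSketch and run, the throttle rate, and the 200 ms wait are arbitrary choices for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.KillSwitches
import akka.stream.scaladsl.Source
import scala.concurrent.Await
import scala.concurrent.duration._

object KillSwitchSketch {
  /** Runs an endless stream, stops it from outside, and returns how many elements passed. */
  def run(): Long = {
    implicit val system: ActorSystem = ActorSystem("kill-switch-sketch")
    val killSwitch = KillSwitches.shared("sketch-switch")

    val counted = Source.fromIterator(() => Iterator.from(0)) // never completes by itself
      .throttle(100, 1.second)                                // slow it down for the demo
      .via(killSwitch.flow)                                   // external stop point
      .runFold(0L)((n, _) => n + 1)

    Thread.sleep(200)     // let the stream run briefly (demo only)
    killSwitch.shutdown() // completes downstream and cancels upstream

    try Await.result(counted, 5.seconds) finally system.terminate()
  }
}
```

The call to shutdown() completes the stream normally, so the fold emits its count instead of failing. Elements already past the switch are still processed, which fits the "minimal/no message losses" goal stated for PerpetualStream.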
PersistentBuffer/BroadcastBuffer
• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer – very large buffers
• Restarts gracefully with no or minimal message loss
• Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer – a FanOutShape decouples each output port making each downstream independent
• Useful if downstream stage blocked or unavailable
  • Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue
A buffer of virtually unlimited size
Summary
• Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts & High Efficiency
• Akka Streams
  • Clear view of stream topology
  • Back-pressure & Kafka allow buffering load bursts
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too, with goodies like
  • PerpetualStream
  • PersistentBuffer
  • BroadcastBuffer
Q&A – Feedback Appreciated
Join us on – link from https://github.com/paypal/squbs
@squbs, @S_Akara, @anilgursel