TRANSCRIPT
The LMAX Architecture with Disruptors: 6M Transactions per Second
Stephan Schmidt, Vice CTO, brands4friends
@codemonkeyism www.codemonkeyism.com
brands4friends
No. 1 shopping club in Germany
> 360k daily visitors
> 4.5M users
eBay company
20.04.12 5 WJAX 2011
Development at brands4friends
Team: Java and web developers, data warehouse developers
Process: Scrum since 2009, Kanban for DWH since 2012
LMAX - The London Multi-Asset Exchange
"We aim to build the highest performance financial exchange in the world"
High Performance Transaction Processing
Service / Transaction Processor
Receive → Unmarshal → Replicate / Journal → Business Logic → Marshal → Send
[Figure: the same pipeline with queues between the stages (six queues)]
[Figure: GHz per CPU and number of cores]
Actors? SEDA?
Stuff that did not work for various reasons
1. RDBMS
2. Actors
3. SEDA
4. J2EE …
LMAX Architecture
Queue as a data structure
[Figure: linked-list queue: a chain of nodes plus a size field, with Add and Remove ends]
[Figure: array queue laid out across cache lines, with Add and Remove positions]
Problems with Queues
1. Reading (take) and writing (add) are both write accesses => write contention
2. Write contention is solved with locks
   1. Other solutions include deques
3. Locks lead to context switches into the kernel
   1. Context switches lead to CPU cache misses etc.
   2. The kernel might use the opportunity to do other work as well
Cost of Locks according to the LMAX Paper
Method                               Time in ms
Single thread                               300
Single thread with lock                  10,000
Two threads with lock                   224,000
Single thread with CAS                    5,700
Two threads with CAS                     30,000
Single thread with volatile write         4,700
CAS = “Compare And Swap”: AtomicReference etc. in Java => no context switch
Volatile write: memory read/write barrier
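To make the difference between the lock and CAS rows concrete, here is a minimal sketch of the two ways of incrementing a shared counter. It is not the benchmark from the LMAX paper; the class name `CounterSketch`, the method names and the iteration counts are assumptions made for illustration. Only `java.util.concurrent` classes from the JDK are used.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class CounterSketch {

    // Variant 1: a plain long guarded by a lock. Under contention the
    // losing thread may be parked, which can involve the kernel.
    static long countWithLock(int threads, int incrementsPerThread) {
        final long[] counter = {0};
        final ReentrantLock lock = new ReentrantLock();
        runAll(threads, () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                lock.lock();
                try {
                    counter[0]++;
                } finally {
                    lock.unlock();
                }
            }
        });
        return counter[0];
    }

    // Variant 2: compare-and-swap retry loop. A thread that loses the race
    // simply re-reads the value and tries again; it never blocks.
    static long countWithCas(int threads, int incrementsPerThread) {
        final AtomicLong counter = new AtomicLong();
        runAll(threads, () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                long current;
                do {
                    current = counter.get();
                } while (!counter.compareAndSet(current, current + 1));
            }
        });
        return counter.get();
    }

    // Starts the given number of threads on the same task and waits for all of them.
    private static void runAll(int threads, Runnable work) {
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(work);
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(countWithLock(2, 100_000)); // 200000
        System.out.println(countWithCas(2, 100_000));  // 200000
    }
}
```

Both variants produce the same count; what differs is what a thread does when it loses the race for the counter. Timing them reliably would need a proper benchmark harness, which this sketch deliberately leaves out.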
LMAX Data Structure – Ring Buffer
[Figure: ring buffer between a Publisher and an Event Processor]
Pre-Allocation of Buckets
[Figure: ring buffer with 2^5 = 32 slots numbered 0 to 31, with a Publisher writing and an Event Processor reading]
• No (fewer) GC problems
• Objects are near each other in memory => cache friendly
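The pre-allocation idea can be sketched in a few lines of Java. This is an illustration only, not the Disruptor's actual `RingBuffer` class: the names `PreallocatedRing`, `Event` and `slotFor` are hypothetical. All event objects are created once, and a power-of-two size lets a sequence number be mapped to a slot with a bit mask instead of a modulo.

```java
public class PreallocatedRing {

    // Mutable event object that is reused for the lifetime of the buffer,
    // so steady-state operation creates no garbage.
    static final class Event {
        long value;
    }

    private final Event[] slots;
    private final int mask;

    PreallocatedRing(int sizePowerOfTwo) {
        if (sizePowerOfTwo <= 0 || Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new Event[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
        // Allocate every slot up front. Objects allocated back to back tend
        // to end up close together on the heap, although the JVM does not
        // guarantee this.
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Event();
        }
    }

    // Maps an ever-increasing sequence number onto a slot; with 32 slots,
    // sequence 32 wraps around to slot 0.
    Event slotFor(long sequence) {
        return slots[(int) (sequence & mask)];
    }

    public static void main(String[] args) {
        PreallocatedRing ring = new PreallocatedRing(32); // 2^5 slots as on the slide
        ring.slotFor(0).value = 42;
        System.out.println(ring.slotFor(32).value); // 42: same slot as sequence 0
    }
}
```

Writers overwrite the `value` field of an existing `Event` rather than allocating a new object per message, which is where the reduced GC pressure comes from.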
Coordination
[Figure: the same ring buffer of 2^5 slots with Publisher and Event Processor]
Claim Strategy
1. Claim
2. Write
3. Make public by advancing the sequence
Wait Strategy
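The claim, write, publish steps and a wait strategy can be sketched for the simplest case of one publisher thread and one event processor thread. This is a simplified illustration, not the Disruptor library's API: `SequencedRing`, `publish`, `take` and `sumThrough` are hypothetical names, the wait strategy shown is a plain busy spin, and `Thread.onSpinWait()` assumes Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequencedRing {
    private final long[] slots;
    private final int mask;
    // Last sequence the event processor is allowed to read.
    private final AtomicLong published = new AtomicLong(-1);
    // Last sequence the event processor has finished with.
    private final AtomicLong consumed = new AtomicLong(-1);
    // Only ever touched by the single publisher thread, so no lock or CAS is needed.
    private long nextToClaim = 0;

    SequencedRing(int sizePowerOfTwo) {
        if (sizePowerOfTwo <= 0 || Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Called by the single publisher thread.
    void publish(long value) {
        long seq = nextToClaim++;                      // 1. claim the next sequence
        while (seq - consumed.get() > slots.length) {  // slot still unread: wait
            Thread.onSpinWait();
        }
        slots[(int) (seq & mask)] = value;             // 2. write into the claimed slot
        published.set(seq);                            // 3. make public by advancing the sequence
    }

    // Called by the single event processor thread.
    long take() {
        long seq = consumed.get() + 1;
        while (published.get() < seq) {                // wait strategy: busy spin
            Thread.onSpinWait();
        }
        long value = slots[(int) (seq & mask)];
        consumed.set(seq);                             // frees the slot for the publisher
        return value;
    }

    // Pushes 1..count through a small buffer and returns the sum seen by the reader.
    static long sumThrough(int bufferSize, int count) {
        SequencedRing ring = new SequencedRing(bufferSize);
        Thread publisher = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                ring.publish(i);
            }
        });
        publisher.start();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += ring.take();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumThrough(8, 1000)); // 500500
    }
}
```

The slot contents are plain writes; visibility comes from the volatile write to `published` that follows them. A different wait strategy would replace the spin loops with yielding or blocking, trading latency for CPU usage.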
Latency
Receive Message
Journal
Replicate
Unmarshal
Business Logic
Service / Transaction Processor
Receive → Unmarshal → Replicate / Journal → Business Logic → Marshal → Send
[Figure: the queues are replaced by a shared data structure: Input Disruptor → Business Logic Handler → Output Disruptors]
LMAX Architecture
Input Disruptor
Receiver
Journaler
Replicator
Un-Marshaller
Business Logic Handler
Output Disruptor
Publisher
Marshaller
HA Node
File System
Each stage can have multiple threads
[Figure: ring buffer slots 18 to 31 with the positions of Receiver, Journaler, Replicator, Un-Marshaller and Business Logic Handler]
The Receiver is writing to slot 31. The Journaler and Replicator are reading slot 24 and can advance up to slot 30.
The Business Logic Handler must stay behind all the others.
The Un-Marshaller can move ahead of the Journaler and Replicator, up to slot 30.
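The rule that the Business Logic Handler stays behind all other handlers amounts to taking the minimum of the upstream sequences. A minimal sketch, assuming each handler exposes the last slot it has finished; the class name `GatingSketch`, the method `highestAvailable` and the concrete slot numbers are hypothetical and chosen to resemble the slide.

```java
public class GatingSketch {

    // Highest sequence a dependent handler may process: it can never pass
    // the slowest handler it depends on.
    static long highestAvailable(long... upstreamSequences) {
        long min = Long.MAX_VALUE;
        for (long sequence : upstreamSequences) {
            min = Math.min(min, sequence);
        }
        return min;
    }

    public static void main(String[] args) {
        // Assumed state: the Receiver has published up to slot 30,
        // Journaler and Replicator have finished slot 23,
        // the Un-Marshaller has already finished slot 30.
        long receiverPublished = 30;
        long journaler = 23;
        long replicator = 23;
        long unMarshaller = 30;

        // Journaler, Replicator and Un-Marshaller only depend on the Receiver.
        System.out.println(highestAvailable(receiverPublished)); // 30

        // The Business Logic Handler depends on all three of them.
        System.out.println(highestAvailable(journaler, replicator, unMarshaller)); // 23
    }
}
```

Because each handler only reads the sequences of the handlers it depends on, no handler ever writes to a sequence owned by another.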
Java API
[Figures: several arrangements of one producer P1 with consumers C1 to C4, and one arrangement of two producers P1 and P2 with a single consumer C1]
Demo
LMAX Low Level Ideas
1. Simple Code
2. Everything in memory
3. Single threaded per CPU for business logic
4. Business logic has no I/O, I/O is done somewhere else
5. Scheduler “knows” dependencies of handlers
6M TPS? How did LMAX do it?
10K+ TPS
• If you don't do anything stupid
• 3 billion instructions per second on a modern CPU
x 10
100K+ TPS
• Clean, organized code
• Standard libraries
x 10
1000K+ TPS
• Custom, cache-friendly collections
• Performance testing
• Controlled GC
• Very well modeled domain
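The tiers can be read as a per-transaction instruction budget. The following back-of-the-envelope sketch assumes that the figure of 3 billion instructions is per second on a single core and ignores cache misses and I/O, so it is an upper bound rather than a measurement; the class and method names are hypothetical.

```java
public class InstructionBudget {

    // Instructions available per transaction at a given throughput.
    static long instructionsPerTransaction(long instructionsPerSecond, long transactionsPerSecond) {
        return instructionsPerSecond / transactionsPerSecond;
    }

    public static void main(String[] args) {
        long instructionsPerSecond = 3_000_000_000L; // assumption: one core
        long[] tiers = {10_000L, 100_000L, 1_000_000L, 6_000_000L};
        for (long tps : tiers) {
            System.out.println(tps + " TPS -> "
                    + instructionsPerTransaction(instructionsPerSecond, tps)
                    + " instructions per transaction");
        }
        // 10000 TPS -> 300000 instructions per transaction
        // 100000 TPS -> 30000 instructions per transaction
        // 1000000 TPS -> 3000 instructions per transaction
        // 6000000 TPS -> 500 instructions per transaction
    }
}
```

Each factor of ten in throughput removes a factor of ten from the budget, which is why the higher tiers depend on cache-friendly data structures and controlled GC rather than on clean code alone.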
We’re looking for very good developers
Thanks! @codemonkeyism
Images CC from Flickr: nimboo, imjustcreative, gremionis, justonlysteve, John_Scone, Matthias Wicke, irisgodd3ss, TunnelBug, alandd, seasonal wanderer, raulbarraltamayo, Gilmoth, Dunechaser, graftedno1
Sources
“Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads”, Martin Thompson, Dave Farley, Michael Barker, Patricia Gee, Andrew Stewart, 2011
“The LMAX Architecture”, Martin Fowler, 2011, http://martinfowler.com/articles/lmax.html
“How to do 100K+ TPS at less than 1ms latency”, Martin Thompson, Michael Barker, 2010