Post on 14-Apr-2018
-
7/30/2019 Bullet Cache
1/40
Bullet Cache
Balancing speed and usability in a cache server
Ivan Voras
-
What is it?
People know what memcached is... mostly
Example use case:
So you have a web page which is just dynamic enough that you can't cache it completely as an HTML dump
You have a SQL query on this page which is 99.99% always the same (same query, same answer)
...so you cache it
-
Why a cache server?
Sharing between processes on different servers
In environments which do not implement application persistency
CGI, FastCGI, PHP
Or you're simply lazy and want something which works...
-
A little bit of history...
This started as my pet project...
It's so ancient that when I first started working on it, Memcached was still single-threaded
It's gone through at least one rewrite and a whole change of concept
I made it because of the frustration I felt while working with Memcached
Key-value databases are so very basic
I could do better than that :)
-
Now...
Used in production in my university's project
Probably the fastest memory cache engine available (in specific circumstances)
Available in FreeBSD ports (databases/mdcached)
Has a 20-page User Manual :)
-
What's wrong with memcached?
Nothing much; it's solid work
The classic problem: cache expiry / invalidation
memcached accepts a list of records to expire (inefficient: you need to maintain this list)
It's fast, but is it fast enough?
Does it really make use of multiple CPUs as efficiently as possible?
-
Introducing the Bullet Cache
1. Offers a smarter data structure to the user side than a simple key-value pair
2. Implements interesting internal data structures
3. Some interesting bells & whistles
-
User-visible structure
Traditional (memcached) style:
Key-value pairs
Relatively short keys (255 bytes)
ASCII-only keys (?) (ASCII-only protocol)
Multi-record operations only with a list of records
Simple atomic operations (relatively inefficient: atoi())
-
Introducing record tags
They are metadata
Constraints:
Must be fast (they are NOT db indexes)
Must allow certain types of bulk operations
The implementation:
Both key and value are signed integers
No limit on the number of tags per record
Bulk queries: (tkey X) && (tval1, [tval2...])
-
Record tags
I heard you like key-value records, so I've put key-value records into your key-value records...
[diagram: a record holds its key, its value, and a list of (k, v) tag pairs as generic metadata]
-
Metadata queries (1)
Use case example: a web application has a page /contacts which contains data from several SQL queries as well as other sources (LDAP)
Tag all cached records with (tkey, tval) = (42, hash(/contacts))
When constructing the page, issue query: get_by_tag_values(42, hash(/contacts))
When expiring all data, issue query: del_by_tag_values(42, hash(/contacts))
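The idea can be sketched in C as a toy in-memory model: records carry (tkey, tval) tag pairs, and bulk retrieval/expiry walks the tags. The function names mirror the slide; the real Bullet Cache client API and data structures differ, so treat this purely as an illustration of the tagging concept.

```c
/* Toy model of tag-based bulk retrieval/expiry. Illustrative only:
 * the real Bullet Cache API and internals differ. */
#include <assert.h>
#include <stdint.h>

#define MAX_TAGS    4
#define MAX_RECORDS 16

struct tag { int32_t tkey, tval; };      /* tags are signed-integer pairs */

struct record {
    const char *key, *value;
    struct tag  tags[MAX_TAGS];
    int         ntags;
    int         live;                    /* 0 = expired */
};

static struct record table[MAX_RECORDS]; /* no bounds checks: sketch only */
static int nrecords = 0;

static int has_tag(const struct record *r, int32_t tkey, int32_t tval)
{
    for (int i = 0; i < r->ntags; i++)
        if (r->tags[i].tkey == tkey && r->tags[i].tval == tval)
            return 1;
    return 0;
}

/* Count live records matching (tkey, tval): models get_by_tag_values(). */
int get_by_tag_values(int32_t tkey, int32_t tval)
{
    int n = 0;
    for (int i = 0; i < nrecords; i++)
        if (table[i].live && has_tag(&table[i], tkey, tval))
            n++;
    return n;
}

/* Expire all records matching (tkey, tval): models del_by_tag_values(). */
int del_by_tag_values(int32_t tkey, int32_t tval)
{
    int n = 0;
    for (int i = 0; i < nrecords; i++)
        if (table[i].live && has_tag(&table[i], tkey, tval)) {
            table[i].live = 0;
            n++;
        }
    return n;
}

void put_tagged(const char *key, const char *value, int32_t tkey, int32_t tval)
{
    struct record *r = &table[nrecords++];
    r->key = key; r->value = value; r->live = 1;
    r->tags[0] = (struct tag){ tkey, tval };
    r->ntags = 1;
}
```

One (tkey, tval) pair per page lets all cached fragments for that page be fetched or invalidated in a single call, instead of maintaining an explicit list of record keys as in memcached.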
-
Metadata queries (2)
Use case example: application objects are stored (serialized, marshalled) into the cache, and there's a need to invalidate (expire) all objects of a certain type
Tag records with (tkey, tval) = (object_type, instance_id)
Expire with del_by_tag_values(object_type, instance_id)
Also possible: tagging object interdependence
-
Under the hood
It's interesting...
Started as a C project, now mostly converted to C++ for easier modularization
Still uses C-style structures and algorithms for the core parts, i.e. not std:: containers
Contains tests and benchmarks within themain code base
C and PHP client libraries
-
The main data structure
A forest of trees, anchored in hash table buckets
Buckets are directly addressed by hashing record keys
Buckets are protected by rwlocks
[diagram: a hash table whose buckets (hash values H1, H2, ...) each hold an RW lock and the root of a red-black tree of record nodes]
-
Basic operation
1. Find h = Hash(key)
2. Acquire lock #h
3. Find the record in the RB tree, indexed by key
4. Perform the operation
5. Release lock #h
[diagram: the same hash table structure, with the five steps annotated on one bucket's RW lock and red-black tree]
-
Record tags follow a similar pattern
The tags index the main structure and are maintained (almost) independently
[diagram: a separate hash table of tag buckets, each with an RW lock and tag-key entries TK1...TK4]
-
Concurrency and locking
Concurrency is great: the default configuration starts 256 record buckets and 64 tag buckets
Locking is without ordering assumptions
*_trylock() for everything
rollback-and-retry
No deadlocks
Livelocks, on the other hand, need to be investigated
-
Two-way linking between records and tags
[diagram: record-bucket tree nodes link to the tag buckets that reference them, and the tag buckets link back to the record nodes they index]
-
Concurrency
Scenario 1: a record is referenced; need to hold N tag bucket locks
Scenario 2: a tag is referenced; need to hold M record bucket locks
[chart: percentage of uncontested lock acquisitions (U) vs. number of hash table buckets (H, 4 to 256), for NCPU=8 and NCPU=64, shared and exclusive locks]
-
Multithreading models
Aka which thread does what
Three basic tasks:
T1: Connection acceptance
T2: Network IO
T3: Payload work
The big question: how to distribute these into threads?
-
Multithreading models
SPED: Single-process, event-driven
SEDA: Staged, event-driven architecture
AMPED: Asymmetric, multi-process, event-driven
SYMPED: Symmetric, multi-process, event-driven

Model    New connection handler  Network IO handler    Payload work
SPED     1 thread                In connection thread  In connection thread
SEDA     1 thread                N1 threads            N2 threads
SEDA-S   1 thread                N threads             N threads
AMPED    1 thread                1 thread              N threads
SYMPED   1 thread                N threads             In network thread
-
All the models are event-driven
The dumb model: thread-per-connection
Not really efficient (FreeBSD has experimented with KSE and M:N threading, but that didn't work out)
IO events: via kqueue(2)
Inter-thread synchronization: queues signalled with CVs
-
SPED
Single-threaded, event-driven
Very efficient on single-CPU systems
Most efficient if the operation is very fast (compared to network IO and event handling)
Used in efficient Unix network servers
-
SEDA
Staged, event-driven
Different task threads instantiated in different numbers
Generally, N1 != N2 != N3
The most queueing
The most separation of tasks; most CPUs used
[diagram: one T1 thread feeding a pool of T2 threads, which feed a pool of T3 threads]
-
AMPED
Asymmetric, multi-process, event-driven
Asymmetric: N(T2) != N(T3)
Assumes network IO processing is cheap compared to operation processing
Moderate amount of queueing
Can use an arbitrary number of CPUs
[diagram: one T1 thread and one T2 thread feeding a pool of T3 threads]
-
SYMPED
Symmetric, multi-process, event-driven
Symmetric: grouping of tasks
Assumes network IO and operation processing are similarly expensive but uniform
Sequential processing inside threads
Similar to multiple instances of SPED
[diagram: one T1 thread feeding a pool of combined T2+T3 threads]
-
Multithreading models in Bullet Cache
Command-line configuration:
n: number of network threads
t: number of payload threads
n=0, t=0: SPED
n=1, t>0: AMPED
n>0, t=0: SYMPED
n>1, t>0: SEDA
n>1, t>1, n=t: SEDA-S (symmetrical)
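The n/t-to-model mapping above can be written down as a small C function. The conditions are read directly off the slide; the actual option parsing in mdcached may differ in precedence or edge cases.

```c
/* Map (network threads, payload threads) to the threading model,
 * as described on the slide. Illustrative, not mdcached's parser. */
#include <string.h>

const char *threading_model(int n, int t)
{
    if (n == 0 && t == 0) return "SPED";
    if (n == 1 && t > 0)  return "AMPED";
    if (n > 0 && t == 0)  return "SYMPED";
    if (n > 1 && t > 0)   return n == t ? "SEDA-S" : "SEDA";
    return "invalid";
}
```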
-
How does that work?
SEDA: the same network loop accepts connections and network IO
Others: the network IO threads accept messages, then either:
process them in-thread, or
queue them on worker thread queues
Response messages are either sent in-thread from whichever thread generates them, or finished with the IO event code
-
Performance of various models
[chart: thousands of transactions/s (0 to 500) vs. number of clients (20 to 140), for SPED, SEDA, SEDA-S, AMPED and SYMPED]
Except in special circumstances, SYMPED is best
-
Why is SYMPED efficient?
The same thread receives the message and processes it
No queueing
No context switching
In the optimal case: no (b)locking delays of any kind
Downsides:
Serializes network IO and processing within the thread (which is OK if per-CPU)
-
Notable performance optimizations
Zero-copy operation
Queries which do not involve complex processing or record aggregation are satisfied directly from data structures
Zero-malloc operation
The code re-uses memory buffers as much as possible; the fast path is completely malloc()- and memcpy()-free
Adaptive dynamic buffer sizes
malloc() usage is tracked to avoid realloc()
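The buffer-reuse idea can be sketched like this: each connection keeps one buffer plus a high-water mark, so in steady state requests are served with no allocation at all. The struct and growth policy here are illustrative assumptions, not Bullet Cache's actual code.

```c
/* Sketch of adaptive buffer reuse: track the largest request seen and
 * reallocate only when a request exceeds current capacity. */
#include <stdlib.h>

struct conn_buf {
    char  *data;
    size_t cap;        /* current capacity */
    size_t hwm;        /* largest request seen: sizing hint */
    int    allocs;     /* how many times we actually (re)allocated */
};

/* Ensure the buffer can hold `need` bytes, reusing it when possible. */
char *buf_reserve(struct conn_buf *b, size_t need)
{
    if (need > b->hwm)
        b->hwm = need;
    if (need > b->cap) {
        free(b->data);                 /* old contents are per-request */
        b->cap  = b->hwm * 2;          /* overshoot to absorb future growth */
        b->data = malloc(b->cap);
        b->allocs++;
    }
    return b->data;                    /* fast path: no malloc at all */
}
```

Tracking the high-water mark and over-allocating means the buffer converges quickly on a size that makes realloc-style churn disappear from the fast path.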
-
State of the art
[chart: performance (200,000 to 2,000,000 TPS) vs. average record data size (92 to 3723 bytes), for System A (June 2010), System B (Jan 2011 and Jun 2011), System C (Dec 2011) and System D (Mar 2012)]
-
State of the art
[chart: performance from 600,000 to 2,000,000 TPS on a Xeon 5675, 6-core, 3 GHz, with and without HTT; FreeBSD 9-stable, March 2012]
-
...under certain conditions
The optimal, fast path (zero-memcpy, zero-malloc, optimal buffers)
Which is actually less important; we know that these algorithms are fast...
Using Unix domain sockets
Much more important
FreeBSD's network stack (the TCP path) is currently basically non-scalable to SMP(?)
The UDP path is more scalable (WIP)
-
TCP vs Unix sockets
[chart: 0 to 1,000,000 TPS vs. average record size (104 to 4204 bytes), for System D running mdcached, memcached and redis]
-
NUMA Effects
[chart: 0 to 2,000,000 TPS vs. number of clients (1 to 40), for System D cpuset(4)-bound to a single socket, with cpuset(4) placing client & server on separate sockets, and with no cpuset(4), only ULE]
It's unlikely that better NUMA support would help at all...
-
Scalability wrt number of records
[chart: 300,000 to 440,000 TPS vs. number of records (1,000 to 10,000,000) for mdcached]
-
Bells & whistles
Binary protocol (endian-dependent)
Extensive atomic operation set
cmpset, add, fetchadd, readandclear
Stack operations
Every tag (tk, tv) identifies a stack
Push and pop operations on records
Periodic data dumps / checkpoints
Cache pre-warm (load from file)
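The semantics of the atomic primitives listed above can be modelled with C11 atomics. In the real server these operate on cached record values over the protocol; the function names here only mirror the slide, not the actual API.

```c
/* Sketch of cmpset / fetchadd / readandclear semantics via C11 atomics. */
#include <stdatomic.h>
#include <stdint.h>

/* fetchadd: add `delta`, return the previous value */
int64_t counter_fetchadd(_Atomic int64_t *v, int64_t delta)
{
    return atomic_fetch_add(v, delta);
}

/* cmpset: set to `newv` only if the current value equals `expect`;
 * returns 1 on success, 0 otherwise */
int counter_cmpset(_Atomic int64_t *v, int64_t expect, int64_t newv)
{
    return atomic_compare_exchange_strong(v, &expect, newv);
}

/* readandclear: return the current value and reset it to zero */
int64_t counter_readandclear(_Atomic int64_t *v)
{
    return atomic_exchange(v, 0);
}
```

Server-side atomics like these avoid the memcached-style round trip of fetching an ASCII value, atoi()-ing it, and writing it back.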
-
Usage ideas
Application data cache, database cache
Semantically tag cached records
Efficient retrieval and expiry (deletion)
Primary data storage
High-performance ephemeral storage
Optional periodic checkpoints
Data sharing between app server nodes
Esoteric: distributed lock manager, stack
-
Bullet Cache
Balancing speed and usability in a cache server
http://www.sf.net/projects/mdcached
Ivan Voras