Scaling Memcache at Facebook

Presenter: Rajesh Nishtala ([email protected])
Co-authors: Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani
Infrastructure Requirements for Facebook
1. Near real-time communication
2. Aggregate content on-the-fly from multiple sources
3. Be able to access and update very popular shared content
4. Scale to process millions of user requests per second
Design Requirements
Support a very heavy read load
• Over 1 billion reads / second
• Insulate backend services from high read rates

Geographically distributed

Support a constantly evolving product
• System must be flexible enough to support a variety of use cases
• Support rapid deployment of new features

Persistence handled outside the system
• Support mechanisms to refill after updates
Need more read capacity

• Two orders of magnitude more reads than writes
• Solution: Deploy a few memcache hosts to handle the read capacity
• How do we store data?
• Demand-filled look-aside cache
• Common case is data is available in the cache

[Diagram: Web Server, Memcache, Database. 1. Get (key); 2. Miss (key); 3. DB lookup; 4. Set (key)]
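The demand-filled look-aside read flow above can be sketched in a few lines. This is a toy model: plain dicts stand in for memcache and the database, not real clients.

```python
cache = {}                     # stands in for memcache
db = {"user:42": "Rajesh"}     # stands in for the database

def lookaside_get(key):
    value = cache.get(key)     # 1. Get(key) from memcache
    if value is None:          # 2. Miss
        value = db.get(key)    # 3. DB lookup
        if value is not None:
            cache[key] = value # 4. Set(key) so the next read hits
    return value
```

The common case is the first line: the key is already cached and the database is never touched.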
Handling updates

• Memcache needs to be invalidated after DB write
• Prefer deletes to sets
• Idempotent
• Demand filled
• Up to web application to specify which keys to invalidate after database update

[Diagram: Web Server, Database, Memcache. 1. Database update; 2. Delete]
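The delete-based invalidation above, sketched in the same toy style (dicts standing in for memcache and the database):

```python
cache = {"user:42": "old"}
db = {"user:42": "old"}

def update(key, value):
    db[key] = value       # 1. Database update (source of truth)
    cache.pop(key, None)  # 2. Delete from memcache; idempotent, and the
                          #    next read demand-fills the fresh value
```

Deleting twice is harmless, which is why deletes are preferred over sets: replaying an invalidation can never install a wrong value.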
Memcache
While evolving our system, we prioritize two major design goals.
• Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered.
• We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.
Roadmap

• Single front-end cluster
–Read heavy workload
–Wide fanout
–Handling failures
• Multiple front-end clusters
–Controlling data replication
–Data consistency
• Multiple regions
–Data consistency
Single front-end cluster

• Reducing latency: focusing on the memcache client
–serialization
–compression
–request routing
–request batching
• Parallel requests and batching
• Client-server communication:
–embed the complexity of the system into a stateless client rather than in the memcached servers
–we rely on UDP for get requests, TCP via mcrouter for set and delete requests
• Incast congestion:
–memcache clients implement flow-control mechanisms to limit incast congestion
• Reducing load
–Leases: a 64-bit token that memcache hands to a client along with a read miss; it can be used to solve two problems: stale sets and thundering herds.
–Stale sets: a stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached. This can occur when concurrent updates to memcache get reordered.
–Thundering herd: happens when a specific key undergoes heavy read and write activity.
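A toy model of the lease mechanism described above. The real memcached wire protocol and token format differ; this only illustrates how issuing one token per miss suppresses both problems.

```python
import itertools

class LeaseServer:
    """Toy lease-enabled cache server (not the real memcached protocol)."""

    def __init__(self):
        self.data = {}
        self.leases = {}                   # key -> outstanding lease token
        self._tokens = itertools.count(1)  # stand-in for 64-bit tokens

    def get(self, key):
        """Return (value, lease_token); only one miss per key gets a token."""
        if key in self.data:
            return self.data[key], None
        if key in self.leases:             # someone is already refilling:
            return None, None              # caller should wait and retry
        token = next(self._tokens)         # issue a lease with the miss
        self.leases[key] = token
        return None, token

    def set(self, key, value, token):
        """Accept the set only if the lease is still valid."""
        if self.leases.get(key) != token:  # token voided by a delete:
            return False                   # reject the stale set
        self.data[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)         # invalidation voids the lease
```

A delete between the miss and the set voids the token, so the reordered (stale) set is rejected; concurrent misses on the same key get no token, which throttles the herd.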
• Stale values:
–When a key is deleted, its value is transferred to a data structure that holds recently deleted items, where it lives for a short time before being flushed. A get request can return a lease token or data that is marked as stale.
• Memcache pools: we designate one pool (named wildcard) as the default and provision separate pools for keys whose residence in wildcard is problematic.
• Replication within pools. We choose to replicate a category of keys within a pool when:
–the application routinely fetches many keys simultaneously
–the entire data set fits in one or two memcached servers
–the request rate is much higher than what a single server can manage
• Handling failures
–There are two scales at which we must address failures:
• a small number of hosts are inaccessible due to a network or server failure
• a widespread outage that affects a significant percentage of the servers within the cluster
–Gutter: a small pool of idle machines that takes over for the few failed hosts, serving entries with short expiration times
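A sketch of the Gutter fallback path, with hypothetical `DownClient`/`PoolClient` stubs standing in for real memcached clients:

```python
class DownClient:
    """Stand-in for a memcached client whose host is unreachable."""
    def get(self, key):
        raise ConnectionError("memcached host unreachable")

class PoolClient:
    """Stand-in for a healthy memcached pool client."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, ttl):
        self.store[key] = value  # TTL is ignored in this sketch

def get_with_gutter(key, primary, gutter, db, gutter_ttl=10):
    try:
        return primary.get(key)        # normal path
    except ConnectionError:
        value = gutter.get(key)        # small dedicated fallback pool
        if value is None:
            value = db.get(key)        # refill the gutter on a miss
            gutter.set(key, value, ttl=gutter_ttl)  # short TTL bounds staleness
        return value
```

Falling back to a dedicated pool, rather than rehashing the dead host's keys across the remaining servers, avoids cascading overload on the survivors.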
Multiple front-end clusters: Region

• Region:
–We split our web and memcached servers into multiple front-end clusters. These clusters, along with a storage cluster that contains the databases, define a region.
–We trade replication of data for more independent failure domains, tractable network configuration, and a reduction of incast congestion.
Databases invalidate caches

• Cached data must be invalidated after database updates
• Solution: tail the MySQL commit log and issue deletes based on transactions that have been committed
• Allows caches to be resynchronized in the event of a problem

[Diagram: a McSqueal process on the MySQL storage server tails the commit log and fans deletes out to the memcache (MC) hosts in each front-end cluster]
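The McSqueal idea can be sketched as a function that walks committed transactions and batches the embedded invalidation keys per destination. The `(sql, keys)` pairs are a stand-in for the real binlog format, and `route` is a hypothetical routing function.

```python
def mcsqueal(commit_log, route):
    """Tail committed transactions and batch the memcache deletes that the
    web application embedded in each statement, grouped by destination so
    downstream routers see one packet per batch instead of one per key."""
    batches = {}
    for sql, keys in commit_log:   # only *committed* transactions appear here
        for key in keys:
            dest = route(key)      # e.g. which memcache router owns this key
            batches.setdefault(dest, []).append(key)
    return batches                 # one delete batch per destination
```

Because the deletes are derived from the commit log itself, replaying the log resynchronizes caches after a failure; a web server's own deletes cannot be replayed that way.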
Invalidation pipeline: too many packets

• Aggregating deletes reduces packet rate by 18x
• Makes configuration management easier
• Each stage buffers deletes in case a downstream component is down

[Diagram: McSqueal processes on each DB send batched deletes through a tier of memcache routers, which fan them out to the MC hosts in each front-end cluster]
• Regional pools
–We can reduce the number of replicas by having multiple front-end clusters share the same set of memcached servers. We call this a regional pool.
• Cold cluster warmup
–A newly provisioned front-end cluster starts with an empty cache and a poor hit rate. A system called Cold Cluster Warmup mitigates this by allowing clients in the "cold cluster" to retrieve data from the "warm cluster" rather than the persistent storage.
Across regions: geographically distributed clusters

• One region holds the master databases; the other regions contain read-only replicas.
• Advantages:
–putting web servers closer to end users can significantly reduce latency
–geographic diversity can mitigate the effects of events such as natural disasters or massive power failures
–new locations can provide cheaper power and other economic incentives
[Diagram: geographically distributed clusters — one master region connected to read-only replica regions]
Writes in a non-master region: database updates go directly to the master

• Race between DB replication and a subsequent DB read:
1. Web server writes to the master DB
2. Web server deletes the key from memcache
3. Another web server's get misses and reads from the local replica DB, racing with MySQL replication
4. It sets a potentially stale value into memcache
Remote markers: set a special flag that indicates whether a race is likely

1. Set remote marker
2. Write to master
3. Delete from memcache
4. MySQL replication
5. Delete remote marker

Read miss path:
If marker set
read from master DB
else
read from replica DB
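The remote-marker read-miss path can be sketched as follows. Dicts stand in for memcache and the two databases, and the `:remote` key suffix is an assumed naming scheme for the marker, not the real one.

```python
def read(key, cache, replica_db, master_db):
    """On a miss, check the remote marker: if set, replication may not have
    caught up, so serve the miss from the master region instead of the
    local (possibly stale) replica."""
    value = cache.get(key)
    if value is not None:
        return value                               # cache hit: no race risk
    if cache.get(key + ":remote") is not None:
        value = master_db.get(key)                 # marker set: race likely
    else:
        value = replica_db.get(key)                # safe to use local replica
    cache[key] = value                             # demand-fill as usual
    return value
```

The marker trades extra cross-region read latency for a lower probability of caching a stale value.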
Single Server Improvements

• Performance optimizations:
–allow the internal hashtable to expand automatically (rehash)
–replace the single global lock under multithreading with finer-grained locks
–give each thread its own UDP port to avoid contention
• Adaptive slab allocator
–Allows memory to move between slab classes. If the slab class of the item about to be evicted has recently been used more than 20% above the average, memory is taken from the least recently used slab class and handed to the slab class that needs it. The overall idea resembles a global LRU.
• The transient item cache
–Some keys are given short expiration times; if such a key has expired but has not yet reached the tail of the LRU, its memory is wasted in the meantime. These keys can therefore be placed in a circular buffer of linked lists (effectively a circular array whose elements are linked lists, indexed by expiration time); every second, the items in the bucket at the head of the buffer are deleted.
• Software upgrades
–Memcache data is kept in System V shared memory regions, which makes software upgrades on a machine convenient.
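The transient item cache above can be sketched as a circular buffer of per-second buckets. This is a toy model: the real implementation uses linked lists of items inside memcached, and the bucket sets here only track keys.

```python
class TransientItemCache:
    """Circular buffer of expiry buckets: short-TTL keys are placed in the
    bucket for their expiry second, and each tick evicts one bucket's keys
    proactively instead of waiting for them to drift to the LRU tail."""

    def __init__(self, horizon=60):
        self.buckets = [set() for _ in range(horizon)]
        self.horizon = horizon
        self.now = 0           # index of the current head bucket

    def add(self, key, ttl):
        # index by expiry time, wrapping around the circular buffer
        self.buckets[(self.now + ttl) % self.horizon].add(key)

    def tick(self, store):
        """Advance one second and delete the expired bucket's keys."""
        self.now = (self.now + 1) % self.horizon
        for key in self.buckets[self.now]:
            store.pop(key, None)   # reclaim memory immediately
        self.buckets[self.now].clear()
```

Eviction cost is spread evenly: each one-second tick touches only the keys expiring in that second.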
Memcache Workload
• Fanout
• Response size
• Pool Statistics
• Invalidation Latency
Conclusion
• Separating cache and persistent storage systems allows us to independently scale them
• Features that improve monitoring, debugging and operational efficiency are as important as performance
• Managing stateful components is operationally more complex than stateless ones. As a result, keeping logic in a stateless client helps iterate on features and minimize disruption.
• Simplicity is vital