Allocators for Compressed Pages
Vitaly Wool
Intro: Compressed memory allocator
• It’s an allocator, Cap.
  – allocates memory according to user’s demands
• It’s designed to store compressed data
  – chunks of arbitrary length
    • usually quite small, way less than a page
    • ordinary kernel allocator would be a waste of space
  – it doesn’t compress anything itself
Okay what purpose does all that serve?
Swapping
• using secondary storage to store and retrieve data
  – secondary storage is usually a hard disk or a flash device
  – saves memory by pushing rarely used pages out
• trade memory for performance?
  – reading and writing pages may be quite slow
Swapping optimization
• use RAM to cache swapped-out pages
  – but what’s the gain then?
• compress swapped-out pages
• trade performance for memory?
  – bigger cache means better performance
  – now we can be more flexible
Swapping and compression
• zswap: compressed write-back cache
  – compresses swapped-out pages and moves them into a pool
  – when the pool is full enough, pushes the compressed pages to the secondary storage
  – pages are read back directly from the storage
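
The write-back idea can be sketched as a toy model in user space. This is not the kernel's zswap code: the class name, `pool_limit`, and the dictionary standing in for the swap device are all made up for illustration, and the sketch simply writes evicted pages back uncompressed.

```python
import zlib
from collections import OrderedDict

class ToyCompressedCache:
    """Toy write-back cache in the spirit of zswap (not the kernel code)."""

    def __init__(self, pool_limit):
        self.pool_limit = pool_limit   # max bytes of compressed data kept in RAM
        self.pool = OrderedDict()      # page id -> compressed bytes, oldest first
        self.pool_bytes = 0
        self.backing = {}              # stands in for the swap device

    def _drop(self, page_id):
        """Forget any previous copy of this page."""
        blob = self.pool.pop(page_id, None)
        if blob is not None:
            self.pool_bytes -= len(blob)
        self.backing.pop(page_id, None)

    def store(self, page_id, data):
        """Compress a swapped-out page and put it in the pool."""
        self._drop(page_id)
        blob = zlib.compress(data)
        self.pool[page_id] = blob
        self.pool_bytes += len(blob)
        # Pool too full: write the oldest entries back to the backing store.
        while self.pool_bytes > self.pool_limit and self.pool:
            old_id, old_blob = self.pool.popitem(last=False)
            self.pool_bytes -= len(old_blob)
            self.backing[old_id] = zlib.decompress(old_blob)

    def load(self, page_id):
        """Return the page contents, from the pool if cached, else from storage."""
        if page_id in self.pool:
            return zlib.decompress(self.pool[page_id])
        return self.backing[page_id]

cache = ToyCompressedCache(pool_limit=128)
for i in range(32):
    cache.store(i, bytes([i]) * 4096)   # 32 highly compressible 4 KiB "pages"
print(len(cache.pool), "pages in the pool,", len(cache.backing), "written back")
```

The most recently stored pages stay compressed in RAM, while older ones end up on the (simulated) storage once the pool exceeds its limit.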
Allocator for zswap?
• zbud: the first compressed data allocator
  – stores up to 2 objects per page
    • one bound to the beginning
    • one bound to the end
  – actual compression ratio may be quite low
    • imagine a large number of chunks sized 2K+ε
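
A back-of-the-envelope model shows why 2K+ε chunks hurt. The sketch below assumes 4 KiB pages and ignores the allocator's per-page header; `objects_per_page` and `effective_ratio` are illustrative names, not zbud functions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def objects_per_page(obj_size, max_slots=2):
    """How many equally sized objects fit in one page, capped at max_slots.

    max_slots=2 models the two-objects-per-page limit; the per-page
    header is ignored here for simplicity.
    """
    if not 0 < obj_size <= PAGE_SIZE:
        raise ValueError("object must fit in a page")
    return min(max_slots, PAGE_SIZE // obj_size)

def effective_ratio(obj_size, max_slots=2):
    """Original pages stored per physical page used.

    Each object is one compressed page, so this equals the number of
    objects that end up sharing a physical page.
    """
    return float(objects_per_page(obj_size, max_slots))

for size in (1024, 2048, 2049):
    print(size, effective_ratio(size))
# prints:
# 1024 2.0
# 2048 2.0
# 2049 1.0
```

A page that compresses to 2049 bytes, almost 2x, still occupies a whole physical page on its own, while a page that compresses 4x to 1024 bytes cannot do better than sharing a page with one other object.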
zsmalloc
• came as an alternative to zbud
  – addresses the situation with 2K+ε sized objects
  – allocates objects contiguously within physically non-contiguous pages
    • objects may span across several pages
  – high compression ratio in the beginning
  – hard to mitigate in-page fragmentation over time as objects are allocated and released
Compressed allocator API
• 2 allocators used by zswap and doing the same thing differently
  – That calls for unification
• zpool: a common compressed allocator API
  – zswap is converted to use zpool
  – zbud and zsmalloc both implement zpool API
Quite boring so far...
What happened next?
ZRAM: compressed RAM disk
• RAM block device with on-the-fly compression/decompression
  – uses zsmalloc directly via its API
• Alternative to zswap for embedded devices
  – no backing storage necessary
  – pages swapped to compressed RAM storage
Can’t do zram with zbud?!
[Diagram: zswap can use zbud or zsmalloc; zram can use only zsmalloc]
ZRAM over zpool API
• Pros
  – unification and versatility
• Cons
  – none
• Patches ready
• Several attempts to mainline the patches
  – blocked by the maintainer
ZRAM over zpool API: test with zbud
• No performance degradation over time
  – stable and sustainable operation
• Peak performance lower than with zsmalloc
  – spinlocks don’t scale well
• Low compression ratio
  – 1.5x - 1.7x in real life scenarios
  – not enough to justify ZRAM for embedded
So what if we modify zbud to hold up to 3 objects?
z3fold: new kid on the block
• spun off zbud
• 3 objects per page instead of 2
• can handle PAGE_SIZE allocations
• only implements zpool API
  – no custom API here
• work started after ELC 2016 in San Diego
  – in the mainline since 4.8
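
What a third slot per page buys can be illustrated with a small first-fit packing model. This is only a sketch under simplifying assumptions (4 KiB pages, no per-page header, no chunk granularity), and `pages_used` is a made-up helper, not part of z3fold.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def pages_used(sizes, max_slots):
    """First-fit packing: each page holds at most max_slots objects
    whose sizes sum to no more than PAGE_SIZE (headers ignored)."""
    pages = []  # each entry: [free_bytes, free_slots]
    for size in sizes:
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("object must fit in a page")
        for page in pages:
            if page[0] >= size and page[1] > 0:
                page[0] -= size
                page[1] -= 1
                break
        else:
            # No existing page has room: take a fresh one.
            pages.append([PAGE_SIZE - size, max_slots - 1])
    return len(pages)

sizes = [1300] * 300            # 300 pages that each compressed to about 1.3 KB
two = pages_used(sizes, 2)      # two-objects-per-page limit
three = pages_used(sizes, 3)    # three-objects-per-page limit
print(two, three)                            # 150 100
print(len(sizes) / two, len(sizes) / three)  # 2.0 3.0
```

With objects this small, the two-slot limit caps the effective ratio at 2x no matter how well the data compresses, while three slots let the same data fit in a third fewer pages.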
z3fold: good for both ZRAM and zswap
• for ZRAM
  – supports up to page size allocations
  – low latency operation
  – good compression ratio
• for zswap
  – supports eviction unlike zsmalloc
  – higher compression ratio than zbud
Ok let’s do the fun part.
Comparisons!
Currently allowed combinations
[Diagram: zswap works with zbud, zsmalloc and z3fold; zram works with zsmalloc only]
Compression under stress (4.8)
[Chart: compression ratio (y-axis, 0 to 4) over hours (x-axis, 0 to 9) for zsmalloc, zbud and z3fold]
Random read/write (4.8)
[Chart: throughput in kb/s (y-axis, 0 to 200) versus number of threads (x-axis, 0 to 40) for zsmalloc, zbud and z3fold]
Conclusions so far
• z3fold provides good compression ratio
• z3fold doesn’t scale well to larger number of CPUs/threads
z3fold: profiling
• using perf while running fio
  – identify bottlenecks under stress load
• using perf while Android LMK is triggered
  – how z3fold operation affects user experience
z3fold: profiling results
• spinlocks are the main obstacle to scalability
  – the “big” spinlock that protects “unbuddied” lists is the biggest one
• using perf while Android LMK is triggered
  – how z3fold operation affects user experience
z3fold: per-page locks
• Keep “big” spinlock for list operations
• Have “small” spinlocks to protect in-page operation
  – this goes well with async in-page layout optimization
• in mainline kernel since 4.11
Random read/write (4.12)
[Chart: throughput in kb/s (y-axis, 0 to 200) versus number of threads (x-axis, 0 to 35) for zsmalloc, zbud, z3fold and z3fold 4.12]
z3fold: lockless lists (llist)
• Idea: implement unbuddied lists using llist
  – Should improve scalability with less locking needed
• Unfortunately llist wasn't a fit
  – Can't do a llist_del
    • Complicates unbuddied lists manipulation up to the point where it makes no sense
z3fold: per-CPU “unbuddied” lists
• z3fold can operate only on this CPU's list
  – Reduces contention on spin lock
  – Speeds up search
• That can have adverse effect on ratio
  – Z3fold header gets bigger
  – Worse selection
  – More memory for multiple lists
• Will get into 4.14
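
The "worse selection" effect can be shown with simple arithmetic. The sketch assumes every object is small enough that three always fit in one page, and that a CPU never looks at another CPU's list; the function names and the load figures are invented for illustration.

```python
import math

SLOTS_PER_PAGE = 3  # z3fold stores up to three objects per page

def pages_with_shared_list(objects_by_cpu):
    """One pool-wide unbuddied list: any CPU can fill any partly used page."""
    return math.ceil(sum(objects_by_cpu) / SLOTS_PER_PAGE)

def pages_with_per_cpu_lists(objects_by_cpu):
    """Per-CPU lists: a CPU only sees partly used pages on its own list."""
    return sum(math.ceil(n / SLOTS_PER_PAGE) for n in objects_by_cpu)

load = [2, 2, 2, 2]  # objects allocated on each of four CPUs (made-up numbers)
print(pages_with_shared_list(load))    # 3
print(pages_with_per_cpu_lists(load))  # 4
```

Each CPU leaves one slot unused that another CPU could have filled, so the same eight objects take four pages instead of three. That is the price paid for reduced lock contention.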
Random read/write (4.14-rc4)
[Chart: throughput in kb/s (y-axis, 0 to 200) versus number of threads (x-axis, 0 to 35) for z3fold 4.14, z3fold and z3fold 4.12]
z3fold: bit locks
• Z3fold header size better be 1 chunk
  – Now 2
• Bit locks may be used to mitigate bigger header
  – Slightly worse performance
  – Evaluation in progress
Conclusions
• Z3fold is still a young allocator
• Still z3fold already outperforms other allocators
• Z3fold is a good fit both for zswap and ZRAM
• We need to push ZRAM to use zpool