ZFS ARC Cache and ZIL - wordpress.com
Post on 29-May-2018
Sun Microsystems 22
ZFS Readzilla und Logzilla
Claudia Hildebrandt, System Engineer/Consultant, Sun Microsystems GmbH
Sun Microsystems 3
Hybrid Storage Pools
• Hybrid Storage Pools
> Pools with SSDs
> SSDs as write flash accelerators and separate log devices for the ZIL – aka Logzilla
> SSDs as read flash accelerators – aka Readzilla
• OpenStorage at this time
> 18 GB Logzilla
> 100 GB Readzilla
Sun Microsystems 4
Logzilla
• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous requirements
• By default the ZIL uses allocated blocks within the main storage pool
• Better performance with a separate ZIL (slog) – the ZIL is allocated on separate devices such as a dedicated disk, SSDs or NVRAM
• # zpool add <pool_name> log <log_device1> <log_device2>
• Note: use mirrored log devices; RAID-Z is not supported
Sun Microsystems 5
Readzilla
• aka L2ARC, a secondary caching tier between the DRAM cache (the ZFS ARC) and disk
• ZFS ARC – ZFS adjustable replacement cache
> Stores ZFS data and metadata from all active storage pools in physical memory, by default as much as possible except 1 GB of RAM
> The ZFS ARC consumes free memory as long as there is free memory, and releases it to the system only when free memory is requested by another application
> With Readzilla the information in RAM can be moved to disk and cached as long as there is free space
Sun Microsystems 6
ZFS – Features
"All or nothing"
"Always consistent"
"Pooled storage model"
"Self healing"
Sun Microsystems 8
ARC Overview and Purpose
• ZFS does not use the page cache like UFS (exception: mmap(2))
• Adaptive Replacement Cache
> Based on Megiddo & Modha (IBM) at FAST 2003 – ARC: A Self-Tuning, Low Overhead Replacement Cache
> ZFS ARC differs slightly in implementation
– ZFS: variable sized cache and contents, non-evictable contents
• DMU uses the ARC to cache data objects based on DVA
• 1 ARC per system
• 2 LRU (Least Recently Used) caches plus history
> Recency (MRU) and frequency (MFU)
– ARC data survives a large file scan
> 1c cache and 1c history (c = cache size)
Sun Microsystems 9
Adjustable Replacement Cache (ARC)
• Central point of memory management for the SPA
> Ability to evict buffers as a result of memory pressure
• Dynamic, adaptive and self-tuning
> Cache adjusts based on I/O workload
• Scan-resistant
Sun Microsystems 10
6 states of arc_buf
• ARC_anon:
> Buffers not associated with a DVA
> They hold dirty block copies before being written to storage
> They are considered as part of ARC_mru
• ARC_mru: recently used and currently cached
• ARC_mru_ghost: recently used, no longer in cache
• ARC_mfu: frequently used and currently cached
• ARC_mfu_ghost: frequently used, no longer cached
• ARC_l2c_only: exists only in the L2ARC
• Ghost caches only contain ARC buffer headers
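The six states above can be summarised as a small table; this is an illustrative sketch, not the kernel's representation. The "header only" flag marks the ghost caches, which keep no data:

```python
# Sketch of the six arc_buf states listed above; the dict layout is
# purely illustrative, not the actual kernel structure.
ARC_STATES = {
    "anon":      {"cached": True,  "header_only": False},  # dirty, no DVA yet
    "mru":       {"cached": True,  "header_only": False},
    "mru_ghost": {"cached": False, "header_only": True},   # header, no data
    "mfu":       {"cached": True,  "header_only": False},
    "mfu_ghost": {"cached": False, "header_only": True},   # header, no data
    "l2c_only":  {"cached": False, "header_only": False},  # data in L2ARC only
}
print(sum(s["header_only"] for s in ARC_STATES.values()))  # 2 ghost states
```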
Sun Microsystems 11
ARC Diagram of Caches
• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the ghost caches together span twice the cache size c
• ARC adapts c (cache size) and p (used pages in MRU) in response to workloads
• ARC parameters are initialised to:
arc_c_min = MAX(1/32 of all mem, 64 MB)
arc_c_max = MAX(3/4 of all mem, all but 1 GB)
arc_c = MIN(1/8 physmem, 1/8 VM size)
arc_p = arc_c / 2
[Diagram: ARC of size c split into an MRU list (size p) and an MFU list (size c - p), each backed by a ghost cache]
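The initialisation rules above can be sketched in Python (a paraphrase of the quoted formulas, not the actual OpenSolaris C code; the 4 GB memory and VM figures are assumed for illustration):

```python
# Sketch of the ARC parameter initialisation rules quoted on this slide,
# evaluated for an assumed machine with 4 GB of RAM and 4 GB of kernel VM.
MB = 1024 * 1024
GB = 1024 * MB

def arc_init(physmem, vmem):
    arc_c_min = max(physmem // 32, 64 * MB)          # MAX(1/32 mem, 64 MB)
    arc_c_max = max(physmem * 3 // 4, physmem - GB)  # MAX(3/4 mem, all but 1 GB)
    arc_c = min(physmem // 8, vmem // 8)             # MIN(1/8 physmem, 1/8 VM)
    arc_p = arc_c // 2
    return arc_c_min, arc_c_max, arc_c, arc_p

c_min, c_max, c, p = arc_init(4 * GB, 4 * GB)
print(c_min // MB, c_max // MB, c // MB, p // MB)  # 128 3072 512 256
```

The 3072 MB maximum matches the ~3 GB target size shown by arc_summary.pl on the 4 GB machine two slides below.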
Sun Microsystems 12
How it works
[Diagram: ARC = c; MRU = p, MFU = c - p; each list runs from its MRU end to its LRU end, with MRU ghost and MFU ghost caches behind them]
Sun Microsystems 13
claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl
System Memory:
    Physical RAM: 4052 MB
    Free Memory:  2312 MB
    LotsFree:     63 MB
ZFS Tunables (/etc/system):
ARC Size:
    Current Size:            772 MB (arcsize)
    Target Size (Adaptive):  3039 MB (c)
    Min Size (Hard Limit):   379 MB (zfs_arc_min)
    Max Size (Hard Limit):   3039 MB (zfs_arc_max)
ARC Size Breakdown:
    Most Recently Used Cache Size:   50%  1519 MB (p)
    Most Frequently Used Cache Size: 49%  1519 MB (c-p)
Sun Microsystems 14
Data is read
[Diagram: an arc_read request inserts buffer A into the cache]
Sun Microsystems 15
Data buffer is in MRU
[Diagram: buffer A now sits in the MRU list]
Sun Microsystems 16
Same data buffer read again
[Diagram: a second arc_read request hits buffer A in the MRU list]
Sun Microsystems 17
Data buffer moves to MFU
[Diagram: buffer A is promoted from the MRU list to the MFU list]
Sun Microsystems 18
Cache fills up
[Diagram: new buffers B-F fill the MRU list while A stays in MFU]
Sun Microsystems 19
MRU data buffer is read again
[Diagram: an arc_read request hits buffer D, still in the MRU list]
Sun Microsystems 20
MFU list is dynamically adjusted
[Diagram: buffer D moves to the MFU list, which adjusts its size]
Sun Microsystems 21
Data buffer in MFU is read again
[Diagram: an arc_read request hits buffer B, already in the MFU list]
Sun Microsystems 22
Data buffer moves to 1st position
[Diagram: buffer B moves to the head of the MFU list]
Sun Microsystems 23
ARC Caches in Action
• If evicting during cache insert, then:
> 1. Inserting in MRU & MRU < p then arc_evict(MFU)
> 2. Inserting in MRU & MRU > p then arc_evict(MRU)
> 3. Inserting in MFU & MFU < (c-p) then arc_evict(MRU)
> 4. Inserting in MFU & MFU > (c-p) then arc_evict(MFU)
• Buffers change state (i.e. cache) in response to access
> If current state is MRU, and at least ARC_MINTIME (62 ms) since last access, then new state is MFU
> All other repeated accesses result in a state of MFU
– Exception: prefetching in MRU or Ghosts results in MRU
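The access-driven state changes above can be sketched as follows. Only ARC_MINTIME (62 ms) comes from the slide; the function shape and state names are illustrative, not the kernel code:

```python
# Hedged sketch of the buffer state transitions described on this slide.
ARC_MINTIME = 0.062  # seconds, per the slide

def next_state(state, now, last_access, prefetch=False):
    """Return the new cache state of a buffer on a repeated access."""
    if prefetch and state in ("mru", "mru_ghost", "mfu_ghost"):
        return "mru"                 # prefetching results in MRU
    if state == "mru":
        # promote to MFU only if ARC_MINTIME has passed since last access
        return "mfu" if now - last_access >= ARC_MINTIME else "mru"
    return "mfu"                     # all other repeated accesses -> MFU

print(next_state("mru", now=1.0, last_access=0.9))   # mfu
print(next_state("mru", now=1.0, last_access=0.99))  # mru
```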
Sun Microsystems 24
Least recency data buffer evicted
[Diagram: an arc_read request for G evicts the least recently used MRU buffer E; its header moves to the MRU ghost cache]
Sun Microsystems 25
Least frequently used data buffer evicted
[Diagram: the least frequently used MFU buffer is evicted; its header moves to the MFU ghost cache]
Sun Microsystems 26
ARC Adapting and Adjusting
• Adapting... adapting to workload
> When adding new content:
– If (hit in MRU_Ghost) then increase p
– If (hit in MFU_Ghost) then decrease p
– If (arc_size within (2*maxblocksize) of c) then increase c
• Adjusting... adjusting contents to fit
> When shrinking or reclaiming:
– If (MRU > p) then arc_evict(MRU)
– If (MRU+MRU_Ghost > c) then arc_evict(MRU_Ghost)
– If (arc_size > c) then arc_evict(MFU)
– If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)
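The adaptation of p can be sketched like this. The direction of each adjustment comes from the rules above; the step size of one unit is an assumption for clarity:

```python
# Illustrative sketch of how the MRU target p reacts to ghost-cache hits.
def adapt_p(p, c, hit, step=1):
    """Shift the MRU target p inside [0, c] after a ghost-cache hit."""
    if hit == "mru_ghost":    # recency was under-served: grow the MRU target
        p = min(c, p + step)
    elif hit == "mfu_ghost":  # frequency was under-served: shrink it
        p = max(0, p - step)
    return p

p = 4
p = adapt_p(p, c=8, hit="mru_ghost")  # p -> 5
p = adapt_p(p, c=8, hit="mfu_ghost")  # p -> 4
print(p)  # 4
```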
Sun Microsystems 27
Data buffer not in cache
[Diagram: an arc_read request for a buffer found in neither cache nor ghost lists; it is inserted into the MRU list]
Sun Microsystems 28
ARC adaptive self-tuning
[Diagram: ghost-cache hits shift the target size p between the MRU and MFU lists]
Sun Microsystems 29
ARC is too small
[Diagram: repeated ghost-cache hits show the working set no longer fits the cache size c]
Sun Microsystems 30
ARC Reclaiming
• Reclaim... reclaiming kernel memory
> Every second (or sooner if adapting, or on kmem callback)
> Check VM parameters: freemem, lotsfree, needfree, desfree
> If required:
– Set arc_no_grow – suspends ARC growth
– Set aggressive reclaim policy – triggers ARC shrink
– Shrink by MAX(1/32 of current size, VM needfree) down to arc_min
– Call arc_adjust() to adjust (i.e. evict) cache contents to the new sizes
– Call kmem_cache_reap_now() on ZIO buffers
• Megiddo/Modha said: "We think of ARC as dynamically, adaptively and continually balancing between recency and frequency - in an online and self-tuning fashion - in response to evolving and possibly changing access patterns"
Sun Microsystems 31
L2ARC
• Enhances the ARC
• Second cache layer between main memory and disk
• Boosts random read performance
• Devices used can be:
> Short-stroked disks
> Solid state disks
> Devices with smaller read latency
Sun Microsystems 32
L2ARC – How does it populate the cache?
• L2ARC attempts to cache data from the ARC before it is evicted
> There is no eviction path from the ARC to the L2ARC
• A kernel thread scans the eviction lists of MFU/MRU and copies buffers to the L2ARC devices
> Refer to l2arc_feed_thread()
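The feed idea above can be sketched very roughly. The function name echoes l2arc_feed_thread(), but everything here, including the headroom of two buffers, is illustrative rather than the real implementation:

```python
# Rough sketch of the L2ARC feed described above: periodically scan the
# tails (eviction ends) of the MRU/MFU lists and copy soon-to-be-evicted
# buffers to the L2ARC device. Copy, not move: there is no eviction path.
def l2arc_feed(mru, mfu, l2arc, headroom=2):
    """Copy up to `headroom` buffers from each list's eviction end."""
    for lst in (mru, mfu):
        for buf in lst[-headroom:]:   # tail = next eviction candidates
            if buf not in l2arc:
                l2arc.append(buf)
    return l2arc

l2 = l2arc_feed(["A", "B", "C"], ["D", "E"], [])
print(l2)  # ['B', 'C', 'D', 'E']
```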
Sun Microsystems 33
L2ARC – Tuning
• The performance of the L2ARC can be tuned for different workloads via a number of tunables:
> l2arc_write_max: max bytes written per interval
> l2arc_noprefetch: skip caching prefetched buffers
> l2arc_headroom: number of max device writes to precache
> l2arc_feed_secs: seconds between L2ARC writes
Sun Microsystems 34
ZIL ZFS Intent Log
• Filesystems buffer write requests and sync them to storage periodically to improve performance
• On power loss a filesystem can be corrupted and/or suffer data loss
> Corruption is solved by TXG commits – always-on-disk consistency
• Applications that require data to be flushed to the stable pool by the time a system call returns use synchronous semantics
> Open the file with O_DSYNC
> Flush buffered contents with fsync(3C)
• The ZIL provides synchronous semantics for ZFS
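From the application side, the synchronous semantics above look like this (a generic POSIX-style sketch in Python; the file name is made up for illustration):

```python
# Minimal sketch: force buffered writes to stable storage with fsync,
# which is what triggers the ZIL flush described above.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "important.dat")  # illustrative name
with open(path, "wb") as f:
    f.write(b"committed record\n")  # lands in the in-memory cache first
    f.flush()                       # flush user-space buffers
    os.fsync(f.fileno())            # block until the data is on stable storage

print(open(path, "rb").read())  # b'committed record\n'
```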
Sun Microsystems 35
ZIL Operational Overview
• The ZFS intent log (ZIL) saves in memory transaction records of system calls that change the file system, with enough information to replay them
• ZFS operations are organized by the DMU as transactions. Whenever a DMU transaction is opened, a ZIL transaction is opened as well
> A log record holds a system call transaction
> A log block can hold many log records, and blocks are chained together
> Log blocks are dynamically allocated and freed as needed
> a) ZIL blocks are freed on TXG commit by the DMU (discard)
> b) or flushed to stable storage due to synchronous requirements, e.g. fsync(3C), O_DSYNC
• In the event of a power failure/panic the transactions are replayed from the ZIL
• 1 ZIL per file system
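The life cycle above (record, commit/discard, replay) can be modelled as a toy class; this is entirely illustrative and not the on-disk format:

```python
# Toy model of the ZIL life cycle described above: log records accumulate
# per mutating system call, a TXG commit discards them, and after a crash
# the uncommitted tail is replayed.
class Zil:
    def __init__(self):
        self.records = []             # in-memory/on-log transaction records

    def log(self, syscall):
        self.records.append(syscall)  # one record per mutating system call

    def txg_commit(self):
        self.records.clear()          # (a) ZIL blocks freed on TXG commit

    def replay(self):
        return list(self.records)     # only uncommitted records get replayed

zil = Zil()
zil.log("write(fd, ...)")
zil.txg_commit()          # everything so far is safely in the pool
zil.log("rename(a, b)")   # crash happens before the next commit
print(zil.replay())       # ['rename(a, b)']
```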
Sun Microsystems 36
ZIL
• ZIL logs reside in memory or on disk
• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log
• ZIL logs are written to disk in variable block sizes
> min. 4 KB, max. 128 KB
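The variable block sizing above can be sketched as a clamp; the 4 KB and 128 KB bounds come from the slide, while rounding up to a 4 KB multiple is an assumption for illustration:

```python
# Sketch of variable ZIL log-block sizing: round the payload up to a
# 4 KB multiple (assumed granularity) and clamp to the quoted 4-128 KB range.
ZIL_MIN, ZIL_MAX = 4 * 1024, 128 * 1024

def zil_block_size(nbytes):
    size = ((nbytes + ZIL_MIN - 1) // ZIL_MIN) * ZIL_MIN  # round up to 4 KB
    return max(ZIL_MIN, min(ZIL_MAX, size))

print(zil_block_size(1), zil_block_size(70_000), zil_block_size(10**6))
# 4096 73728 131072
```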
Sun Microsystems 37
Separate ZIL
• Enables the use of limited-capacity but fast block devices such as NVRAM and SSDs
• Allocating the ZIL from the main pool leads to pool fragmentation
• Performance increase
> Databases and NFS rely on speed and the assurance that data is not lost
Sun Microsystems 39
OpenStorage – 7000 Series
• Logzilla devices – 18 GB flash-based SSDs backed by a supercapacitor
> 10,000 write IOPS
• Readzilla devices – up to six 100 GB read-optimized SSDs
> 50-100 microseconds
Sun Microsystems 41
NEW
• Since 2009.03
> Triple-parity RAID-Z (RAID-Z3)
> Triple mirroring storage profile
> Enhanced iSCSI support
> InfiniBand support
> Improved management
Sun Microsystems 42
Links
Hybrid Storage Pools
http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future
http://blogs.sun.com/ahl/entry/hsp_goes_glossy
Demo: Storage Simulator
http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp?intcmp=2992