Li Wensheng

Chapter 4 File System: File System Cache


Post on 31-Dec-2015





Outline

- Introduction to File Caching
- Page Cache and Virtual Memory System
- File System Performance

Introduction to File Caching

- File caching is one of the most important features of a file system.
- In traditional Unix, file system caching is implemented in the I/O subsystem by keeping copies of recently read or written blocks in a block cache.
- In Solaris, file system caching is implemented in the virtual memory system.


The Old-Style Buffer Cache

Solaris Page Cache

- The page cache is a newer method of caching file system data:
  - developed at Sun as part of the virtual memory system
  - used by System V Release 4 Unix
  - now also used in Linux and Windows NT
- Major differences from the old caching method:
  - it is dynamically sized and can use all memory that is not being used by applications
  - it caches file blocks rather than disk blocks
- The key difference is that the page cache is a virtual file cache rather than a physical block cache.

The Solaris Page Cache

- Buffer cache: for internal file system data, i.e. metadata items (direct/indirect blocks, inodes)
- Page cache: for file data

Block Buffer Cache

- Used for caching inodes and file metadata.
- In old versions of Unix, it was fixed in size by nbuf, which specified the number of 512-byte buffers.
- It is now also dynamically sized:
  - it can grow by nbuf, as needed, until it reaches the ceiling specified by bufhwm
  - by default, it is allowed to grow until it uses 2 percent of physical memory
- The upper limit for the buffer cache can be checked with the sysdef command.

sysdef command

# sysdef
*
* Tunable Parameters
*
 7757824 maximum memory allowed in buffer cache (bufhwm)
    5930 maximum number of processes (v.v_proc)
      99 maximum global priority in sys class (MAXCLSYSPRI)
    5925 maximum processes per user id (v.v_maxup)
      30 auto update time limit in seconds (NAUTOUP)
      25 page stealing low water mark (GPGSLO)
       5 fsflush run rate (FSFLUSHR)
      25 minimum resident memory for avoiding deadlock (MINARMEM)
      25 minimum swapable memory for avoiding deadlock (MINASMEM)

Buffer cache size needed

- 300 bytes per inode
- about 1 MB per 2 GB of files
- Example:
  - a database system (DBS) with 100 files, totaling 100 GB of storage space
  - only 50 GB is accessed at the same time
  - Need:
    - 100 * 300 bytes = 30 KB for inodes
    - 50 / 2 * 1 MB = 25 MB for metadata (direct and indirect blocks)
  - On a system with 5 GB of physical memory, the default bufhwm will be 102 MB
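The sizing arithmetic on this slide can be reproduced with a short sketch. The function and constant names below are illustrative, not Solaris identifiers; the constants are the slide's rules of thumb (300 bytes per inode, about 1 MB of metadata per 2 GB of files, default ceiling of 2 percent of physical memory).

```python
# Rules of thumb from the slide; all names here are illustrative only.
BYTES_PER_INODE = 300
METADATA_MB_PER_2GB = 1.0
DEFAULT_BUFHWM_FRACTION = 0.02  # 2 percent of physical memory


def buffer_cache_needed_mb(num_files, active_gb):
    """Estimate the buffer cache needed, in MB, for inodes plus metadata."""
    inode_mb = num_files * BYTES_PER_INODE / 1_000_000  # 100 files -> 0.03 MB (30 KB)
    metadata_mb = active_gb / 2 * METADATA_MB_PER_2GB   # 50 GB active -> 25 MB
    return inode_mb + metadata_mb


def default_bufhwm_mb(physical_gb):
    """Default buffer cache ceiling in MB, taking 1 GB as 1024 MB."""
    return physical_gb * 1024 * DEFAULT_BUFHWM_FRACTION


if __name__ == "__main__":
    print(f"needed:  {buffer_cache_needed_mb(100, 50):.2f} MB")
    print(f"ceiling: {default_bufhwm_mb(5):.1f} MB")
```

For the slide's example the estimate is about 25 MB against a default ceiling of about 102 MB, so the default leaves ample headroom.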

Monitor the buffer cache hit statistics

# sar -b 3 333
SunOS zangief 5.7 Generic sun4u 06/27/99

22:01:51 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
22:01:54       0    7118     100       0       0     100       0       0
22:01:57       0    7863     100       0       0     100       0       0
22:02:00       0    7931     100       0       0     100       0       0
22:02:03       0    7736     100       0       0     100       0       0
22:02:06       0    7643     100       0       0     100       0       0
22:02:09       0    7165     100       0       0     100       0       0
22:02:12       0    6306     100       8      25      68       0       0
22:02:15       0    8152     100       0       0     100       0       0
22:02:18       0    7893     100       0       0     100       0       0
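As a rough illustration of how the hit percentages relate to the other columns, the sketch below assumes %rcache is derived as 1 - bread/lread and %wcache as 1 - bwrit/lwrit, with an idle interval reported as 100. That assumption is consistent with the 22:02:12 row (8 physical writes out of 25 logical writes gives 68). The function name is hypothetical.

```python
def cache_hit_percent(physical_per_s, logical_per_s):
    """Hit percentage assumed to be (1 - physical/logical) * 100.

    An interval with no logical accesses is reported as 100, matching
    the idle rows in the sar output.
    """
    if logical_per_s == 0:
        return 100
    return round((1 - physical_per_s / logical_per_s) * 100)


if __name__ == "__main__":
    print(cache_hit_percent(0, 7118))  # read side of the first sample
    print(cache_hit_percent(8, 25))    # write side of the 22:02:12 sample
```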


Page Cache and Virtual Memory System


File system caching behavior

- Physical memory is divided into pages.
- Reading a file "pages in" the file: to read data from a file into memory, the virtual memory system reads in one page at a time.
- The page scanner searches for least recently used (LRU) pages and puts them back on the free list.

File system caching behavior (Cont.)

[Figure: scan rate, free memory, and page-ins]

File System Paging Optimizations

- Reduce the amount of memory pressure:
  - invoke free-behind with sequential access
  - free pages when free memory falls to lotsfree
- Limit the file system's use of the page cache:
  - pages_before_pager, default 200 pages
  - reflects the amount of memory above the point where the page scanner starts (lotsfree)
  - when memory falls to 1.6 megabytes (on UltraSPARC) above lotsfree, the file system throttles back its use of the page cache
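The 1.6 megabyte figure follows from the default of 200 pages if an 8 KB page size is assumed for UltraSPARC; the slide does not state the page size, so treat it as an assumption. A minimal sketch with hypothetical function names:

```python
PAGE_SIZE_BYTES = 8192  # assumption: 8 KB base page size on UltraSPARC


def throttle_threshold_pages(lotsfree_pages, pages_before_pager=200):
    """Free-memory level (in pages) at which the file system throttles back."""
    return lotsfree_pages + pages_before_pager


def pages_to_mb(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to decimal megabytes."""
    return pages * page_size / 1_000_000


if __name__ == "__main__":
    print(f"{pages_to_mb(200):.1f} MB above lotsfree")
    print(throttle_threshold_pages(730), "pages when lotsfree is 730")
```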

File System Paging Optimizations (Cont.)

- When memory falls to lotsfree + pages_before_pager:
  - Solaris file systems free all pages after they are written
  - UFS and NFS enable free-behind on sequential access
  - NFS disables read-ahead
  - NFS writes synchronously, rather than asynchronously
  - VxFS enables free-behind (some versions)


File System Performance


Paging affects the user's application

- The page scanner puts too much pressure on the user application's private process memory.
- If the scan rate is several hundred pages a second, the amount of time before the scanner checks whether a page has been accessed falls to a few seconds.
- Any pages that have not been used in the last few seconds will be taken.
- This behavior negatively affects application performance.

Example

- Consider an OLTP application that makes heavy use of the file system.
- The database is generating file system I/O, making the page scanner actively steal pages from the system.
- The user of the OLTP application has paused for 15 seconds to read the contents of a screen from the last transaction.
- During this time, the page scanner has found that the pages associated with the user application have not been referenced and makes them available for stealing.
- The pages are stolen. When the user types the next keystroke, they are forced to wait until the application is paged back in, usually several seconds.
- The user is forced to wait for an application to page in from the swap device, even though the application is running on a system with sufficient memory to keep all of the application in physical memory!

The priority paging algorithm

- Places a boundary around the file cache so that file system I/O does not cause unnecessary paging of applications.
- Prioritizes the different types of pages in the page cache, in order of importance:
  - Highest: pages associated with executables and shared libraries, including application process memory (anonymous memory)
  - Lowest: regular file cache pages
- As long as the system has sufficient memory, the scanner only steals pages associated with regular files.
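The policy can be pictured with a deliberately simplified model. It assumes two thresholds: lotsfree, and a higher cachefree (set to twice lotsfree elsewhere in this deck). Between them the scanner takes only regular file pages; below lotsfree it may take any page. This is an illustrative sketch, not the kernel's actual scanner logic, and the function name is hypothetical.

```python
def scanner_may_steal(page_kind, free_pages, lotsfree, cachefree):
    """Simplified priority paging decision for a single page.

    page_kind is one of 'executable', 'anonymous', or 'file'.
    """
    if free_pages >= cachefree:
        return False                 # plenty of free memory: nothing is stolen
    if free_pages >= lotsfree:
        return page_kind == "file"   # mild pressure: only regular file cache pages
    return True                      # real shortage: any page may be stolen


if __name__ == "__main__":
    lotsfree, cachefree = 730, 1460
    for kind in ("executable", "anonymous", "file"):
        print(kind, scanner_may_steal(kind, 1000, lotsfree, cachefree))
```

With free memory at 1000 pages, between the two thresholds, only the file page is eligible, which is the boundary around the file cache that the slide describes.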

Enable priority paging

Set the parameter priority_paging in /etc/system:

set priority_paging=1

To enable priority paging on a live 32-bit system, set the following with adb:

# adb -kw /dev/ksyms /dev/mem


lotsfree: 730 <- value of lotsfree is printed

cachefree/W 0t1460 <- insert 2 x value of lotsfree preceded with 0t (decimal)

dyncachefree/W 0t1460 <- insert 2 x value of lotsfree preceded with 0t (decimal)


cachefree: 1460


dyncachefree: 1460


Enable priority paging (Cont.)

To enable priority paging on a live 64-bit system, set the following with adb:

# adb -kw /dev/ksyms /dev/mem


lotsfree: 730 <- value of lotsfree is printed

cachefree/Z 0t1460 <- insert 2 x value of lotsfree preceded with 0t (decimal)

dyncachefree/Z 0t1460 <- insert 2x value of lotsfree preceded with 0t (decimal)


cachefree: 1460


dyncachefree: 1460
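The two adb sessions differ only in the write command (W on a 32-bit system, Z on a 64-bit system) and both write twice the printed value of lotsfree. A small helper, hypothetical and not part of any Solaris tool, can build the command strings to type:

```python
def priority_paging_adb_commands(lotsfree_pages, kernel_bits=32):
    """Build the adb write commands shown on the slides.

    Uses W for a 32-bit kernel and Z for a 64-bit kernel, and writes
    2 x lotsfree as a decimal value (0t prefix).
    """
    if kernel_bits not in (32, 64):
        raise ValueError("kernel_bits must be 32 or 64")
    write_cmd = "W" if kernel_bits == 32 else "Z"
    value = 2 * lotsfree_pages
    return [
        f"cachefree/{write_cmd} 0t{value}",
        f"dyncachefree/{write_cmd} 0t{value}",
    ]


if __name__ == "__main__":
    for line in priority_paging_adb_commands(730, kernel_bits=64):
        print(line)
```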

Paging types

- The execute bit associated with the address space distinguishes executable files from regular files.
- Paging types: executable, application, and file
- memstat command:
  - output is similar to that of vmstat, but with extra fields to differentiate paging types
  - pi po fr sr epi epo epf api apo apf fpi fpo fpf

Paging caused by an application memory shortage

# ./readtest testfile&

# memstat 3
memory --------- paging --------- -executable -anonymous- - filesys - --- cpu ---
  free re  mf  pi  po  fr de   sr epi epo epf api apo apf fpi fpo fpf us sy wt id
  2080  1   0 749 512 821  0  264   0   0 269   0 512 549 749   0   2  1  7 92  0
  1912  0   0 762 384 709  0  237   0   0 290   0 384 418 762   0   0  1  4 94  0
  1768  0   0 738 426 610  0 1235   0   0 133   0 426 434 738   0  42  4 14 82  0
  1920  0   2 781 469 821  0  479   0   0 218   0 469 525 781   0  77 24 54 22  0
  2048  0   0 754 514 786  0  195   0   0 152   0 512 597 754   2  37  1  8 91  0
  2024  0   0 741 600 850  0  228   0   0 101   0 597 693 741   2  56  1  8 91  0
  2064  0   1 757 426 589  0  143   0   0  72   8 426 498 749   0  18  1  7 92  0
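To read such output programmatically, each row can be split into the 21 fields named in the header. The sketch below parses one row and breaks the page-outs down by type; heavy anonymous page-outs (apo) are what mark an application memory shortage. The helper names are illustrative.

```python
MEMSTAT_FIELDS = (
    "free re mf pi po fr de sr "
    "epi epo epf api apo apf fpi fpo fpf "
    "us sy wt id"
).split()


def parse_memstat_row(line):
    """Map one whitespace-separated memstat data row onto the header fields."""
    values = [int(v) for v in line.split()]
    if len(values) != len(MEMSTAT_FIELDS):
        raise ValueError(f"expected {len(MEMSTAT_FIELDS)} fields, got {len(values)}")
    return dict(zip(MEMSTAT_FIELDS, values))


def page_outs_by_type(row):
    """Page-outs split into executable, anonymous, and file system pages."""
    return {"executable": row["epo"], "anonymous": row["apo"], "file": row["fpo"]}


if __name__ == "__main__":
    first = parse_memstat_row(
        "2080 1 0 749 512 821 0 264 0 0 269 0 512 549 749 0 2 1 7 92 0"
    )
    print(page_outs_by_type(first))
```

In the first sample all 512 page-outs are anonymous pages, and the total page-ins equal epi + api + fpi.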

Paging through the file system

# ./readtest testfile&

# memstat 3
memory --------- paging --------- -executable -anonymous- - filesys - --- cpu ---
  free re  mf  pi  po  fr de   sr epi epo epf api apo apf fpi fpo fpf us sy wt id
  3616  6   0 760   0 752  0  673   0   0   0   0   0   0 760   0 752  2  3 95  0
  3328  2 198 816   0 925  0 1265   0   0   0   0   0   0 816   0 925  2 10 88  0
  3656  4 195 765   0 792  0  263   0   0   0   2   0   0 762   0 792  7 11 83  0
  3712  4   0 757   0 792  0  186   0   0   0   0   0   0 757   0 792  1  9 91  0
  3704  3   0 770   0 789  0  203   0   0   0   0   0   0 770   0 789  0  5 95  0
  3704  4   0 757   0 805  0  205   0   0   0   0   0   0 757   0 805  2  6 92  0
  3704  4   0 778   0 805  0  266   0   0   0   0   0   0 778   0 805  1  6 93  0

Paging parameters affecting performance

- When priority paging is enabled, the file system scan rate is higher.
- High scan rates should not be used as a factor for determining memory shortage.
- If file system activity is heavy, the default scanner parameters are insufficient and will limit file system performance.
- Set the scanner parameters fastscan and maxpgio to allow the scanner to scan at a high enough rate to keep up with the file system.

Scanner parameters

- fastscan: the number of pages per second the scanner can scan
  - defaults to 1/4 of memory per second, limited to 64 MB per second
  - limits file system throughput
  - when memory is at lotsfree, the scanner runs at half of fastscan, limited to 32 MB per second
  - if only 1/3 of physical memory pages are file pages, the scanner will only be able to put 32 / 3 = 11 MB per second of memory on the free list
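The throughput limit can be worked through step by step. This sketch uses only the figures on the slide (a quarter of memory per second capped at 64 MB/s, half of that at lotsfree, and the fraction of pages that are file pages); the function names are illustrative.

```python
def default_fastscan_mb_per_s(memory_mb):
    """Default fastscan: 1/4 of memory per second, capped at 64 MB/s."""
    return min(memory_mb / 4, 64)


def scan_rate_at_lotsfree_mb_per_s(fastscan_mb_per_s):
    """At lotsfree the scanner runs at half of fastscan, capped at 32 MB/s."""
    return min(fastscan_mb_per_s / 2, 32)


def file_pages_freed_mb_per_s(scan_rate_mb_per_s, file_page_fraction):
    """Only the scanned pages that are file pages reach the free list."""
    return scan_rate_mb_per_s * file_page_fraction


if __name__ == "__main__":
    fastscan = default_fastscan_mb_per_s(4096)
    rate = scan_rate_at_lotsfree_mb_per_s(fastscan)
    print(f"{file_pages_freed_mb_per_s(rate, 1 / 3):.1f} MB/s freed")
```

On a 4 GB machine with the defaults this gives roughly 10.7 MB/s, which the slide rounds to 11 MB per second.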

Scanner parameters (Cont.)

- maxpgio: the maximum number of pages the page scanner can push
  - limits the write performance of the file system
  - if memory is sufficient, set maxpgio to a large value, such as 1024
- Example, on a 4 GB machine:

  set fastscan=131072
  set handspreadpages=131072
  set maxpgio=1024
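The slide does not say how 131072 was chosen. One reading, assuming 8 KB pages, is that it equals a quarter of the pages in 4 GB of memory. The sketch below reproduces the three /etc/system lines under that assumption; the helper names are hypothetical.

```python
def quarter_of_memory_in_pages(memory_gb, page_size_kb=8):
    """A quarter of physical memory expressed in pages (8 KB pages assumed)."""
    total_pages = memory_gb * 1024 * 1024 // page_size_kb
    return total_pages // 4


def etc_system_lines(memory_gb, maxpgio=1024):
    """Build /etc/system settings in the style of the slide's example."""
    pages = quarter_of_memory_in_pages(memory_gb)
    return [
        f"set fastscan={pages}",
        f"set handspreadpages={pages}",
        f"set maxpgio={maxpgio}",
    ]


if __name__ == "__main__":
    print("\n".join(etc_system_lines(4)))
```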


VM Parameters That Affect File Systems

Direct I/O

- Unbuffered I/O that bypasses the file system page cache.
- UFS direct I/O:
  - allows reads and writes to files in a regular file system to bypass the page cache and access the file at near raw disk performance
  - can be advantageous when accessing a file in a manner where caching is of no benefit, e.g., copying a very large file from one disk to another
  - eliminates the double copy that is performed when the read and write system calls are used, by arranging for the DMA transfer to occur directly into the user's address space


Enable direct I/O

- Direct I/O will only bypass the buffer cache if all of the following are true:
  - The file is not memory mapped.
  - The file is not on a logging file system.
  - The file does not have holes.
  - The read/write is sector aligned (512 bytes).
- To enable direct I/O:
  - mount an entire file system with the forcedirectio mount option:

    # mount -o forcedirectio /dev/dsk/c0t0d0s6 /u1

  - or use the directio system call, on a per-file basis:

    int directio(int fildes, DIRECTIO_ON | DIRECTIO_OFF);
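The four conditions can be expressed as a single predicate. This is only a model of the rule as listed on the slide, not a call into any Solaris interface; the function and parameter names are hypothetical, and sector alignment is taken to apply to both the offset and the length of the request.

```python
SECTOR_SIZE = 512  # bytes, as given on the slide


def direct_io_applies(memory_mapped, on_logging_fs, has_holes, offset, length):
    """Return True only if every listed condition for bypassing the cache holds."""
    sector_aligned = offset % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0
    return (
        not memory_mapped
        and not on_logging_fs
        and not has_holes
        and sector_aligned
    )


if __name__ == "__main__":
    print(direct_io_applies(False, False, False, offset=0, length=65536))
    print(direct_io_applies(False, False, False, offset=100, length=65536))
```

A request that fails any one condition falls back to the normal cached path in this model, which is why small unaligned I/O gains nothing from forcedirectio.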

UFS direct I/O

- Direct I/O can provide extremely fast transfers when moving data with big block sizes (>64 KB), but it can be a significant performance limitation for smaller sizes.
- The structure ufs_directio_kstats holds the direct I/O statistics:

struct ufs_directio_kstats {
    uint_t logical_reads;   /* Number of fs read operations */
    uint_t phys_reads;      /* Number of physical reads */
    uint_t hole_reads;      /* Number of reads from holes */
    uint_t nread;           /* Physical bytes read */
    uint_t logical_writes;  /* Number of fs write operations */
    uint_t phys_writes;     /* Number of physical writes */
    uint_t nwritten;        /* Physical bytes written */
    uint_t nflushes;        /* Number of times cache was cleared */
} ufs_directio_kstats;

Directory Name Cache

- DNLC, the Directory Name Lookup Cache, caches path names for vnodes.
- Each time we find the path name for a vnode, we store it in the DNLC.
- ncsize is a system-tunable parameter that sets the number of entries in the DNLC. It is set at boot time:
  - ncsize = (17 * maxusers) + 90 in Solaris 2.4, 2.5, 2.5.1
  - ncsize = (68 * maxusers) + 360 in Solaris 2.6, 2.7
- maxusers is equal to the number of megabytes of memory installed in the system, up to a maximum of 1024; it can also be overridden up to 2048.
- Hit rate: the number of times a name was looked up and found in the name cache.
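The boot-time defaults can be computed directly from the two formulas. The sketch below encodes them as given on the slide; the function names are illustrative, and the release strings are simply the ones the slide lists.

```python
def default_maxusers(memory_mb):
    """maxusers defaults to installed memory in MB, capped at 1024."""
    return min(memory_mb, 1024)


def default_ncsize(maxusers, solaris_release):
    """Default DNLC size for the Solaris releases named on the slide."""
    if solaris_release in ("2.4", "2.5", "2.5.1"):
        return 17 * maxusers + 90
    if solaris_release in ("2.6", "2.7"):
        return 68 * maxusers + 360
    raise ValueError(f"no formula given for Solaris {solaris_release}")


if __name__ == "__main__":
    maxusers = default_maxusers(4096)  # a 4 GB machine hits the 1024 cap
    print(default_ncsize(maxusers, "2.5.1"))
    print(default_ncsize(maxusers, "2.7"))
```

For the same maxusers, the Solaris 2.6 and 2.7 formula yields a DNLC exactly four times larger than the earlier one.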

Inode Caches

- UFS keeps a number of inodes in memory:
  - to minimize disk inode reads
  - to keep the inode's vnode in memory
- ufs_ninode:
  - sizes the tables for the expected number of inodes
  - affects the number of inodes in memory
- How UFS maintains inodes:
  - inodes are created when a file is first referenced
  - states: referenced, or on an idle queue
  - inodes are destroyed when pushed off the end of the idle queue


Inode Caches (Cont.)

- The number of inodes in memory is dynamic:
  - there is no upper bound on the number of inodes open at a time
- The idle queue:
  - when an inode is no longer referenced, it is placed on the idle queue
  - its size is controlled by the ufs_ninode parameter and is limited to 1/4 of ufs_ninode

[Figure: inodes referred to by other subsystems]

Inode Caches (Cont.)

# sar -v 3 3

SunOS devhome 5.7 Generic sun4u 08/01/99

11:38:09 proc-sz  ov inod-sz     ov file-sz ov lock-sz
11:38:12 100/5930  0 37181/37181  0 603/603  0 0/0
11:38:15 100/5930  0 37181/37181  0 603/603  0 0/0
11:38:18 101/5930  0 37181/37181  0 607/607  0 0/0

# netstat -k ufs_inode_cache

buf_size 440 align 8 chunk_size 440 slab_size 8192 alloc 1221573 alloc_fail 0
free 1188468 depot_alloc 19957 depot_free 21230 depot_contention 18 global_alloc 48330
global_free 7823 buf_constructed 3325 buf_avail 3678 buf_inuse 37182
buf_total 40860 buf_max 40860 slab_create 2270 slab_destroy 0 memory_class 0
hash_size 0 hash_lookup_depth 0 hash_rescale 0 full_magazines 219
empty_magazines 332 magazine_size 15 alloc_from_cpu0 579706 free_to_cpu0 588106
buf_avail_cpu0 15 alloc_from_cpu1 573580 free_to_cpu1 571309 buf_avail_cpu1 25

Inode Caches (Cont.)

- A hash table is used to look up inodes; its size is controlled by ufs_ninode.
- By default, ufs_ninode is set to the size of the directory name cache (ncsize).
- ufs_ninode can be set separately in /etc/system:

  set ufs_ninode = new_value
