A Five-Year Study of File-System Metadata

Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch

Slides by Alex Nelson

For UCSC CMPS 229, 2010-04-06

Large, longitudinal study

• Goal: Understand file system use trends.

• Why has “fullness” declined only slightly?

• Task: Study five years of desktop storage use.

• Analyzed 63K systems over 5 years: 4 billion files, 700 TB of data.

Many planners and developers need this

• File system developers

• Disk manufacturers

• File system utility developers

• Backup, antivirus, indexing, encryption

• SAN designers, capacity planners, multitier storage system developers

Related studies: Single-time snapshots

• This study dwarfs most others at single times: 11-16K file systems per year.

• Mullender84: 1 Unix system

• Irlam93: 1050 Unix file systems

• Sienknecht94: 46 HP-UX systems, 267 file systems

• Douceur98: 10K desktops

Related studies: Longitudinal studies

• This study is longer-term than others: 5 years.

• Bennett91: 1 day, 3 file servers.

• Smith94: 10 months, 4 file servers, 48 file systems

Overview

• Methodology

• File trends

• Directory trends

• Space usage trends

• Conclusion

Methodology:

• Data collection: By voluntary response, lottery incentive to participate, 1x/year.

• Data collected: Metadata, directory structure snapshot.

• File systems: NTFS, FAT32, FAT.

• Cohort identifiers: User name, computer name, volume ID, drive letter, total space.

File trend: File counts growing

• Metadata tables will be more loaded.

• Operations that are O(number of files) should expect more work.

Figure 1: CDFs of file systems by file count (x-axis: file count, log scale; y-axis: cumulative % of file systems; one curve per year, 2000-2004)

Figure 2: Histograms of files by size (x-axis: file size in bytes, log scale, power-of-2 bins; y-axis: files per file system; one curve per year, 2000-2004)

for directories. Thus, file system designers should ensure their metadata tables scale to large file counts. Additionally, we can expect file system scans that examine data proportional to the number of files and/or directories to take progressively longer. Examples of such scans include virus scans and metadata integrity checks following block corruption. Thus, it will become increasingly useful to perform these checks efficiently, perhaps by scanning in an order that minimizes movement of the disk arm.

3.2 File size

This section describes our findings regarding file size. We report the size of actual content, ignoring the effects of internal fragmentation, file metadata, and any other overhead. We observe that the overall file size distribution has changed slightly over the five years of our study. By contrast, the majority of stored bytes are found in increasingly larger files. Moreover, the latter distribution increasingly exhibits a double mode, due mainly to database and blob (binary large object) files.

Figure 2 plots histograms of files by size and Figure 3 plots the corresponding CDFs. We see that the absolute count of files per file system has grown significantly over time, but the general shape of the distribution has not changed significantly. Although it is not visible on the graph, the arithmetic mean file size has grown by 75% from 108 KB to 189 KB. In each year, 1–1.5% of files have a size of zero.

Figure 3: CDFs of files by size (x-axis: file size in bytes, log scale; y-axis: cumulative % of files; one curve per year, 2000-2004)

Figure 4: Histograms of bytes by containing file size (x-axis: containing file size in bytes, log scale, power-of-2 bins; y-axis: used space per file system in MB; one curve per year, 2000-2004)

The growth in mean file size from 108 KB to 189 KB over four years suggests that this metric grows roughly 15% per year. Another way to estimate this growth rate is to compare our 2000 result to the 1981 result of 13.4 KB obtained by Satyanarayanan [24]. This comparison estimates the annual growth rate as 12%. Note that this latter estimate is somewhat flawed, since it compares file sizes from two rather different environments.
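The two growth estimates above can be reproduced with a compound annual growth rate. The following is a minimal sketch; the function name `annual_growth_rate` is an illustration and not something taken from the paper, and it assumes growth compounds once per year.

```python
def annual_growth_rate(start_value: float, end_value: float, years: float) -> float:
    """Constant yearly rate r such that start_value * (1 + r) ** years == end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Mean file size within the study: 108 KB (2000) to 189 KB (2004).
within_study = annual_growth_rate(108.0, 189.0, 2004 - 2000)

# Satyanarayanan's 1981 mean of 13.4 KB versus the study's 2000 mean of 108 KB.
since_1981 = annual_growth_rate(13.4, 108.0, 2000 - 1981)

print(f"2000-2004: {within_study:.1%} per year")
print(f"1981-2000: {since_1981:.1%} per year")
```

Both values round to the figures quoted in the text, roughly 15% and 12% per year.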

Figure 5: CDFs of bytes by containing file size (x-axis: containing file size in bytes, log scale; y-axis: cumulative % of used space; one curve per year, 2000-2004)

FAST ’07: 5th USENIX Conference on File and Storage Technologies, USENIX Association

File trend: Size proportions same

• Many small files take a small amount of the total space.

• Impacts metadata and data placement.

File trend: Space/file increasing

• Space used is also becoming bimodal.

• Different file types have different size distributions.

File trend: New, larger file types

Figure 6: Contribution of file types to Figure 4 (2004). Video means files with extension avi, dps, mpeg, mpg, vob, or wmv; DB means files with extension ldf, mad, mdf, ndf, ost, or pst; and Blob means files named hiberfil.sys and files with extension bak, bkf, bkp, dmp, gho, iso, pqi, rbf, or vhd. (Stacked bar chart; x-axis: containing file size in power-of-2 bins from 512-1K to 64G-128G; y-axis: used space per file system in MB; series: Others, Video, DB, Blob)

Figure 4 plots histograms of bytes by containing file size, alternately described as histograms of files weighted by file size. Figure 5 plots CDFs of these distributions. We observe that the distribution of file size has shifted to the right over time, with the median weighted file size increasing from 3 MB to 9 MB. Also, the distribution exhibits a double mode that has become progressively more pronounced. The corresponding distribution in our 1998 study did not show a true second mode, but it did show an inflection point around 64 MB, which is near the local minimum in Figure 4.

To study this second peak, we broke out several categories of files according to file-name extension. Figure 6 replots the 2004 data from Figure 4 as a stacked bar chart, with the contributions of video, database, and blob files indicated. We see that most of the bytes in large files are in video, database, and blob files, and that most of the video, database, and blob bytes are in large files.

Our finding that different types of files have different size distributions echoes the findings of other studies. In 1981, Satyanarayanan [24] found this to be the case on a shared file server in an academic environment. In 2001, Evans and Kuenning also noted this phenomenon in their analysis of 22 machines running various operating systems at Harvey Mudd College and Marine Biological Laboratories [11]. The fact that this finding is consistent across various different environments and times suggests that it is fundamental.
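The "bytes by containing file size" weighting behind Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the function names are hypothetical, the sample sizes are made up, and bin k is assumed to hold sizes from 2**k up to but not including 2**(k+1).

```python
from collections import defaultdict


def bytes_by_containing_file_size(file_sizes):
    """Sum file bytes into power-of-2 size bins.

    Bin k holds files whose size s satisfies 2**k <= s < 2**(k + 1).
    Zero-byte files contribute no bytes and are skipped.
    """
    bins = defaultdict(int)
    for size in file_sizes:
        if size > 0:
            bins[size.bit_length() - 1] += size
    return dict(bins)


def weighted_median_file_size(file_sizes):
    """Smallest size s such that files of size <= s hold at least half of all bytes."""
    total = sum(file_sizes)
    running = 0
    for size in sorted(file_sizes):
        running += size
        if 2 * running >= total:
            return size
    return 0


# Hypothetical file system: many small files, a few medium ones, one large blob.
sizes = [1024] * 1000 + [4 * 2**20] * 10 + [512 * 2**20]
histogram = bytes_by_containing_file_size(sizes)
median_weighted = weighted_median_file_size(sizes)
```

In this made-up sample the unweighted median file size is 1 KB, while the byte-weighted median is the single 512 MB file, which is the kind of gap the weighted view is meant to expose.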

Figure 7: Histograms of files by age (x-axis: file age, log scale, power-of-2 bins; y-axis: files per file system; one curve per year, 2000-2004)

Figure 8: CDFs of files by age (x-axis: file age, log scale; y-axis: cumulative % of files; one curve per year, 2000-2004)

There are several implications of the fact that a large number of small files account for a small fraction of disk usage, such as the following. First, it may not take much space to colocate many of these files with their metadata. This may be a reasonable way to reduce the disk seek time needed to access these files. Second, a file system that colocates several files in a single block, like ReiserFS [22], will have many opportunities to do so. This will save substantial space by eliminating internal fragmentation, especially if a large block size is used to improve performance. Third, designers of disk usage visualization utilities may want to show not only directories but also the names of certain large files.
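To make the internal-fragmentation point concrete, the sketch below measures the slack left when every file is given whole blocks of its own. The file sizes and both block sizes are hypothetical, and the result is only an upper bound on what packing several files into one block could recover; it does not model ReiserFS itself.

```python
def allocated_bytes(size: int, block_size: int) -> int:
    """Bytes consumed when a file is given whole blocks of its own."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


def internal_fragmentation(file_sizes, block_size: int) -> int:
    """Total slack: allocated bytes minus actual content bytes."""
    return sum(allocated_bytes(s, block_size) - s for s in file_sizes)


small_files = [100, 700, 1500, 3000]  # hypothetical sizes in bytes
waste_4k = internal_fragmentation(small_files, 4096)
waste_64k = internal_fragmentation(small_files, 65536)
print(waste_4k, waste_64k)
```

With these four files the slack grows from about 11 KB at a 4 KB block size to about 250 KB at a 64 KB block size, which illustrates why the saving matters more as blocks get larger.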

3.3 File age

This subsection describes our findings regarding file age. Because file timestamps can be modified by application programs [17], our conclusions should be regarded cautiously.

Figure 7 plots histograms of files by age, calculated as the elapsed time since the file was created or last modified, relative to the time of the snapshot. Figure 8 shows CDFs of this same data. The median file age ranges between 80 and 160 days across datasets, with no clear trend over time.


• Distribution shown for 2004.

• New data: video, blob (including .vhd).

File trend: Extension consistency

• Most-popular extensions, by file count and aggregate file size, are consistent across years.

• Special-case treatment would benefit nearly half the files and half the space.

Figure 9: Fraction of files with popular extensions (stacked bars per year, 2000-2004; y-axis: % of files, 0% to 50%; extensions shown: gif, dll, h, htm, Ø, cpp, exe, jpg, txt)

The distribution of file age is not memoryless, so the age of a file is useful in predicting its remaining lifetime. So, systems such as archival backup systems can use this distribution to make predictions of how much longer a file will be needed based on how old it is. Since the distribution of file age has not appreciably changed across the years, we can expect that a prediction algorithm developed today based on the latest distribution will apply for several years to come.

3.4 File-name extensions

This subsection describes our findings regarding popular file types, as determined by file-name extension. Although the top few extensions have not changed dramatically over our five-year sample period, there has been some change, reflecting a decline in the relative prevalence of web content and an increase in use of virtual machines. The top few extensions account for nearly half of all files and bytes in file systems.

In old DOS systems with 8.3-style file names, the extension was the zero to three characters following the single dot in the file name. Although Windows systems allow file names of nearly arbitrary length and containing multiple dots, many applications continue to indicate their file types by means of extensions. For our analyses, we define an extension as the five-or-fewer characters following the last dot in a file name. If a name has no dots or has more than five characters after the last dot, we consider that name to have no extension, which we represent with the symbol Ø. As a special case, if a file name ends in .gz, .bz2, or .Z, then we ignore that suffix when determining extension. We do this because these are types of compressed files wherein the actual content type is indicated by the characters prior to the compression extension. To understand the typical usage of the file extensions we discuss in this section, see Table 4.

Extension   Typical Usage
cpp         C++ source code
dll         Dynamic link library
exe         Executable
gif         Image in Graphic Interchange Format
h           Source code header
htm         File in hypertext markup language
jpg         Image in JPEG format
lib         Code library
mp3         Music file in MPEG Layer III format
pch         Precompiled header
pdb         Source symbols for debugging
pst         Outlook personal folder
txt         Text
vhd         Virtual hard drive for virtual machine
wma         Windows Media Audio

Table 4: Typical usage of popular file extensions
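The extension rule defined in §3.4 can be sketched as a small function. This is one reading of the prose, not the authors' code. Three details are assumptions: names are lowercased before matching (Windows file names are case-insensitive), compression suffixes are matched case-insensitively, and a name ending in a bare dot is treated as having no extension.

```python
NO_EXTENSION = "Ø"
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".z")  # assumption: matched case-insensitively
MAX_EXTENSION_LENGTH = 5


def extension(file_name: str) -> str:
    """Return the extension per the paper's definition, or Ø if there is none."""
    name = file_name.lower()  # assumption: extensions compared case-insensitively
    # Ignore one trailing compression suffix, so "a.tar.gz" is classified by "tar".
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    dot = name.rfind(".")
    if dot == -1:
        return NO_EXTENSION
    ext = name[dot + 1:]
    # Assumption: an empty extension (name ends in ".") counts as no extension.
    if not ext or len(ext) > MAX_EXTENSION_LENGTH:
        return NO_EXTENSION
    return ext


for example in ("report.htm", "archive.tar.gz", "README", "notes.backup2004", "Setup.EXE"):
    print(example, "->", extension(example))
```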

Figure 10: Fraction of bytes in files with popular extensions (stacked bars per year, 2000-2004; y-axis: % of used space, 0% to 50%; extensions shown: dll, pdb, exe, pst, pch, mp3, lib, Ø, wma, vhd)

Figure 9 plots, for the nine extensions that are the most popular in terms of file count, the fraction of files with that extension. The fractions are plotted longitudinally over our five-year sample period. The most notable thing we observe is that these extensions’ popularity is relatively stable: the top five extensions have remained the top five for this entire time. However, the relative popularity of gif files and htm files has gone down steadily since 2001, suggesting a decline in the popularity of web content relative to other ways to fill one’s file system.

Figure 10 plots, for the ten extensions that are the most popular in terms of summed file size, the fraction of file bytes residing in files with that extension. Across all years, dynamic link libraries (dll files) contain more bytes than any other file type. Extension vhd, which is used for virtual hard drives, is consuming a rapidly increasing fraction of file-system space, suggesting that


File trend: Age ... inconclusive

• No clear trend and no wild deviation.

• Predictions of future utility from this data are likely to stay useful for a while.

File trend: Users writing less

• Unwritten file proportion has grown.

• This bodes well for P2P or rapid-deployment systems.

Figure 11: Histograms of file systems by percentage of files unwritten (x-axis: % of files unwritten, 5-percentage-point bins; y-axis: % of file systems; one curve per year, 2000-2004)

Figure 12: CDFs of file systems by percentage of files unwritten (x-axis: % of files unwritten; y-axis: cumulative % of file systems; one curve per year, 2000-2004)

virtual machine use is increasing. The null extension exhibits a notable anomaly in 2003, but we cannot investigate the cause without decrypting the file names in our datasets, which would violate our privacy policy.

Since files with the same extension have similar properties and requirements, some file system management policies can be improved by including special-case treatment for particular extensions. Such special-case treatment can be built into the file system or autonomically and dynamically learned [16]. Since nearly half the files, and nearly half the bytes, belong to files with a few popular extensions, developing such special-case treatment for only a few particular extensions can optimize performance for a large fraction of the file system. Furthermore, since the same extensions continue to be popular year after year, one can develop special-case treatments for today’s popular extensions and expect that they will still be useful years from now.

3.5 Unwritten files

Figures 11 and 12 plot histograms and CDFs, respectively, of file systems by percentage of files that have not been written since they were copied onto the file system. We identify such files as ones whose modification timestamps are earlier than their creation timestamps, since the creation timestamp of a copied file is set to the time at which the copy was made, but its modification timestamp is copied from the original file. Over our sample period, the arithmetic mean of the percentage of locally unwritten files has grown from 66% to 76%, and the median has grown from 70% to 78%. This suggests that users locally contribute to a decreasing fraction of their systems’ content. This may in part be due to the increasing amount of total content over time.

Figure 13: CDFs of file systems by directory count (x-axis: directory count, log scale, power-of-2 bins; y-axis: cumulative % of file systems; one curve per year, 2000-2004)

Since more and more files are being copied across file systems rather than generated locally, we can expect identifying and coalescing identical copies to become increasingly important in systems that aggregate file systems. Examples of systems with such support are the FARSITE distributed file system [1], the Pastiche peer-to-peer backup system [8], and the Single Instance Store in Windows file servers [5].
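The unwritten-file test from §3.5 (modification timestamp earlier than creation timestamp) reduces to a single comparison per file. The sketch below is illustrative only: the function names and the sample timestamps are hypothetical, and each file is represented as a (created, modified) pair.

```python
from datetime import datetime
from typing import Iterable, Tuple


def is_unwritten(created: datetime, modified: datetime) -> bool:
    """A copied, never locally written file keeps the original's older modification time."""
    return modified < created


def percent_unwritten(files: Iterable[Tuple[datetime, datetime]]) -> float:
    """Percentage of files whose modification time precedes their creation time."""
    files = list(files)
    if not files:
        return 0.0
    unwritten = sum(1 for created, modified in files if is_unwritten(created, modified))
    return 100.0 * unwritten / len(files)


# Hypothetical snapshot: three files copied in, one edited after creation.
snapshot = [
    (datetime(2004, 3, 1), datetime(2002, 7, 15)),
    (datetime(2004, 3, 1), datetime(2003, 1, 9)),
    (datetime(2004, 5, 20), datetime(2001, 11, 2)),
    (datetime(2004, 6, 1), datetime(2004, 6, 30)),
]
result = percent_unwritten(snapshot)
print(result)
```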

4 Directories

4.1 Directory count per file system

Figure 13 plots CDFs of file systems by count of directories. The count of directories per file system has increased steadily over our five-year sample period: The arithmetic mean has grown from 2400 to 8900 directories and the median has grown from 1K to 4K directories.

We discussed implications of the rising number of directories per file system earlier, in §3.1.

4.2 Directory size

This section describes our findings regarding directory size, measured by count of contained files, count of contained subdirectories, and total entry count. None of these size distributions has changed appreciably over our sample period, but the mean count of files per directory has decreased slightly.


Directory trends: Size is consistent

• Counts are rising. Median size is not, for either contained files or subdirectories.

• Colocating metadata with directories is effective and cheap.

Figure 14: CDFs of directories by file count (x-axis: count of contained files, 0 to 50; y-axis: cumulative % of directories; one curve per year, 2000-2004)

Figure 15: CDFs of directories by subdirectory count (x-axis: count of subdirectories, 0 to 10; y-axis: cumulative % of directories; one curve per year, 2000-2004, plus a 2004 model curve)

Figure 14 plots CDFs of directories by size, as measured by count of files in the directory. It shows that although the absolute count of directories per file system has grown significantly over time, the distribution has not changed appreciably. Across all years, 23–25% of directories contain no files, which marks a change from 1998, in which only 18% contained no files and there were more directories containing one file than those containing none. The arithmetic mean directory size has decreased slightly and steadily from 12.5 to 10.2 over the sample period, but the median directory size has remained steady at 2 files.

Figure 15 plots CDFs of directories by size, as measured by count of subdirectories in the directory. It includes a model approximation we will discuss later in §4.5. This distribution has remained unchanged over our sample period. Across all years, 65–67% of directories contain no subdirectories, which is similar to the 69% found in 1998.

Figure 16 plots CDFs of directories by size, as measured by count of total entries in the directory. This distribution has remained largely unchanged over our sample period. Across all years, 46–49% of directories contain two or fewer entries.

Since there are so many directories with a small number of files, it would not take much space to colocate the metadata for most of those files with those directories. Such a layout would reduce seeks associated with file accesses. Therefore, it might be useful to preallocate a small amount of space near a new directory to hold a modest amount of child metadata. Similarly, most directories contain fewer than twenty entries, suggesting using an on-disk structure for directories that optimizes for this common case.

Figure 16: CDFs of directories by entry count (x-axis: count of entries, 0 to 50; y-axis: cumulative % of directories; one curve per year, 2000-2004)

Figure 17: Fraction of files and bytes in special subtrees (bars for Windows, Program Files, and Documents and Settings, grouped under Files and Bytes; y-axis: % of files or bytes, 0% to 20%; one bar per year, 2000-2004)

4.3 Special directories

This section describes our findings regarding the usage of Windows special directories. We find that an increasing fraction of file-system storage is in the namespace subtree devoted to system files, and the same holds for the subtree devoted to user documents and settings.

Figure 17 plots the fraction of file-system files that reside within subtrees rooted in each of three special directories: Windows, Program Files, and Documents and Settings. This figure also plots the fraction of file-system bytes contained within each of these special subtrees.
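The per-subtree fractions plotted in Figure 17 can be sketched as below. This is an illustration, not the study's tooling. It assumes each special directory sits directly under the volume root and that names are compared case-insensitively; the function names and sample paths are hypothetical.

```python
from pathlib import PureWindowsPath

SPECIAL_DIRECTORIES = ("windows", "program files", "documents and settings")


def special_subtree(path: str):
    """Return the special top-level directory containing the file, or None."""
    parts = PureWindowsPath(path).parts
    # parts[0] is the drive anchor (for example "C:\\"); parts[1] is the top-level directory.
    if len(parts) >= 3 and parts[1].lower() in SPECIAL_DIRECTORIES:
        return parts[1].lower()
    return None


def subtree_fractions(files):
    """Map each special directory to (fraction of files, fraction of bytes).

    `files` is a list of (path, size_in_bytes) pairs.
    """
    total_files = len(files)
    total_bytes = sum(size for _, size in files)
    counts = {name: [0, 0] for name in SPECIAL_DIRECTORIES}
    for path, size in files:
        name = special_subtree(path)
        if name is not None:
            counts[name][0] += 1
            counts[name][1] += size
    return {
        name: (n / total_files if total_files else 0.0,
               b / total_bytes if total_bytes else 0.0)
        for name, (n, b) in counts.items()
    }


# Hypothetical four-file volume.
sample = [
    (r"C:\Windows\System32\kernel32.dll", 900),
    (r"C:\Program Files\App\app.exe", 600),
    (r"C:\Documents and Settings\alice\My Documents\thesis.doc", 300),
    (r"C:\temp\scratch.tmp", 200),
]
fractions = subtree_fractions(sample)
```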

For the Windows subtree, the fractions of files and bytes have both risen from 2–3% to 11% over our sample period, suggesting that an increasingly large fraction of file-system storage is devoted to system files. In particular, we note that Windows XP was released between the times of our 2000 and 2001 data collections.

Figure 18: Histograms of directories by namespace depth (x-axis: namespace depth, bin size 1, 0 to 16; y-axis: average # of directories per file system; one curve per year, 2000-2004)

Namespace tree: Depth increasing

• Namespace depth is simple to model:

• Attach each new directory to a randomly selected extant directory d, chosen with weight c(d)+2.
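The generative model from the "Namespace tree" slide (attach each new directory to an extant directory d chosen with weight c(d)+2) can be simulated in a few lines. The slide does not define c(d); this sketch assumes it is the number of subdirectories d already has, and it assumes the root sits at depth 0. The function name, directory count, and seed are illustrative.

```python
import random
from collections import Counter


def simulate_depths(num_dirs: int, seed: int = 0):
    """Grow a directory tree one directory at a time; return each directory's depth."""
    rng = random.Random(seed)
    depth = [0]    # depth[i] is the namespace depth of directory i; index 0 is the root
    weight = [2]   # weight[i] is c(i) + 2, assuming c counts existing subdirectories
    for _ in range(num_dirs - 1):
        parent = rng.choices(range(len(depth)), weights=weight, k=1)[0]
        depth.append(depth[parent] + 1)
        weight.append(2)     # the new directory starts with no subdirectories
        weight[parent] += 1  # the parent gained one subdirectory
    return depth


depths = simulate_depths(2000, seed=1)
histogram = Counter(depths)
for level in sorted(histogram):
    print(level, histogram[level])
```

Each choice here costs time linear in the number of directories, which is fine for a few thousand directories; a larger simulation would want a cumulative-weight structure instead.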

For the Program Files subtree, the fractions offiles and bytes have trended in opposite directions withinthe range of 12–16%. For the Documents andSettings subtree, the fraction of bytes has increaseddramatically while the fraction of files has remained rel-atively stable.

The fraction of all files accounted for by these subtreeshas risen from 25% to 40%, and the fraction of bytestherein has risen from 30% to 41%, suggesting that ap-plication writers and end users have increasingly adoptedWindows’ prescriptive namespace organization [7].

Backup software generally does not have to back up system files, since they are static and easily restored. Since system files are accounting for a larger and larger fraction of used space, it is becoming more and more useful for backup software to exclude these files.

On the other hand, files in the Documents and Settings folder tend to be the most important files to back up, since they contain user-generated content and configuration information. Since the percentage of bytes devoted to these files is increasing, backup capacity planners should expect, surprisingly, that their capacity requirements will increase faster than disk capacity is planned to grow. On the other hand, the percentage of files is not increasing, so they need not expect metadata storage requirements to scale faster than disk capacity. This may be relevant if metadata is backed up in a separate repository from the data, as done by systems such as EMC Centera [13].

4.4 Namespace tree depth

This section describes our findings regarding the depth of directories, files, and bytes in the namespace tree. We find that there are many files deep in the namespace tree, especially at depth 7. Also, we find that files deeper in the namespace tree tend to be orders-of-magnitude smaller than shallower files.

Figure 19: CDFs of directories by namespace depth (x-axis: namespace depth, 0–16; y-axis: cumulative % of directories; series: 2000–2004 and the 2004 model)

Figure 20: Histograms of files by namespace depth (x-axis: namespace depth, 0–16; y-axis: average # of files per FS, 0–20000; series: 2000–2004)

Figure 18 plots histograms of directories by their depth in the namespace tree, and Figure 19 plots CDFs of this same data; it also includes a model approximation we will discuss later in §4.5. The general shape of the distribution has remained consistent over our sample period, but the arithmetic mean has grown from 6.1 to 6.9, and the median directory depth has increased from 5 to 6.
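As a rough illustration of how a mean and a median depth fall out of a depth histogram like the one in Figure 18, consider the sketch below. The histogram values are invented for the example and are not taken from the study; the function name is likewise hypothetical. The median is taken as the smallest depth at which the cumulative count reaches half of all directories, which is the point where a CDF such as Figure 19 crosses 50%.

```python
def mean_and_median_depth(histogram):
    """Compute (mean depth, median depth) from a depth histogram.

    histogram: dict mapping namespace depth -> number of directories there.
    """
    total = sum(histogram.values())
    mean = sum(depth * count for depth, count in histogram.items()) / total
    running = 0
    for depth in sorted(histogram):
        running += histogram[depth]
        # First depth at which the cumulative share reaches 50%.
        if running * 2 >= total:
            return mean, depth

# Made-up histogram: 20 directories spread over depths 0 through 4.
example = {0: 1, 1: 3, 2: 6, 3: 6, 4: 4}
mean_depth, median_depth = mean_and_median_depth(example)
print(mean_depth, median_depth)
```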

Figure 20 plots histograms of file count by depth in the namespace tree, and Figure 21 plots CDFs of this same data. With a few exceptions, such as at depths 2, 3, and 7, these distributions roughly track the observed distributions of directory depth, indicating that the count of files

Figure 21: CDFs of files by namespace depth (x-axis: namespace depth, 0–16; y-axis: cumulative % of files; series: 2000–2004)





Namespace tree: Files getting deeper

• Mean of file depth increasing like directory depth

• Many files deep in hierarchy means deep path lookup should be optimized.




Space usage: Uniform over time

Figure 24: CDFs of file systems by storage capacity (x-axis: file system capacity in bytes, 256M–256G; y-axis: cumulative % of file systems; series: 2000–2004)

Figure 25: CDFs of file systems by total consumed space (x-axis: used space in bytes, 4M–256G; y-axis: cumulative % of file systems; series: 2000–2004)

Intuitively, the proportional probability c(d) + 2 can be interpreted as follows: If a directory already has some subdirectories, it has demonstrated that it is a useful location for subdirectories, and so it is a likely place for more subdirectories to be created. The more subdirectories it has, the more demonstrably useful it has been as a subdirectory home, so the more likely it is to continue to spawn new subdirectories. If the probability were proportional to c(d) without any offset, then an empty directory could never become non-empty, so some offset is necessary. We found an offset of 2 to match our observed distributions very closely for all five years of our collected data, but we do not understand why the particular value of 2 should be appropriate.
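The generative rule above can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: the function name `grow_directory_tree`, the tree size of 2000 directories, the fixed random seed, and the convention that the root sits at depth 0 are all choices made for the example, not details from the study.

```python
import random
from collections import Counter

def grow_directory_tree(num_dirs, offset=2, seed=0):
    """Grow a namespace tree one directory at a time.

    Each new directory is attached to an existing directory d chosen with
    probability proportional to c(d) + offset, where c(d) is the number of
    subdirectories d currently has. Returns the depth of every directory,
    with the root (index 0) assumed to be at depth 0.
    """
    rng = random.Random(seed)
    depth = [0]      # depth[i]: depth of directory i
    children = [0]   # children[i]: c(i), current subdirectory count
    for _ in range(num_dirs - 1):
        weights = [c + offset for c in children]
        parent = rng.choices(range(len(depth)), weights=weights, k=1)[0]
        children[parent] += 1
        depth.append(depth[parent] + 1)
        children.append(0)
    return depth

depths = grow_directory_tree(2000)
histogram = Counter(depths)
for d in sorted(histogram):
    print(d, histogram[d])
```

The printed histogram can be compared by eye with the shape of Figure 18. Setting `offset=0` would fail on the first step, since the lone root would have weight zero, which mirrors the point that some offset is necessary.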

5 Space Usage

5.1 Capacity and usage

Figure 24 plots CDFs of file system volumes by storage capacity, which has increased dramatically over our five-year sample period: The arithmetic mean has grown from 8 GB to 46 GB and the median has grown from 5 GB to 40 GB. The number of small-capacity file system volumes has dropped dramatically: Systems of 4 GB or less have gone from 43% to 4% of all file systems.

Figure 26: CDFs of file systems by fullness (x-axis: fullness percentage, 0–100; y-axis: cumulative % of file systems; series: 2000–2004)

Figure 25 plots CDFs of file systems by total consumed space, including not only file content but also space consumed by internal fragmentation, file metadata, and the system paging file. Space consumption increased steadily over our five-year sample period: The geometric mean has grown from 1 GB to 9 GB, the arithmetic mean has grown from 3 GB to 18 GB, and the median has grown from 2 GB to 13 GB.

Figure 26 plots CDFs of file systems by percentage of fullness, meaning the consumed space relative to capacity. The distribution is very nearly uniform for all years, as it was in our 1998 study. The mean fullness has dropped slightly from 49% to 45%, and the median file system has gone from 47% full to 42% full. By contrast, the aggregate fullness of our sample population, computed as total consumed space divided by total file-system capacity, has held steady at 41% over all years.
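The difference between mean fullness and aggregate fullness is easy to miss, so here is a small sketch of the two computations. The three-volume sample is invented for illustration, and `fullness_summary` is a hypothetical helper name. In the mean, every file system counts equally; in the aggregate, large volumes dominate.

```python
from statistics import mean, median

def fullness_summary(volumes):
    """Summarize fullness for a population of file systems.

    volumes: list of (used_bytes, capacity_bytes) pairs, one per file system.
    Returns percentages: per-file-system mean and median, plus the aggregate
    (total consumed space divided by total capacity).
    """
    per_fs = [100.0 * used / cap for used, cap in volumes]
    total_used = sum(used for used, _ in volumes)
    total_cap = sum(cap for _, cap in volumes)
    return {
        "mean": mean(per_fs),
        "median": median(per_fs),
        "aggregate": 100.0 * total_used / total_cap,
    }

GB = 2**30
# Made-up sample: one small, mostly full volume and two large, emptier ones.
sample = [(3 * GB, 4 * GB), (10 * GB, 40 * GB), (20 * GB, 80 * GB)]
summary = fullness_summary(sample)
print(summary)
```

In this toy sample the small 75%-full volume pulls the mean well above the aggregate, which shows how the two statistics can move differently when the mix of volume sizes changes.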

In any given year, the range of file system capacities in this organization is quite large. This means that software must be able to accommodate a wide range of capacities simultaneously existing within an organization. For instance, a peer-to-peer backup system must be aware that some machines will have drastically more capacity than others. File system designs, which must last many years, must accommodate even more dramatic capacity differentials.

5.2 Changes in usage

This subsection describes our findings regarding how individual file systems change in fullness over time. For this part of our work, we examined the 6536 snapshot pairs that correspond to the same file system in two consecutive years. We also examined the 1320 snapshot pairs that correspond to the same file system two years apart. We find that 80% of file systems become fuller over a one-year period, and the mean increase in fullness is 14 percentage points. This increase is predominantly due to creation of new files, partly offset by deletion of old files, rather than due to extant files changing size.



Space usage: Cohorts

• Most file systems became fuller after one year, by 14 percentage points on average.

• Due to creating new files, not due to changing old ones.

Figure 27: Histograms of file systems by 1-year fullness increase (x-axis: fullness increase in percentage points, 5-percentage-point bins, -100 to 100; y-axis: % of file systems, 0–16; series: 2000 to 2001, 2001 to 2002, 2002 to 2003, 2003 to 2004)

Figure 28: CDFs of file systems by 1-year fullness increase (x-axis: fullness increase in percentage points, 5-percentage-point bins, -100 to 100; y-axis: cumulative % of file systems; series: 2000 to 2001, 2001 to 2002, 2002 to 2003, 2003 to 2004)

When comparing two matching snapshots in different years, we must establish whether two files in successive snapshots of the same file system are the same file. We do not have access to files’ inode numbers, because collecting them would have lengthened our scan times to an unacceptable degree. We thus instead use the following proxy for file sameness: If the files have the same full pathname, they are considered the same, otherwise they are not. This is a conservative approach: It will judge a file to be two distinct files if it or any ancestor directory has been renamed.
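The pathname proxy described above amounts to set operations on the two snapshots' path sets. The sketch below assumes each snapshot is available as a dict from full pathname to size; that representation, the function name `classify_files`, and the sample paths are assumptions for illustration.

```python
def classify_files(earlier, later):
    """Split files into created, deleted, and persisting sets by full pathname.

    earlier, later: dicts mapping full pathname -> file size in bytes.
    Two entries are treated as the same file only if their paths are identical.
    """
    created = later.keys() - earlier.keys()
    deleted = earlier.keys() - later.keys()
    persisting = earlier.keys() & later.keys()
    return created, deleted, persisting

# Hypothetical snapshots: the "docs" directory holding a.txt was renamed
# to "papers" between the two scans, while b.txt stayed in place and grew.
snap_2000 = {r"C:\docs\a.txt": 100, r"C:\docs\b.txt": 200}
snap_2001 = {r"C:\papers\a.txt": 100, r"C:\docs\b.txt": 250}
created, deleted, persisting = classify_files(snap_2000, snap_2001)
print(sorted(created), sorted(deleted), sorted(persisting))
```

The renamed file shows up once as a deletion and once as a creation, which is the conservative behavior the text describes.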

Figures 27 and 28 plot histograms and CDFs, respectively, of file systems by percentage-point increase in fullness from one year to the next. We define this term by example: If a file system was 50% full in 2000 and 60% full in 2001, it exhibited a 10 percentage-point increase in fullness. The distribution is substantially the same for all four pairs of consecutive years. Figure 28 shows that 80% of file systems exhibit an increase in fullness and fewer than 20% exhibit a decrease. The mean increase from one year to the next is 14 percentage points.

We also examined the increase in fullness over two years. We found the mean increase to be 22 percentage points. This is less than twice the consecutive-year increase, indicating that as file systems age, they increase their fullness at a slower rate. Because we have so few file systems with snapshots in four consecutive years, we did not explore increases over three or more years.

Since file systems that persist for a year tend to increase their fullness by about 14 points, but the mean file-system fullness has dropped from 49% to 45% over our sample period, it seems that the steadily increasing fullness of individual file systems is offset by the replacement of old file systems with newer, emptier ones.

Analyzing the factors that contribute to the 14-point mean year-to-year increase in fullness revealed the following breakdown: Fullness increases by 28 percentage points due to files that are present in the later snapshot but not in the earlier one, meaning that they were created during the intervening year. Fullness decreases by 15 percentage points due to files that are present in the earlier snapshot but not in the later one, meaning that they were deleted during the intervening year. Fullness also increases by 1 percentage point due to growth in the size of files that are present in both snapshots. An insignificant fraction of this increase is attributable to changes in system paging files, internal fragmentation, or metadata storage.
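The breakdown above (+28, -15, +1, netting to 14 points) can be reproduced for a single pair of snapshots with a sketch like the following. It assumes snapshots are dicts from full pathname to size and that capacity is unchanged between scans, and it ignores paging files, fragmentation, and metadata. The function name and the toy sizes are hypothetical, chosen so the example lands on the same numbers as the text.

```python
def fullness_change_breakdown(earlier, later, capacity):
    """Attribute a change in fullness to created, deleted, and resized files.

    earlier, later: dicts mapping full pathname -> size in bytes.
    capacity: volume capacity in bytes (assumed equal in both snapshots).
    Returns contributions in percentage points; their sum is the net change.
    """
    created = sum(size for path, size in later.items() if path not in earlier)
    deleted = sum(size for path, size in earlier.items() if path not in later)
    resized = sum(later[path] - size
                  for path, size in earlier.items() if path in later)

    def to_points(nbytes):
        return 100.0 * nbytes / capacity

    return {
        "created": to_points(created),
        "deleted": -to_points(deleted),
        "resized": to_points(resized),
    }

# Toy volume of 1000 bytes, with sizes picked to mirror the reported means.
before = {"a": 150, "b": 100}
after = {"b": 110, "c": 280}
parts = fullness_change_breakdown(before, after, capacity=1000)
net = sum(parts.values())
print(parts, net)
```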

We examined the size distributions of files that were created and of files that were deleted, to see if they differed from the overall file-size distribution. We found that they do not differ appreciably. We had hypothesized that users tend to delete large files to make room for new content, but the evidence does not support this hypothesis.

Since deleted files and created files have similar size distributions, file system designers need not expect the fraction of files of different sizes to change as a file system ages. Thus, if they find it useful to assign different parts of the disk to files of different sizes, they can anticipate the allocation of sizes to disk areas to not need radical change as time passes.

Many peer-to-peer systems use free space on computers to store shared data, so the amount of used space is of great importance. With an understanding of how this free space decreases as a file system ages, a peer-to-peer system can proactively plan how much it will need to offload shared data from each file system to make room for additional local content. Also, since a common reason for upgrading a computer is that its disk space becomes exhausted, a peer-to-peer system can use a prediction of when a file system will become full as a coarse approximation to when that file system will become unavailable.
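One very coarse way to form such a prediction is a straight-line extrapolation from the mean one-year increase. The sketch below is an assumption-laden illustration, not a method from the study: the function name is hypothetical, the default of 14 points per year is the population mean reported above, and the linear form ignores the observation that fullness grows more slowly as file systems age (22 points over two years), so it will tend to predict "full" too early.

```python
def years_until_full(fullness_pct, points_per_year=14.0):
    """Coarse linear estimate of years until a file system reaches 100% full.

    fullness_pct: current fullness, in percent of capacity.
    points_per_year: assumed constant yearly increase, in percentage points.
    """
    if not 0.0 <= fullness_pct <= 100.0:
        raise ValueError("fullness_pct must be between 0 and 100")
    if points_per_year <= 0:
        raise ValueError("points_per_year must be positive")
    return (100.0 - fullness_pct) / points_per_year

# A hypothetical file system that is 58% full today.
estimate = years_until_full(58.0)
print(estimate)
```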

6 Related Work

This research extends our earlier work in measuring and modeling file-system metadata on Windows work-


Conclusions

• Usage trends advise:

• More careful data and metadata placement.

• Specialized treatment for popular file extensions.

• Metadata structures optimized for small file lists.