a five-year study of file-system metadataalumni.soe.ucsc.edu/~ajnelson/talks/ajn-talk-agrawal...a...
Post on 17-Feb-2020
0 Views
Preview:
TRANSCRIPT
A Five-Year Study of File-System Metadata
Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch
Slides by Alex NelsonFor UCSC CMPS 229, 2010-04-06
Large, longitudinal study
• Goal: Understand file system use trends.
• Why has “fullness” declined only slightly?
• Task: Study five years of desktop storage use.
• Analyzed 63K systems over 5 years.4 billion files, 700TB data.
Many planners and developers need this
• File system developers
• Disk manufacturers
• File system utility developers
• Backup, antivirus, indexing, encryption
• SAN designers, capacity planners,multitier storage system developers
Related studies:Single-time snapshots• This study dwarfs most others at single
times: 11-16K file systems per year.
• Mullender84: 1 Unix system
• Irlam93: 1050 Unix file systems
• Sienknecht94: 46 HP-UX systems, 267 file systems
• Douceur98: 10K desktops
Related studies:Longitudinal studies
• This study is longer-term than others:5 years.
• Bennett91: 1 day, 3 file servers.
• Smith94: 10 months, 4 file servers, 48 file systems
Overview
• Methodology
• File trends
• Directory trends
• Space usage trends
• Conclusion
Methodology:
• Data collection: By voluntary response, lottery incentive to participate, 1x/year.
• Data collected: Metadata, directory structure snapshot.
• File systems: NTFS, FAT32, FAT.
• Cohort identifiers: User name, computer name, volume ID, drive letter, total space.
File trend:File counts growing
• Metadata tables will be more loaded.
• O(File #) operations should expect more work
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)
Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
File trend:Size proportions same
• Many small files take a small amount of the total space.
• Impacts metadata and data placement.
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)
Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
File trend:Space/file increasing
• Space used is also becoming bimodal.
• Different file types have different size distributions.
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)
Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)
Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
File trend:New, larger file types
0
20
40
60
80
100
4M512K64K8K1K128162Cum
ulat
ive%
offile
syst
ems
File count (log scale)
20002001200220032004
Figure 1: CDFs of file systems by file count
0
2000
4000
6000
8000
10000
12000
128M8M512K32K2K12880
File
spe
rfile
syst
em
File size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 2: Histograms of files by size
for directories. Thus, file system designers should en-sure their metadata tables scale to large file counts. Ad-ditionally, we can expect file system scans that examinedata proportional to the number of files and/or directo-ries to take progressively longer. Examples of such scansinclude virus scans and metadata integrity checks fol-lowing block corruption. Thus, it will become increas-ingly useful to perform these checks efficiently, perhapsby scanning in an order that minimizes movement of thedisk arm.
3.2 File size
This section describes our findings regarding file size.We report the size of actual content, ignoring the effectsof internal fragmentation, file metadata, and any otheroverhead. We observe that the overall file size distribu-tion has changed slightly over the five years of our study.By contrast, the majority of stored bytes are found in in-creasingly larger files. Moreover, the latter distributionincreasingly exhibits a double mode, due mainly to data-base and blob (binary large object) files.
Figure 2 plots histograms of files by size and Figure 3plots the corresponding CDFs. We see that the absolutecount of files per file system has grown significantly overtime, but the general shape of the distribution has not
0
20
40
60
80
100
256M16M1M64K4K256161
Cum
ulat
ive%
offile
s
File size (bytes, log scale)
20002001200220032004
Figure 3: CDFs of files by size
0200400600800
10001200140016001800
64G8G1G128M16M2M256K32K4K512Used
spac
epe
rfile
syst
em(M
B)
Containing file size (bytes, log scale, power-of-2 bins)
20002001200220032004
Figure 4: Histograms of bytes by containing file size
changed significantly. Although it is not visible on thegraph, the arithmetic mean file size has grown by 75%from 108 KB to 189 KB. In each year, 1–1.5% of fileshave a size of zero.
The growth in mean file size from 108 KB to 189 KBover four years suggests that this metric grows roughly15% per year. Another way to estimate this growth rate isto compare our 2000 result to the 1981 result of 13.4 KBobtained by Satyanarayanan [24]. This comparison esti-mates the annual growth rate as 12%. Note that this latterestimate is somewhat flawed, since it compares file sizesfrom two rather different environments.
0
20
40
60
80
100
128G16G2G256M32M4M512K64K8K1K
Cum
ulat
ive%
ofus
edsp
ace
Containing file size (bytes, log scale)
20002001200220032004
Figure 5: CDFs of bytes by containing file size
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association34
0
200
400
600
800
1000
1200
1400
1600
1800
512-1K
4K-8K 32K-64K
256K-512K
2M-4M
16M-32M
128M-256M
1G-2G
8G-16G
64G-128G
Containing file size (bytes, log scale, power-of-2 bins)
Use
dsp
ace
per
FS(M
B)
Others Video DB Blob
Figure 6: Contribution of file types to Figure 4 (2004).Video means files with extension avi, dps, mpeg, mpg,vob, or wmv; DB means files with extension ldf, mad,mdf, ndf, ost, or pst; and Blob means files namedhiberfil.sys and files with extension bak, bkf,bkp, dmp, gho, iso, pqi, rbf, or vhd.
Figure 4 plots histograms of bytes by containingfile size, alternately described as histograms of filesweighted by file size. Figure 5 plots CDFs of these distri-butions. We observe that the distribution of file size hasshifted to the right over time, with the median weightedfile size increasing from 3 MB to 9 MB. Also, the distri-bution exhibits a double mode that has become progres-sively more pronounced. The corresponding distributionin our 1998 study did not show a true second mode, butit did show an inflection point around 64 MB, which isnear the local minimum in Figure 4.
To study this second peak, we broke out several cate-gories of files according to file-name extension. Figure 6replots the 2004 data from Figure 4 as a stacked bar chart,with the contributions of video, database, and blob filesindicated. We see that most of the bytes in large files arein video, database, and blob files, and that most of thevideo, database, and blob bytes are in large files.
Our finding that different types of files have differ-ent size distributions echoes the findings of other stud-ies. In 1981, Satyanarayanan [24] found this to be thecase on a shared file server in an academic environment.In 2001, Evans and Kuenning also noted this phenom-enon in their analysis of 22 machines running variousoperating systems at Harvey Mudd College and MarineBiological Laboratories [11]. The fact that this findingis consistent across various different environments andtimes suggests that it is fundamental.
02000400060008000
10000120001400016000
14y1.8y81d10d31hr3.8hr
File
spe
rfile
syst
em
File age (log scale, power-of-2 bins)
20002001200220032004
Figure 7: Histograms of files by age
0
20
40
60
80
100
7.1y0.9y41d5d15.2hr2hr
Cum
ulat
ive%
offile
s
File age (log scale, power-of-2 bins)
20002001200220032004
Figure 8: CDFs of files by age
There are several implications of the fact that a largenumber of small files account for a small fraction of diskusage, such as the following. First, it may not take muchspace to colocate many of these files with their meta-data. This may be a reasonable way to reduce the diskseek time needed to access these files. Second, a filesystem that colocates several files in a single block, likeReiserFS [22], will have many opportunities to do so.This will save substantial space by eliminating internalfragmentation, especially if a large block size is used toimprove performance. Third, designers of disk usage vi-sualization utilities may want to show not only directo-ries but also the names of certain large files.
3.3 File age
This subsection describes our findings regarding fileage. Because file timestamps can be modified by applica-tion programs [17], our conclusions should be regardedcautiously.
Figure 7 plots histograms of files by age, calculated asthe elapsed time since the file was created or last modi-fied, relative to the time of the snapshot. Figure 8 showsCDFs of this same data. The median file age ranges be-tween 80 and 160 days across datasets, with no cleartrend over time.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 35
• Distribution shown for 2004.
• New data: video, blob (including .vhd).
File trend:Extension consistency
• Most-popular extensions, by file count and agg. file size, are consistent across years.
• Special-case treatment would benefit nearly half the files and half the space.
gif gif gif gif gif
dll dlldll dll dll
h hh h h
htm htmhtm
htm htm
Ø ØØ Ø Ø
cpp cpp cpp cpp cpp
exe exe exe exe exe
jpg jpg jpg jpg jpg
txt txt txt txt txt
0%5%10%15%20%25%30%35%40%45%50%
2000 2001 2002 2003 2004
%of
files
Figure 9: Fraction of files with popular extensions
The distribution of file age is not memoryless, so theage of a file is useful in predicting its remaining lifetime.So, systems such as archival backup systems can use thisdistribution to make predictions of how much longer afile will be needed based on how old it is. Since the dis-tribution of file age has not appreciably changed acrossthe years, we can expect that a prediction algorithm de-veloped today based on the latest distribution will applyfor several years to come.
3.4 File-name extensions
This subsection describes our findings regarding pop-ular file types, as determined by file-name extension. Al-though the top few extensions have not changed dramat-ically over our five-year sample period, there has beensome change, reflecting a decline in the relative preva-lence of web content and an increase in use of virtualmachines. The top few extensions account for nearly halfof all files and bytes in file systems.
In old DOS systems with 8.3-style file names, the ex-tension was the zero to three characters following thesingle dot in the file name. Although Windows systemsallow file names of nearly arbitrary length and contain-ing multiple dots, many applications continue to indicatetheir file types by means of extensions. For our analy-ses, we define an extension as the five-or-fewer charac-ters following the last dot in a file name. If a name has nodots or has more than five characters after the last dot, weconsider that name to have no extension, which we repre-sent with the symbol Ø. As a special case, if a file nameends in .gz, .bz2, and .Z, then we ignore that suffixwhen determining extension. We do this because theseare types of compressed files wherein the actual contenttype is indicated by the characters prior to the compres-sion extension. To understand the typical usage of thefile extensions we discuss in this section, see Table 4.
Extension Typical Usagecpp C++ source codedll Dynamic link libraryexe Executablegif Image in Graphic Interchange Formath Source code headerhtm File in hypertext markup languagejpg Image in JPEG formatlib Code librarymp3 Music file in MPEG Layer III formatpch Precompiled headerpdb Source symbols for debuggingpst Outlook personal foldertxt Textvhd Virtual hard drive for virtual machinewma Windows Media Audio
Table 4: Typical usage of popular file extensions
dll dll dll dll dll
pdb pdb pdb pdb pdb
exeexe exe
exe exe
pstpst pst
pst pst
pchpch pch
pchpch
mp3mp3 mp3
mp3mp3
liblib
liblib
lib
ØØ
ØØ Ø
wma wma wmavhd vhd
0%5%10%15%20%25%30%35%40%45%50%
2000 2001 2002 2003 2004
%of
used
spac
e
wma
Figure 10: Fraction of bytes in files with popular exten-sions
Figure 9 plots, for the nine extensions that are the mostpopular in terms of file count, the fraction of files withthat extension. The fractions are plotted longitudinallyover our five-year sample period. The most notable thingwe observe is that these extensions’ popularity is rela-tively stable—the top five extensions have remained thetop five for this entire time. However, the relative popu-larity of gif files and htm files has gone down steadilysince 2001, suggesting a decline in the popularity of webcontent relative to other ways to fill one’s file system.
Figure 10 plots, for the ten extensions that are the mostpopular in terms of summed file size, the fraction of filebytes residing in files with that extension. Across allyears, dynamic link libraries (dll files) contain morebytes than any other file type. Extension vhd, whichis used for virtual hard drives, is consuming a rapidlyincreasing fraction of file-system space, suggesting that
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association36
gif gif gif gif gif
dll dlldll dll dll
h hh h h
htm htmhtm
htm htm
Ø ØØ Ø Ø
cpp cpp cpp cpp cpp
exe exe exe exe exe
jpg jpg jpg jpg jpg
txt txt txt txt txt
0%5%10%15%20%25%30%35%40%45%50%
2000 2001 2002 2003 2004
%of
files
Figure 9: Fraction of files with popular extensions
The distribution of file age is not memoryless, so theage of a file is useful in predicting its remaining lifetime.So, systems such as archival backup systems can use thisdistribution to make predictions of how much longer afile will be needed based on how old it is. Since the dis-tribution of file age has not appreciably changed acrossthe years, we can expect that a prediction algorithm de-veloped today based on the latest distribution will applyfor several years to come.
3.4 File-name extensions
This subsection describes our findings regarding pop-ular file types, as determined by file-name extension. Al-though the top few extensions have not changed dramat-ically over our five-year sample period, there has beensome change, reflecting a decline in the relative preva-lence of web content and an increase in use of virtualmachines. The top few extensions account for nearly halfof all files and bytes in file systems.
In old DOS systems with 8.3-style file names, the ex-tension was the zero to three characters following thesingle dot in the file name. Although Windows systemsallow file names of nearly arbitrary length and contain-ing multiple dots, many applications continue to indicatetheir file types by means of extensions. For our analy-ses, we define an extension as the five-or-fewer charac-ters following the last dot in a file name. If a name has nodots or has more than five characters after the last dot, weconsider that name to have no extension, which we repre-sent with the symbol Ø. As a special case, if a file nameends in .gz, .bz2, and .Z, then we ignore that suffixwhen determining extension. We do this because theseare types of compressed files wherein the actual contenttype is indicated by the characters prior to the compres-sion extension. To understand the typical usage of thefile extensions we discuss in this section, see Table 4.
Extension Typical Usagecpp C++ source codedll Dynamic link libraryexe Executablegif Image in Graphic Interchange Formath Source code headerhtm File in hypertext markup languagejpg Image in JPEG formatlib Code librarymp3 Music file in MPEG Layer III formatpch Precompiled headerpdb Source symbols for debuggingpst Outlook personal foldertxt Textvhd Virtual hard drive for virtual machinewma Windows Media Audio
Table 4: Typical usage of popular file extensions
dll dll dll dll dll
pdb pdb pdb pdb pdb
exeexe exe
exe exe
pstpst pst
pst pst
pchpch pch
pchpch
mp3mp3 mp3
mp3mp3
liblib
liblib
lib
ØØ
ØØ Ø
wma wma wmavhd vhd
0%5%10%15%20%25%30%35%40%45%50%
2000 2001 2002 2003 2004
%of
used
spac
e
wma
Figure 10: Fraction of bytes in files with popular exten-sions
Figure 9 plots, for the nine extensions that are the mostpopular in terms of file count, the fraction of files withthat extension. The fractions are plotted longitudinallyover our five-year sample period. The most notable thingwe observe is that these extensions’ popularity is rela-tively stable—the top five extensions have remained thetop five for this entire time. However, the relative popu-larity of gif files and htm files has gone down steadilysince 2001, suggesting a decline in the popularity of webcontent relative to other ways to fill one’s file system.
Figure 10 plots, for the ten extensions that are the mostpopular in terms of summed file size, the fraction of filebytes residing in files with that extension. Across allyears, dynamic link libraries (dll files) contain morebytes than any other file type. Extension vhd, whichis used for virtual hard drives, is consuming a rapidlyincreasing fraction of file-system space, suggesting that
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association36
File trend:Age ... inconclusive
• No clear trend or wild deviancy.
• Predictions of future utility from this data likely to stay useful awhile.
0
200
400
600
800
1000
1200
1400
1600
1800
512-1K
4K-8K 32K-64K
256K-512K
2M-4M
16M-32M
128M-256M
1G-2G
8G-16G
64G-128G
Containing file size (bytes, log scale, power-of-2 bins)
Use
dsp
ace
per
FS(M
B)
Others Video DB Blob
Figure 6: Contribution of file types to Figure 4 (2004).Video means files with extension avi, dps, mpeg, mpg,vob, or wmv; DB means files with extension ldf, mad,mdf, ndf, ost, or pst; and Blob means files namedhiberfil.sys and files with extension bak, bkf,bkp, dmp, gho, iso, pqi, rbf, or vhd.
Figure 4 plots histograms of bytes by containingfile size, alternately described as histograms of filesweighted by file size. Figure 5 plots CDFs of these distri-butions. We observe that the distribution of file size hasshifted to the right over time, with the median weightedfile size increasing from 3 MB to 9 MB. Also, the distri-bution exhibits a double mode that has become progres-sively more pronounced. The corresponding distributionin our 1998 study did not show a true second mode, butit did show an inflection point around 64 MB, which isnear the local minimum in Figure 4.
To study this second peak, we broke out several cate-gories of files according to file-name extension. Figure 6replots the 2004 data from Figure 4 as a stacked bar chart,with the contributions of video, database, and blob filesindicated. We see that most of the bytes in large files arein video, database, and blob files, and that most of thevideo, database, and blob bytes are in large files.
Our finding that different types of files have differ-ent size distributions echoes the findings of other stud-ies. In 1981, Satyanarayanan [24] found this to be thecase on a shared file server in an academic environment.In 2001, Evans and Kuenning also noted this phenom-enon in their analysis of 22 machines running variousoperating systems at Harvey Mudd College and MarineBiological Laboratories [11]. The fact that this findingis consistent across various different environments andtimes suggests that it is fundamental.
02000400060008000
10000120001400016000
14y1.8y81d10d31hr3.8hr
File
spe
rfile
syst
em
File age (log scale, power-of-2 bins)
20002001200220032004
Figure 7: Histograms of files by age
0
20
40
60
80
100
7.1y0.9y41d5d15.2hr2hr
Cum
ulat
ive%
offile
sFile age (log scale, power-of-2 bins)
20002001200220032004
Figure 8: CDFs of files by age
There are several implications of the fact that a largenumber of small files account for a small fraction of diskusage, such as the following. First, it may not take muchspace to colocate many of these files with their meta-data. This may be a reasonable way to reduce the diskseek time needed to access these files. Second, a filesystem that colocates several files in a single block, likeReiserFS [22], will have many opportunities to do so.This will save substantial space by eliminating internalfragmentation, especially if a large block size is used toimprove performance. Third, designers of disk usage vi-sualization utilities may want to show not only directo-ries but also the names of certain large files.
3.3 File age
This subsection describes our findings regarding fileage. Because file timestamps can be modified by applica-tion programs [17], our conclusions should be regardedcautiously.
Figure 7 plots histograms of files by age, calculated asthe elapsed time since the file was created or last modi-fied, relative to the time of the snapshot. Figure 8 showsCDFs of this same data. The median file age ranges be-tween 80 and 160 days across datasets, with no cleartrend over time.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 35
0
200
400
600
800
1000
1200
1400
1600
1800
512-1K
4K-8K 32K-64K
256K-512K
2M-4M
16M-32M
128M-256M
1G-2G
8G-16G
64G-128G
Containing file size (bytes, log scale, power-of-2 bins)
Use
dsp
ace
per
FS(M
B)
Others Video DB Blob
Figure 6: Contribution of file types to Figure 4 (2004).Video means files with extension avi, dps, mpeg, mpg,vob, or wmv; DB means files with extension ldf, mad,mdf, ndf, ost, or pst; and Blob means files namedhiberfil.sys and files with extension bak, bkf,bkp, dmp, gho, iso, pqi, rbf, or vhd.
Figure 4 plots histograms of bytes by containingfile size, alternately described as histograms of filesweighted by file size. Figure 5 plots CDFs of these distri-butions. We observe that the distribution of file size hasshifted to the right over time, with the median weightedfile size increasing from 3 MB to 9 MB. Also, the distri-bution exhibits a double mode that has become progres-sively more pronounced. The corresponding distributionin our 1998 study did not show a true second mode, butit did show an inflection point around 64 MB, which isnear the local minimum in Figure 4.
To study this second peak, we broke out several cate-gories of files according to file-name extension. Figure 6replots the 2004 data from Figure 4 as a stacked bar chart,with the contributions of video, database, and blob filesindicated. We see that most of the bytes in large files arein video, database, and blob files, and that most of thevideo, database, and blob bytes are in large files.
Our finding that different types of files have differ-ent size distributions echoes the findings of other stud-ies. In 1981, Satyanarayanan [24] found this to be thecase on a shared file server in an academic environment.In 2001, Evans and Kuenning also noted this phenom-enon in their analysis of 22 machines running variousoperating systems at Harvey Mudd College and MarineBiological Laboratories [11]. The fact that this findingis consistent across various different environments andtimes suggests that it is fundamental.
02000400060008000
10000120001400016000
14y1.8y81d10d31hr3.8hr
File
spe
rfile
syst
em
File age (log scale, power-of-2 bins)
20002001200220032004
Figure 7: Histograms of files by age
0
20
40
60
80
100
7.1y0.9y41d5d15.2hr2hr
Cum
ulat
ive%
offile
s
File age (log scale, power-of-2 bins)
20002001200220032004
Figure 8: CDFs of files by age
There are several implications of the fact that a largenumber of small files account for a small fraction of diskusage, such as the following. First, it may not take muchspace to colocate many of these files with their meta-data. This may be a reasonable way to reduce the diskseek time needed to access these files. Second, a filesystem that colocates several files in a single block, likeReiserFS [22], will have many opportunities to do so.This will save substantial space by eliminating internalfragmentation, especially if a large block size is used toimprove performance. Third, designers of disk usage vi-sualization utilities may want to show not only directo-ries but also the names of certain large files.
3.3 File age
This subsection describes our findings regarding fileage. Because file timestamps can be modified by applica-tion programs [17], our conclusions should be regardedcautiously.
Figure 7 plots histograms of files by age, calculated asthe elapsed time since the file was created or last modi-fied, relative to the time of the snapshot. Figure 8 showsCDFs of this same data. The median file age ranges be-tween 80 and 160 days across datasets, with no cleartrend over time.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 35
File trend:Users writing less
• Unwritten file proportion has grown.
• This bodes well for P2P or rapid-deployment systems.
0
5
10
15
20
100806040200
%of
filesy
stem
s
% of files unwritten (5-percentage-point bins)
20002001200220032004
Figure 11: Histograms of file systems by percentage offiles unwritten
0
20
40
60
80
100
100806040200Cum
ulat
ive%
offile
syst
ems
% of files unwritten
20002001200220032004
Figure 12: CDFs of file systems by percentage of filesunwritten
virtual machine use is increasing. The null extension ex-hibits a notable anomaly in 2003, but we cannot investi-gate the cause without decrypting the file names in ourdatasets, which would violate our privacy policy.
Since files with the same extension have similar prop-erties and requirements, some file system managementpolicies can be improved by including special-case treat-ment for particular extensions. Such special-case treat-ment can be built into the file system or autonomicallyand dynamically learned [16]. Since nearly half the files,and nearly half the bytes, belong to files with a few pop-ular extensions, developing such special-case treatmentfor only a few particular extensions can optimize perfor-mance for a large fraction of the file system. Further-more, since the same extensions continue to be popularyear after year, one can develop special-case treatmentsfor today’s popular extensions and expect that they willstill be useful years from now.
3.5 Unwritten files
Figures 11 and 12 plot histograms and CDFs, respec-tively, of file systems by percentage of files that have notbeen written since they were copied onto the file sys-
0
20
40
60
80
100
64K8K1K128162Cum
ulat
ive%
offile
syst
ems
Directory count (log scale, power-of-2 bins)
20002001200220032004
Figure 13: CDFs of file systems by directory count
tem. We identify such files as ones whose modificationtimestamps are earlier than their creation timestamps,since the creation timestamp of a copied file is set tothe time at which the copy was made, but its modifica-tion timestamp is copied from the original file. Over oursample period, the arithmetic mean of the percentage oflocally unwritten files has grown from 66% to 76%, andthe median has grown from 70% to 78%. This suggeststhat users locally contribute to a decreasing fraction oftheir systems’ content. This may in part be due to theincreasing amount of total content over time.
Since more and more files are being copied acrossfile systems rather than generated locally, we can expectidentifying and coalescing identical copies to become in-creasingly important in systems that aggregate file sys-tems. Examples of systems with such support are theFARSITE distributed file system [1], the Pastiche peer-to-peer backup system [8], and the Single Instance Storein Windows file servers [5].
4 Directories
4.1 Directory count per file system
Figure 13 plots CDFs of file systems by count of di-rectories. The count of directories per file system hasincreased steadily over our five-year sample period: Thearithmetic mean has grown from 2400 to 8900 directoriesand the median has grown from 1K to 4K directories.
We discussed implications of the rising number of di-rectories per file system earlier, in §3.1.
4.2 Directory size
This section describes our findings regarding direc-tory size, measured by count of contained files, count ofcontained subdirectories, and total entry count. None ofthese size distributions has changed appreciably over oursample period, but the mean count of files per directoryhas decreased slightly.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 37
0
5
10
15
20
100806040200
%of
filesy
stem
s
% of files unwritten (5-percentage-point bins)
20002001200220032004
Figure 11: Histograms of file systems by percentage offiles unwritten
0
20
40
60
80
100
100806040200Cum
ulat
ive%
offile
syst
ems
% of files unwritten
20002001200220032004
Figure 12: CDFs of file systems by percentage of filesunwritten
virtual machine use is increasing. The null extension ex-hibits a notable anomaly in 2003, but we cannot investi-gate the cause without decrypting the file names in ourdatasets, which would violate our privacy policy.
Since files with the same extension have similar prop-erties and requirements, some file system managementpolicies can be improved by including special-case treat-ment for particular extensions. Such special-case treat-ment can be built into the file system or autonomicallyand dynamically learned [16]. Since nearly half the files,and nearly half the bytes, belong to files with a few pop-ular extensions, developing such special-case treatmentfor only a few particular extensions can optimize perfor-mance for a large fraction of the file system. Further-more, since the same extensions continue to be popularyear after year, one can develop special-case treatmentsfor today’s popular extensions and expect that they willstill be useful years from now.
3.5 Unwritten files
Figures 11 and 12 plot histograms and CDFs, respec-tively, of file systems by percentage of files that have notbeen written since they were copied onto the file sys-
0
20
40
60
80
100
64K8K1K128162Cum
ulat
ive%
offile
syst
ems
Directory count (log scale, power-of-2 bins)
20002001200220032004
Figure 13: CDFs of file systems by directory count
tem. We identify such files as ones whose modificationtimestamps are earlier than their creation timestamps,since the creation timestamp of a copied file is set tothe time at which the copy was made, but its modifica-tion timestamp is copied from the original file. Over oursample period, the arithmetic mean of the percentage oflocally unwritten files has grown from 66% to 76%, andthe median has grown from 70% to 78%. This suggeststhat users locally contribute to a decreasing fraction oftheir systems’ content. This may in part be due to theincreasing amount of total content over time.
Since more and more files are being copied acrossfile systems rather than generated locally, we can expectidentifying and coalescing identical copies to become in-creasingly important in systems that aggregate file sys-tems. Examples of systems with such support are theFARSITE distributed file system [1], the Pastiche peer-to-peer backup system [8], and the Single Instance Storein Windows file servers [5].
4 Directories
4.1 Directory count per file system
Figure 13 plots CDFs of file systems by count of di-rectories. The count of directories per file system hasincreased steadily over our five-year sample period: Thearithmetic mean has grown from 2400 to 8900 directoriesand the median has grown from 1K to 4K directories.
We discussed implications of the rising number of di-rectories per file system earlier, in §3.1.
4.2 Directory size
This section describes our findings regarding direc-tory size, measured by count of contained files, count ofcontained subdirectories, and total entry count. None ofthese size distributions has changed appreciably over oursample period, but the mean count of files per directoryhas decreased slightly.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 37
Directory trends:Size is consistent
• Counts are rising. Median size is not, neither for contained files nor subdirectories.
• Colocating metadata with directories effective and cheap.
0
20
40
60
80
100
0 10 20 30 40 50
Cum
ulat
ive%
ofdi
rect
orie
s
Count of contained files
20002001200220032004
Figure 14: CDFs of directories by file count
0
20
40
60
80
100
0 2 4 6 8 10
Cum
ulat
ive%
ofdi
rect
orie
s
Count of subdirectories
20002001200220032004
2004 model
Figure 15: CDFs of directories by subdirectory count
Figure 14 plots CDFs of directories by size, as mea-sured by count of files in the directory. It shows that al-though the absolute count of directories per file systemhas grown significantly over time, the distribution hasnot changed appreciably. Across all years, 23–25% ofdirectories contain no files, which marks a change from1998, in which only 18% contained no files and therewere more directories containing one file than those con-taining none. The arithmetic mean directory size hasdecreased slightly and steadily from 12.5 to 10.2 overthe sample period, but the median directory size has re-mained steady at 2 files.
Figure 15 plots CDFs of directories by size, as mea-sured by count of subdirectories in the directory. It in-cludes a model approximation we will discuss later in§4.5. This distribution has remained unchanged over oursample period. Across all years, 65–67% of directoriescontain no subdirectories, which is similar to the 69%found in 1998.
Figure 16 plots CDFs of directories by size, as mea-sured by count of total entries in the directory. This dis-tribution has remained largely unchanged over our sam-ple period. Across all years, 46–49% of directories con-tain two or fewer entries.
Since there are so many directories with a small num-ber of files, it would not take much space to colocate
0
20
40
60
80
100
0 10 20 30 40 50
Cum
ulat
ive%
ofdi
rect
orie
s
Count of entries
20002001200220032004
Figure 16: CDFs of directories by entry count
0%
5%
10%
15%
20%
Wind
ows
Prog
ram
Files
Docu
men
tsan
dSe
ttings
Wind
ows
Prog
ram
Files
Docu
men
tsan
dSe
ttings
Files Bytes
%of
files
orby
tes
2000 2001 2002 2003 2004
Figure 17: Fraction of files and bytes in special subtrees
the metadata for most of those files with those directo-ries. Such a layout would reduce seeks associated withfile accesses. Therefore, it might be useful to preallocatea small amount of space near a new directory to hold amodest amount of child metadata. Similarly, most direc-tories contain fewer than twenty entries, suggesting usingan on-disk structure for directories that optimizes for thiscommon case.
4.3 Special directories
This section describes our findings regarding the usageof Windows special directories. We find that an increas-ing fraction of file-system storage is in the namespacesubtree devoted to system files, and the same holds forthe subtree devoted to user documents and settings.
Figure 17 plots the fraction of file-system files thatreside within subtrees rooted in each of three spe-cial directories: Windows, Program Files, andDocuments and Settings. This figure also plotsthe fraction of file-system bytes contained within each ofthese special subtrees.
For the Windows subtree, the fractions of files andbytes have both risen from 2–3% to 11% over our sam-ple period, suggesting that an increasingly large fractionof file-system storage is devoted to system files. In par-
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association38
Namespace tree:Depth increasing
• Namespace depth simple to model:
• Attach new directory to randomly selected extant directory d withweight c(d)+2.
0200400600800
1000120014001600
0 2 4 6 8 10 12 14 16Aver
age
#of
dire
ctor
ies
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 18: Histograms of directories by namespace depth
ticular, we note that Windows XP was released betweenthe times of our 2000 and 2001 data collections.
For the Program Files subtree, the fractions offiles and bytes have trended in opposite directions withinthe range of 12–16%. For the Documents andSettings subtree, the fraction of bytes has increaseddramatically while the fraction of files has remained rel-atively stable.
The fraction of all files accounted for by these subtreeshas risen from 25% to 40%, and the fraction of bytestherein has risen from 30% to 41%, suggesting that ap-plication writers and end users have increasingly adoptedWindows’ prescriptive namespace organization [7].
Backup software generally does not have to back upsystem files, since they are static and easily restored.Since system files are accounting for a larger and largerfraction of used space, it is becoming more and moreuseful for backup software to exclude these files.
On the other hand, files in the Documents and Set-tings folder tend to be the most important files to back up,since they contain user-generated content and configura-tion information. Since the percentage of bytes devotedto these files is increasing, backup capacity plannersshould expect, surprisingly, that their capacity require-ments will increase faster than disk capacity is plannedto grow. On the other hand, the percentage of files is notincreasing, so they need not expect metadata storage re-quirements to scale faster than disk capacity. This maybe relevant if metadata is backed up in a separate repos-itory from the data, as done by systems such as EMCCentera [13].
4.4 Namespace tree depth
This section describes our findings regarding the depthof directories, files, and bytes in the namespace tree. Wefind that there are many files deep in the namespace tree,especially at depth 7. Also, we find that files deeperin the namespace tree tend to be orders-of-magnitudesmaller than shallower files.
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
ofdi
rect
orie
s
Namespace depth
20002001200220032004
2004 model
Figure 19: CDFs of directories by namespace depth
0
5000
10000
15000
20000
0 2 4 6 8 10 12 14 16
Aver
age
#of
files
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 20: Histograms of files by namespace depth
Figure 18 plots histograms of directories by their depthin the namespace tree, and Figure 19 plots CDFs of thissame data; it also includes a model approximation wewill discuss later in §4.5. The general shape of the distri-bution has remained consistent over our sample period,but the arithmetic mean has grown from 6.1 to 6.9, andthe median directory depth has increased from 5 to 6.
Figure 20 plots histograms of file count by depth in thenamespace tree, and Figure 21 plots CDFs of this samedata. With a few exceptions, such as at depths 2, 3, and7, these distributions roughly track the observed distribu-tions of directory depth, indicating that the count of files
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
offile
s
Namespace depth
20002001200220032004
Figure 21: CDFs of files by namespace depth
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 39
0200400600800
1000120014001600
0 2 4 6 8 10 12 14 16Aver
age
#of
dire
ctor
ies
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 18: Histograms of directories by namespace depth
ticular, we note that Windows XP was released betweenthe times of our 2000 and 2001 data collections.
For the Program Files subtree, the fractions offiles and bytes have trended in opposite directions withinthe range of 12–16%. For the Documents andSettings subtree, the fraction of bytes has increaseddramatically while the fraction of files has remained rel-atively stable.
The fraction of all files accounted for by these subtreeshas risen from 25% to 40%, and the fraction of bytestherein has risen from 30% to 41%, suggesting that ap-plication writers and end users have increasingly adoptedWindows’ prescriptive namespace organization [7].
Backup software generally does not have to back upsystem files, since they are static and easily restored.Since system files are accounting for a larger and largerfraction of used space, it is becoming more and moreuseful for backup software to exclude these files.
On the other hand, files in the Documents and Set-tings folder tend to be the most important files to back up,since they contain user-generated content and configura-tion information. Since the percentage of bytes devotedto these files is increasing, backup capacity plannersshould expect, surprisingly, that their capacity require-ments will increase faster than disk capacity is plannedto grow. On the other hand, the percentage of files is notincreasing, so they need not expect metadata storage re-quirements to scale faster than disk capacity. This maybe relevant if metadata is backed up in a separate repos-itory from the data, as done by systems such as EMCCentera [13].
4.4 Namespace tree depth
This section describes our findings regarding the depthof directories, files, and bytes in the namespace tree. Wefind that there are many files deep in the namespace tree,especially at depth 7. Also, we find that files deeperin the namespace tree tend to be orders-of-magnitudesmaller than shallower files.
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
ofdi
rect
orie
s
Namespace depth
20002001200220032004
2004 model
Figure 19: CDFs of directories by namespace depth
0
5000
10000
15000
20000
0 2 4 6 8 10 12 14 16
Aver
age
#of
files
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 20: Histograms of files by namespace depth
Figure 18 plots histograms of directories by their depthin the namespace tree, and Figure 19 plots CDFs of thissame data; it also includes a model approximation wewill discuss later in §4.5. The general shape of the distri-bution has remained consistent over our sample period,but the arithmetic mean has grown from 6.1 to 6.9, andthe median directory depth has increased from 5 to 6.
Figure 20 plots histograms of file count by depth in thenamespace tree, and Figure 21 plots CDFs of this samedata. With a few exceptions, such as at depths 2, 3, and7, these distributions roughly track the observed distribu-tions of directory depth, indicating that the count of files
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
offile
s
Namespace depth
20002001200220032004
Figure 21: CDFs of files by namespace depth
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 39
Namespace tree:Files getting deeper
• Mean of file depth increasing likedirectory depth
• Many files deep in hierarchy means deep path lookup should be optimized.
0200400600800
1000120014001600
0 2 4 6 8 10 12 14 16Aver
age
#of
dire
ctor
ies
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 18: Histograms of directories by namespace depth
ticular, we note that Windows XP was released betweenthe times of our 2000 and 2001 data collections.
For the Program Files subtree, the fractions offiles and bytes have trended in opposite directions withinthe range of 12–16%. For the Documents andSettings subtree, the fraction of bytes has increaseddramatically while the fraction of files has remained rel-atively stable.
The fraction of all files accounted for by these subtreeshas risen from 25% to 40%, and the fraction of bytestherein has risen from 30% to 41%, suggesting that ap-plication writers and end users have increasingly adoptedWindows’ prescriptive namespace organization [7].
Backup software generally does not have to back upsystem files, since they are static and easily restored.Since system files are accounting for a larger and largerfraction of used space, it is becoming more and moreuseful for backup software to exclude these files.
On the other hand, files in the Documents and Set-tings folder tend to be the most important files to back up,since they contain user-generated content and configura-tion information. Since the percentage of bytes devotedto these files is increasing, backup capacity plannersshould expect, surprisingly, that their capacity require-ments will increase faster than disk capacity is plannedto grow. On the other hand, the percentage of files is notincreasing, so they need not expect metadata storage re-quirements to scale faster than disk capacity. This maybe relevant if metadata is backed up in a separate repos-itory from the data, as done by systems such as EMCCentera [13].
4.4 Namespace tree depth
This section describes our findings regarding the depthof directories, files, and bytes in the namespace tree. Wefind that there are many files deep in the namespace tree,especially at depth 7. Also, we find that files deeperin the namespace tree tend to be orders-of-magnitudesmaller than shallower files.
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
ofdi
rect
orie
s
Namespace depth
20002001200220032004
2004 model
Figure 19: CDFs of directories by namespace depth
0
5000
10000
15000
20000
0 2 4 6 8 10 12 14 16
Aver
age
#of
files
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 20: Histograms of files by namespace depth
Figure 18 plots histograms of directories by their depthin the namespace tree, and Figure 19 plots CDFs of thissame data; it also includes a model approximation wewill discuss later in §4.5. The general shape of the distri-bution has remained consistent over our sample period,but the arithmetic mean has grown from 6.1 to 6.9, andthe median directory depth has increased from 5 to 6.
Figure 20 plots histograms of file count by depth in thenamespace tree, and Figure 21 plots CDFs of this samedata. With a few exceptions, such as at depths 2, 3, and7, these distributions roughly track the observed distribu-tions of directory depth, indicating that the count of files
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
offile
s
Namespace depth
20002001200220032004
Figure 21: CDFs of files by namespace depth
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 39
0200400600800
1000120014001600
0 2 4 6 8 10 12 14 16Aver
age
#of
dire
ctor
ies
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 18: Histograms of directories by namespace depth
ticular, we note that Windows XP was released betweenthe times of our 2000 and 2001 data collections.
For the Program Files subtree, the fractions offiles and bytes have trended in opposite directions withinthe range of 12–16%. For the Documents andSettings subtree, the fraction of bytes has increaseddramatically while the fraction of files has remained rel-atively stable.
The fraction of all files accounted for by these subtreeshas risen from 25% to 40%, and the fraction of bytestherein has risen from 30% to 41%, suggesting that ap-plication writers and end users have increasingly adoptedWindows’ prescriptive namespace organization [7].
Backup software generally does not have to back upsystem files, since they are static and easily restored.Since system files are accounting for a larger and largerfraction of used space, it is becoming more and moreuseful for backup software to exclude these files.
On the other hand, files in the Documents and Set-tings folder tend to be the most important files to back up,since they contain user-generated content and configura-tion information. Since the percentage of bytes devotedto these files is increasing, backup capacity plannersshould expect, surprisingly, that their capacity require-ments will increase faster than disk capacity is plannedto grow. On the other hand, the percentage of files is notincreasing, so they need not expect metadata storage re-quirements to scale faster than disk capacity. This maybe relevant if metadata is backed up in a separate repos-itory from the data, as done by systems such as EMCCentera [13].
4.4 Namespace tree depth
This section describes our findings regarding the depthof directories, files, and bytes in the namespace tree. Wefind that there are many files deep in the namespace tree,especially at depth 7. Also, we find that files deeperin the namespace tree tend to be orders-of-magnitudesmaller than shallower files.
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
ofdi
rect
orie
s
Namespace depth
20002001200220032004
2004 model
Figure 19: CDFs of directories by namespace depth
0
5000
10000
15000
20000
0 2 4 6 8 10 12 14 16
Aver
age
#of
files
perF
S
Namespace depth (bin size 1)
20002001200220032004
Figure 20: Histograms of files by namespace depth
Figure 18 plots histograms of directories by their depthin the namespace tree, and Figure 19 plots CDFs of thissame data; it also includes a model approximation wewill discuss later in §4.5. The general shape of the distri-bution has remained consistent over our sample period,but the arithmetic mean has grown from 6.1 to 6.9, andthe median directory depth has increased from 5 to 6.
Figure 20 plots histograms of file count by depth in thenamespace tree, and Figure 21 plots CDFs of this samedata. With a few exceptions, such as at depths 2, 3, and7, these distributions roughly track the observed distribu-tions of directory depth, indicating that the count of files
0
20
40
60
80
100
0 2 4 6 8 10 12 14 16
Cum
ulat
ive%
offile
s
Namespace depth
20002001200220032004
Figure 21: CDFs of files by namespace depth
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 39
Space usage:Uniform over time
0
20
40
60
80
100
256G64G16G4G1G256MCum
ulat
ive%
offile
syst
ems
File system capacity (bytes)
20002001200220032004
Figure 24: CDFs of file systems by storage capacity
0
20
40
60
80
100
256G64G16G4G1G256M64M16M4MCum
ulat
ive%
offile
syst
ems
Used space (bytes)
20002001200220032004
Figure 25: CDFs of file systems by total consumed space
Intuitively, the proportional probability c(d) + 2 canbe interpreted as follows: If a directory already has somesubdirectories, it has demonstrated that it is a useful lo-cation for subdirectories, and so it is a likely place formore subdirectories to be created. The more subdirecto-ries it has, the more demonstrably useful it has been asa subdirectory home, so the more likely it is to continueto spawn new subdirectories. If the probability were pro-portional to c(d) without any offset, then an empty di-rectory could never become non-empty, so some offsetis necessary. We found an offset of 2 to match our ob-served distributions very closely for all five years of ourcollected data, but we do not understand why the partic-ular value of 2 should be appropriate.
5 Space Usage
5.1 Capacity and usage
Figure 24 plots CDFs of file system volumes by stor-age capacity, which has increased dramatically over ourfive-year sample period: The arithmetic mean has grownfrom 8 GB to 46 GB and the median has grown from5 GB to 40 GB. The number of small-capacity file sys-tem volumes has dropped dramatically: Systems of 4 GBor less have gone from 43% to 4% of all file systems.
0
20
40
60
80
100
100806040200Cum
ulat
ive%
offile
syst
ems
Fullness percentage
20002001200220032004
Figure 26: CDFs of file systems by fullness
Figure 25 plots CDFs of file systems by total con-sumed space, including not only file content but alsospace consumed by internal fragmentation, file metadata,and the system paging file. Space consumption increasedsteadily over our five-year sample period: The geomet-ric mean has grown from 1 GB to 9 GB, the arithmeticmean has grown from 3 GB to 18 GB, and the medianhas grown from 2 GB to 13 GB.
Figure 26 plots CDFs of file systems by percentageof fullness, meaning the consumed space relative to ca-pacity. The distribution is very nearly uniform for allyears, as it was in our 1998 study. The mean fullness hasdropped slightly from 49% to 45%, and the median filesystem has gone from 47% full to 42% full. By con-trast, the aggregate fullness of our sample population,computed as total consumed space divided by total file-system capacity, has held steady at 41% over all years.
In any given year, the range of file system capacities inthis organization is quite large. This means that softwaremust be able to accommodate a wide range of capacitiessimultaneously existing within an organization. For in-stance, a peer-to-peer backup system must be aware thatsome machines will have drastically more capacity thanothers. File system designs, which must last many years,must accommodate even more dramatic capacity differ-entials.
5.2 Changes in usage
This subsection describes our findings regarding howindividual file systems change in fullness over time. Forthis part of our work, we examined the 6536 snapshotpairs that correspond to the same file system in two con-secutive years. We also examined the 1320 snapshotpairs that correspond to the same file system two yearsapart. We find that 80% of file systems become fullerover a one-year period, and the mean increase in fullnessis 14 percentage points. This increase is predominantlydue to creation of new files, partly offset by deletion ofold files, rather than due to extant files changing size.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 41
0
20
40
60
80
100
256G64G16G4G1G256MCum
ulat
ive%
offile
syst
ems
File system capacity (bytes)
20002001200220032004
Figure 24: CDFs of file systems by storage capacity
0
20
40
60
80
100
256G64G16G4G1G256M64M16M4MCum
ulat
ive%
offile
syst
ems
Used space (bytes)
20002001200220032004
Figure 25: CDFs of file systems by total consumed space
Intuitively, the proportional probability c(d) + 2 canbe interpreted as follows: If a directory already has somesubdirectories, it has demonstrated that it is a useful lo-cation for subdirectories, and so it is a likely place formore subdirectories to be created. The more subdirecto-ries it has, the more demonstrably useful it has been asa subdirectory home, so the more likely it is to continueto spawn new subdirectories. If the probability were pro-portional to c(d) without any offset, then an empty di-rectory could never become non-empty, so some offsetis necessary. We found an offset of 2 to match our ob-served distributions very closely for all five years of ourcollected data, but we do not understand why the partic-ular value of 2 should be appropriate.
5 Space Usage
5.1 Capacity and usage
Figure 24 plots CDFs of file system volumes by stor-age capacity, which has increased dramatically over ourfive-year sample period: The arithmetic mean has grownfrom 8 GB to 46 GB and the median has grown from5 GB to 40 GB. The number of small-capacity file sys-tem volumes has dropped dramatically: Systems of 4 GBor less have gone from 43% to 4% of all file systems.
0
20
40
60
80
100
100806040200Cum
ulat
ive%
offile
syst
ems
Fullness percentage
20002001200220032004
Figure 26: CDFs of file systems by fullness
Figure 25 plots CDFs of file systems by total con-sumed space, including not only file content but alsospace consumed by internal fragmentation, file metadata,and the system paging file. Space consumption increasedsteadily over our five-year sample period: The geomet-ric mean has grown from 1 GB to 9 GB, the arithmeticmean has grown from 3 GB to 18 GB, and the medianhas grown from 2 GB to 13 GB.
Figure 26 plots CDFs of file systems by percentageof fullness, meaning the consumed space relative to ca-pacity. The distribution is very nearly uniform for allyears, as it was in our 1998 study. The mean fullness hasdropped slightly from 49% to 45%, and the median filesystem has gone from 47% full to 42% full. By con-trast, the aggregate fullness of our sample population,computed as total consumed space divided by total file-system capacity, has held steady at 41% over all years.
In any given year, the range of file system capacities inthis organization is quite large. This means that softwaremust be able to accommodate a wide range of capacitiessimultaneously existing within an organization. For in-stance, a peer-to-peer backup system must be aware thatsome machines will have drastically more capacity thanothers. File system designs, which must last many years,must accommodate even more dramatic capacity differ-entials.
5.2 Changes in usage
This subsection describes our findings regarding howindividual file systems change in fullness over time. Forthis part of our work, we examined the 6536 snapshotpairs that correspond to the same file system in two con-secutive years. We also examined the 1320 snapshotpairs that correspond to the same file system two yearsapart. We find that 80% of file systems become fullerover a one-year period, and the mean increase in fullnessis 14 percentage points. This increase is predominantlydue to creation of new files, partly offset by deletion ofold files, rather than due to extant files changing size.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 41
0
20
40
60
80
100
256G64G16G4G1G256MCum
ulat
ive%
offile
syst
ems
File system capacity (bytes)
20002001200220032004
Figure 24: CDFs of file systems by storage capacity
0
20
40
60
80
100
256G64G16G4G1G256M64M16M4MCum
ulat
ive%
offile
syst
ems
Used space (bytes)
20002001200220032004
Figure 25: CDFs of file systems by total consumed space
Intuitively, the proportional probability c(d) + 2 canbe interpreted as follows: If a directory already has somesubdirectories, it has demonstrated that it is a useful lo-cation for subdirectories, and so it is a likely place formore subdirectories to be created. The more subdirecto-ries it has, the more demonstrably useful it has been asa subdirectory home, so the more likely it is to continueto spawn new subdirectories. If the probability were pro-portional to c(d) without any offset, then an empty di-rectory could never become non-empty, so some offsetis necessary. We found an offset of 2 to match our ob-served distributions very closely for all five years of ourcollected data, but we do not understand why the partic-ular value of 2 should be appropriate.
5 Space Usage
5.1 Capacity and usage
Figure 24 plots CDFs of file system volumes by stor-age capacity, which has increased dramatically over ourfive-year sample period: The arithmetic mean has grownfrom 8 GB to 46 GB and the median has grown from5 GB to 40 GB. The number of small-capacity file sys-tem volumes has dropped dramatically: Systems of 4 GBor less have gone from 43% to 4% of all file systems.
0
20
40
60
80
100
100806040200Cum
ulat
ive%
offile
syst
ems
Fullness percentage
20002001200220032004
Figure 26: CDFs of file systems by fullness
Figure 25 plots CDFs of file systems by total con-sumed space, including not only file content but alsospace consumed by internal fragmentation, file metadata,and the system paging file. Space consumption increasedsteadily over our five-year sample period: The geomet-ric mean has grown from 1 GB to 9 GB, the arithmeticmean has grown from 3 GB to 18 GB, and the medianhas grown from 2 GB to 13 GB.
Figure 26 plots CDFs of file systems by percentageof fullness, meaning the consumed space relative to ca-pacity. The distribution is very nearly uniform for allyears, as it was in our 1998 study. The mean fullness hasdropped slightly from 49% to 45%, and the median filesystem has gone from 47% full to 42% full. By con-trast, the aggregate fullness of our sample population,computed as total consumed space divided by total file-system capacity, has held steady at 41% over all years.
In any given year, the range of file system capacities inthis organization is quite large. This means that softwaremust be able to accommodate a wide range of capacitiessimultaneously existing within an organization. For in-stance, a peer-to-peer backup system must be aware thatsome machines will have drastically more capacity thanothers. File system designs, which must last many years,must accommodate even more dramatic capacity differ-entials.
5.2 Changes in usage
This subsection describes our findings regarding howindividual file systems change in fullness over time. Forthis part of our work, we examined the 6536 snapshotpairs that correspond to the same file system in two con-secutive years. We also examined the 1320 snapshotpairs that correspond to the same file system two yearsapart. We find that 80% of file systems become fullerover a one-year period, and the mean increase in fullnessis 14 percentage points. This increase is predominantlydue to creation of new files, partly offset by deletion ofold files, rather than due to extant files changing size.
FAST ’07: 5th USENIX Conference on File and Storage TechnologiesUSENIX Association 41
Space usage:Cohorts
• Most file systems became fuller after one year, by 14% on average.
• Due to creating new files, not due tochanging old ones.
02468
10121416
100500-50-100
%of
filesy
stem
s
Fullness increase (percentage points, 5-percentage-point bins)
2000 to 20012001 to 20022002 to 20032003 to 2004
Figure 27: Histograms of file systems by 1-year fullnessincrease
0
20
40
60
80
100
-100 -50 0 50 100Cum
ulat
ive%
offile
syst
ems
Fullness increase (percentage points, 5-percentage-point bins)
2000 to 20012001 to 20022002 to 20032003 to 2004
Figure 28: CDFs of file systems by 1-year fullness in-crease
When comparing two matching snapshots in differentyears, we must establish whether two files in successivesnapshots of the same file system are the same file. Wedo not have access to files’ inode numbers, because col-lecting them would have lengthened our scan times to anunacceptable degree. We thus instead use the followingproxy for file sameness: If the files have the same fullpathname, they are considered the same, otherwise theyare not. This is a conservative approach: It will judge afile to be two distinct files if it or any ancestor directoryhas been renamed.
Figures 27 and 28 plot histograms and CDFs, respec-tively, of file systems by percentage-point increase infullness from one year to the next. We define this term byexample: If a file system was 50% full in 2000 and 60%full in 2001, it exhibited a 10 percentage-point increasein fullness. The distribution is substantially the same forall four pairs of consecutive years. Figure 28 shows that80% of file systems exhibit an increase in fullness andfewer than 20% exhibit a decrease. The mean increasefrom one year to the next is 14 percentage points.
We also examined the increase in fullness over twoyears. We found the mean increase to be 22 percentagepoints. This is less than twice the consecutive-year in-
crease, indicating that as file systems age, they increasetheir fullness at a slower rate. Because we have so fewfile systems with snapshots in four consecutive years, wedid not explore increases over three or more years.
Since file systems that persist for a year tend to in-crease their fullness by about 14 points, but the meanfile-system fullness has dropped from 49% to 45% overour sample period, it seems that the steadily increasingfullness of individual file systems is offset by the replace-ment of old file systems with newer, emptier ones.
Analyzing the factors that contribute to the 14-pointmean year-to-year increase in fullness revealed the fol-lowing breakdown: Fullness increases by 28 percentagepoints due to files that are present in the later snapshotbut not in the earlier one, meaning that they were createdduring the intervening year. Fullness decreases by 15percentage points due to files that are present in the ear-lier snapshot but not in the later one, meaning that theywere deleted during the intervening year. Fullness alsoincreases by 1 percentage point due to growth in the sizeof files that are present in both snapshots. An insignifi-cant fraction of this increase is attributable to changes insystem paging files, internal fragmentation, or metadatastorage.
We examined the size distributions of files that werecreated and of files that were deleted, to see if they dif-fered from the overall file-size distribution. We foundthat they do not differ appreciably. We had hypothesizedthat users tend to delete large files to make room for newcontent, but the evidence does not support this hypothe-sis.
Since deleted files and created files have similar sizedistributions, file system designers need not expect thefraction of files of different sizes to change as a file sys-tem ages. Thus, if they find it useful to assign differentparts of the disk to files of different sizes, they can an-ticipate the allocation of sizes to disk areas to not needradical change as time passes.
Many peer-to-peer systems use free space on comput-ers to store shared data, so the amount of used space is ofgreat importance. With an understanding of how this freespace decreases as a file system ages, a peer-to-peer sys-tem can proactively plan how much it will need to offloadshared data from each file system to make room for ad-ditional local content. Also, since a common reason forupgrading a computer is because its disk space becomesexhausted, a peer-to-peer system can use a prediction ofwhen a file system will become full as a coarse approxi-mation to when that file system will become unavailable.
6 Related Work
This research extends our earlier work in measuringand modeling file-system metadata on Windows work-
FAST ’07: 5th USENIX Conference on File and Storage Technologies USENIX Association42
Conclusions
• Usage trends advise:
• More careful data andmetadata placement.
• Specialized treatment forpopular file extensions.
• Metadata structures optimizedfor small file lists.
top related