
Scalable Storage for Digital Libraries∗

Paul Mather

Department of Computer Science
Virginia Polytechnic Institute and State University

Blacksburg, VA 24061, USA

23 October 2001

Abstract

I propose a storage system optimised for digital libraries. Its key features are its heterogeneous scalability; its integration and exploitation of rich semantic metadata associated with digital objects; its use of a name space; and its aggressive performance optimisation in the digital library domain.

Contents

1 Introduction
2 Background Motivation
3 Scalable Storage
   3.1 Local File Systems
   3.2 Workload Studies
       3.2.1 Workload Characterisation and Modelling
   3.3 Local File System Performance Issues
       3.3.1 Caching
       3.3.2 Clustering and Fragmentation
       3.3.3 Block Sizing and Allocation
       3.3.4 Log-Structured Approaches
   3.4 Disk Arrays
   3.5 Networked Storage
       3.5.1 Intelligent Disks
       3.5.2 Parallel File Systems
       3.5.3 Distributed File Systems
4 Digital Libraries Object Repositories
   4.1 Digital Objects vs. Files
   4.2 Naming and Location
       4.2.1 Uniqueness and Location Dependence
       4.2.2 Uniform Resource Names
       4.2.3 Scalable Object Location
   4.3 Redundant Encoding for Reliability
   4.4 Metadata
       4.4.1 Buckets
       4.4.2 Terms and Conditions
   4.5 Digital Object Repositories
       4.5.1 Kahn-Wilensky and its Extensions
       4.5.2 Other Repository Approaches
5 Proposed Research
   5.1 Measuring Storage Systems
   5.2 Online Workload Characterisation
   5.3 Specialisation
   5.4 Exploiting Low Volatility
   5.5 Decentralising the Name Space
   5.6 Exploiting Object Metadata
   5.7 Improving Performance in a Read-Mostly Repository
   5.8 Summary
6 Research Plan and Expected Contributions

∗ This material was originally submitted as the Ph.D. preliminary exam document of the author. It is reproduced here without modification.

1 Introduction

More than ever before, there is a large amount of information available in digital form. Increasingly, this content is stored in large repositories accessed over wide-area networks. With the dramatic rise of digital libraries, electronic commerce, and the everyday use of networked information sources in business has also come a pressing need for high availability of network-accessible data. In the information age, data are the lifeblood of many organisations, and the constant availability of those data is paramount.

To address the voracious appetite for storage, research in that arena has recently turned towards the issue of scalability—in terms of hardware, software, and administration. Moreover, a current focus has turned towards the use of commodity off-the-shelf (COTS) hardware as a means of achieving scalability. In particular, approaches such as clustering, network-attached storage, and storage area networks are fast becoming industry standards as means for achieving manageable, scalable storage on the order of terabytes (2^40 bytes) and upwards.

The bulk of storage research at the operating system support level has looked at providing file system support and associated functions. Although some architectures for digital library storage repositories have been proposed, with the Kahn-Wilensky architecture [88] finding favour, all digital library implementations thus far have been built atop traditional operating system file systems or databases (which are themselves usually built atop traditional operating system file systems). Such file systems are relatively weak in terms of providing support for the rich metadata and file types typically used in digital libraries.

Traditional file system metadata is geared towards providing the operating system with the logical and physical location of file data, and simple user protection mechanisms. It is typically canonical, and relatively crude compared with metadata for digital libraries, which provides multiple viewpoints of digital objects and potentially sophisticated rights management. In addition, semantically different object (file) types might require radically different storage and access techniques. In traditional file systems, this often must be handled in an ad hoc fashion, often by means of arbitrary partitioning of storage and the use of specialised device drivers.

Here, we aim to reconcile diverse strands of scalable storage design with digital library repository access semantics as the target. Particular emphasis will be placed on scalability in the storage realm, and on metadata support and system support for different object types in the digital libraries realm.

Section 2 presents background motivation. Section 3 contains an overview of related research in scalable storage. A brief guide to relevant work in digital library object repositories and access support is presented in Section 4. Section 5 contains an outline of the problem to be addressed in this research. A plan of work is given in Section 6.

2 Background Motivation

More data are available online than ever before. There is a dramatic increase in the amount available to the public served over wide-area networks, and the level of growth continues. In some areas, data gathering has followed approximately Moore's Law: in astronomy, for example, the volume of data gathered is doubling roughly every 20 months [188].

Already, there are several terabyte-sized repositories of information of various kinds accessible to millions of users worldwide, and more are being proposed.

The TerraServer [13] provides aerial, satellite, and topographic images of Earth delivered via the World Wide Web (WWW). There are eight servers—six WWW servers and two database back-end servers—averaging 5–8 million hits and 50 GB of image downloads per day. As of February 2000, the database comprised over 1.5 terabytes of data. The two database back-end servers can contain up to 3.2 TB and 1.2 TB of data stored using RAID-5 on 324 9 GB Ultra Small Computer Systems Interface (SCSI) and 140 9 GB Fibre Channel disks respectively. The system is configured to handle a maximum of 40 million hits per day, and 6,000 simultaneous users.

The Zoom Project [10, 190] provides access to over 70,000 images of the Fine Arts Museums of San Francisco over the WWW. The images comprise approximately 2.5 TB of storage served from a cluster of 20 storage nodes hosting a total of 396 8.4 GB Ultra-Wide SCSI disks. The storage cluster is called Tertiary Disk [189]. Four of the storage nodes are “disk heavy,” with 70 disks per pair of nodes; the remaining 16 are “CPU-heavy,” with 32 disks per pair of nodes. The storage nodes are networked via a 100 Mbps Ethernet local area network (LAN), and the cluster is connected to the Internet via an asynchronous transfer mode (ATM) network connection. Strings of disks are double-ended for added reliability, meaning that even if one storage node (or the SCSI host adapter) connected to the string fails, the other node can continue to access the disks.

The Sloan Digital Sky Survey (SDSS) intends to provide multi-terabyte astronomical observational data for interactive exploration and query via the Internet [188]. The SDSS expects to collect over 40 terabytes of raw data in the next five years. The proposed architecture will include a 20-node array aggregating 6 TB of database storage and amassing a cumulative 20 billion instructions per second (BIPS) of processing power to scan the SDSS dataset continually, applying user-supplied query predicates. It is estimated that a scan rate of 2 GBps per node across the array can be achieved, allowing the entire dataset to be processed about every two minutes.

Even larger datasets are being planned. The NASA Earth Observing System Data Information System (EOSDIS) project [55, 93] seeks to provide access to a huge collection of remote sensing data collected over a 15-year time span. By 2002, it is predicted that raw data will be collected at a rate of over 360 GB per day, and that the archive will contain over 3 petabytes (2^50 bytes) of data across more than 260 data products. The system will utilise Distributed Active Archive Centers (DAACs) networked together via a high-speed backbone to collect and process the data, and the entire collection will be accessible to users via the WWW using a browser.

Even larger than EOSDIS are the data storage needs of the Large Hadron Collider (LHC) project at CERN [175]. This particle accelerator will enter operation in 2005, and experiments run on it will generate some 5 petabytes of data per year, with data rates in the region of 100 MB to 1.5 GB per second. Experiments will run over 15 years, generating in excess of 100 petabytes. The engineering challenge, however, tackles storage requirements in the exabyte (2^60 bytes) range.

Along with such large-scale projects, ordinary production and consumption of data continues at a pace that has even outpaced Moore's Law. Decision support systems (DSS) are rabid consumers of data, with demand doubling roughly every 9–12 months, in contrast to the approximately 18-month doubling period of Moore's Law [91]. Indeed, the largest DSS database reported in 1997 almost doubled in size from 2.4 TB to 4.4 TB in 1998 alone [205]. The biggest DSS database of 1998 more than tripled from 1.3 TB to 4.6 TB over its previous year's size [206].

The huge rate of increase in data suggests that massive-scale storage architectures need to be able to scale at a faster rate than the growth of processors and disks themselves: we cannot rely on hardware improvements alone to keep pace with the demand for data storage.

3 Scalable Storage

Computer storage has long been the subject of research. With the rise of parallel and distributed systems came a plethora of associated storage systems. For the purposes of this research, this activity can broadly be grouped into the areas of local file systems; parallel file systems; distributed file systems; and scalable networked storage. Within all these areas, attention has been paid to operating system and application programming support; layout, buffering, caching, and efficiency; reliability, fault-tolerance, and crash recovery; and security. Some of these goals are contradictory: high fault tolerance might require lower efficiency, for example.

3.1 Local File Systems

A file system is a means by which data is managed on secondary storage. At the user level, these data are organised using an abstraction known as a file. Files are a logical structuring mechanism that allows data to be organised as a byte-addressable sequence that may be randomly accessed. File systems mask the details of how this logical structure is stored on actual physical devices attached to the system. File systems not only manage the actual long-term storage of data—layout and structuring—but also manage the access to that data—name resolution, buffering, and caching. File systems employ metadata to assist in the organisation of files on storage. Examples of such metadata might be pointers locating the blocks belonging to a given file on a disk and the length of that file, or a bitmap indicating the location of unused blocks on a disk.
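
As a minimal illustration of the kinds of metadata just mentioned (per-file block pointers and length, plus a free-block bitmap), consider the following sketch. It is hypothetical and corresponds to no real on-disk format:

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Minimal per-file metadata: the file's length and where its blocks
    live on disk (illustrative only, not any real inode layout)."""
    length: int = 0
    block_pointers: list = field(default_factory=list)

class BlockBitmap:
    """Free-space metadata: one flag per disk block."""
    def __init__(self, n_blocks):
        self.used = [False] * n_blocks

    def allocate(self):
        for i, used in enumerate(self.used):
            if not used:
                self.used[i] = True
                return i
        raise OSError("disk full")

bitmap = BlockBitmap(n_blocks=1024)
inode = Inode()
for _ in range(3):                 # append three 4K blocks to a file
    inode.block_pointers.append(bitmap.allocate())
    inode.length += 4096
print(inode)
```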

We deem local file systems to be those whose storage is provided by directly-attached secondary storage. Most often, this storage is in the form of direct-access storage such as disk drives. Hard disk technology has improved roughly according to Moore's Law in terms of total storage (areal density) and transfer rate [73, 74]. However, rotational speeds and, especially, seek times have improved at a much slower rate. Head seek times, in particular, are limited by the physics of inertia, and may not see vast improvement. To address this, file system research has explored the use of alternative layout schemes [145], aggregation [31, 67], and aggressive caching and buffering [62, 148], amongst other techniques, to ameliorate the effects of head seek times during reads from and writes to disk.

3.2 Workload Studies

There have been several studies of the content and usage of local file systems [49, 70, 144, 156, 200]. Such studies characterise the workload of the file system. Workloads provide important information to file system architects, and can guide the design of layout schemes; buffering and caching strategies; migration policies; and performance tuning.

One consistent feature to arise out of workload studies for typical engineering and office applications and environments is that most files are small: in Gibson and Miller's study [70] the majority of files are less than 4K in size, and in Douceur and Bolosky [49] the median file size is 4K. The distribution of file sizes is heavy-tailed, with most bytes being concentrated in relatively few large files: most files are small, but large files use most disk capacity. Furthermore, a general trend is that most files are not accessed very often in the long term.

Traces of long-term file activity collected by Gibson and Miller [70] reveal that on a typical day, less than 3% of all files are used, and of that usage, file accesses account for over twice the number of either file creations or deletions, and the number of files modified is one third the number of creations or deletions. In addition, they found that files modified are about as likely to remain the same size as they are to grow, and that modified files rarely get smaller [70]. In fact, the amount of file growth during file modifications found in their study was usually less than 1K, regardless of the file's size [70]. Long-term activity of the file systems surveyed by Douceur and Bolosky [49] determined the median file age to be 48 days, where file age is calculated as the time passed since the most recent of either the file's creation or last modification, relative to when the file system contents snapshot was taken.

Traces capturing the short-term behaviour of file system workloads show that many files and accesses are ephemeral and bursty. The seminal Unix 4.2 BSD study of Ousterhout et al. [144] determined that files are almost always processed sequentially, and that more than 66% are whole-file transfers. Of the remaining non-whole-file transfers, most read sequentially after initially performing an lseek, i.e., to append. When measured over a short time scale, files tend to be very short-lived, or only open for a very short time.

Subsequent studies have confirmed the results of Ousterhout et al., but tending towards extremes. Baker et al. [12] discovered the time between a file being created and subsequently either deleted or truncated to zero length to be less than 30 seconds for 65%–80% of all files, and those files tend to be small, accounting for only about 4%–27% of all new bytes deleted or overwritten within 30 seconds. Vogels [200] found that files are open for even shorter periods: 75% are open for less than 10 milliseconds, and 55% of new files are deleted within 5 seconds; 26% are overwritten within 4 milliseconds. Similarly, both Baker et al. [12] and Vogels [200] confirm that most files accessed are short, and sequentially accessed (but shifting more towards random access), but that most bytes transferred belong to large files, and that large files are getting larger (20% are 4 MB or larger).

Ramakrishnan et al. [156] collected relatively short-term traces of file system activity at eight different customer sites running VAX/VMS over several different time periods at various times of the day and night. The sites represented different workloads (interactive; timesharing and database activity; transaction processing; and batch data processing), but analysis showed that file system activity was consistent across different workload environments. The study revealed that for almost all workloads, most files—over 80%—are inactive (not opened, read, or written during the trace), and of those active files, a small percentage account for most of the file opens (in an airline transaction processing workload, less than 3% of active files account for over 80% of file opens). Over all workloads, over 50% of all active files were opened only once or twice, and 90% of active files were opened less than 10 times in a typical 9–12 hour “prime time” period.

Of those files active during the trace period, over 50% were read only once or twice, and 90% of active files in all but one workload had between 24 and 46 reads. In the case of writes, 31–46% of active files were not written at all, and 14–28% had only one write. Over 50% of the active files had fewer than two write operations. Once again, a small percentage of files often account for most writes, and skew the mean number of writes upwards. For the timesharing workloads studied, 1.3–2% of files accounted for 80–87% of writes. In the case of transaction processing workloads, 1.3–1.8% of files accounted for 95% of writes.

Studies that have tracked the semantic content of files have found there is a good correlation between file type and file size [18, 49, 165].

Analysis of the WWW by Adamic and Huberman [2] reveals that files and sites there follow a power law with respect to many attributes, including file size, site size, popularity, and site linkage, thus confirming in a wider context the “heavy-tailed distribution” phenomenon common to all file system studies.

Some consequences of workload studies are that the name space is dominated by small files, but that transfer bandwidth is dominated by large files. Furthermore, files exhibit a generational behaviour in that if they survive the short term then they will tend to persist—largely inactive—for the long term. The likely uncertain early lifetime makes write-back caching attractive, in the anticipation that a file will subsequently be deleted or modified before actually being written to disk. The more of this volatility that can be smoothed out by caching or buffering, the better.
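
The "small files dominate the name space, large files dominate the bytes" effect is easy to reproduce with synthetic data. The sketch below draws file sizes from a heavy-tailed lognormal distribution; the parameters are illustrative and are not fitted to any real trace:

```python
import random

# Illustrating the heavy-tailed file size phenomenon with synthetic sizes.
# mu and sigma are invented for illustration, not fitted to a real trace.
random.seed(1)
sizes = sorted(int(random.lognormvariate(8, 2.5)) for _ in range(100_000))

total_bytes = sum(sizes)
small_files = sum(1 for s in sizes if s < 4096)
top1_bytes = sum(sizes[-1000:])       # bytes held by the largest 1% of files

print(f"files under 4 KB:    {100 * small_files / len(sizes):.0f}%")
print(f"bytes in largest 1%: {100 * top1_bytes / total_bytes:.0f}%")
```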

3.2.1 Workload Characterisation and Modelling

Workload characterisation has long been used in assessing the performance of systems [27]. In storage research, analysis of workload traces, as described above, features prominently. These workload traces capture I/O activity as it happens over a period of time, and contain important information about access requests and access patterns.

Because actual traces are often large and difficult to capture (for technical and political reasons), researchers often have to resort to using synthetic traces to model workload. Generating synthetic traces that are representative of real-world activity is a significant problem [60]. Ganger and Patt [61] advocate a system-level approach to representing I/O workloads, and, in particular, that different classes of I/O request be treated with different importance to better model their effect on overall system performance. SynRGen [54] is a tool for generating workloads based upon algorithmic specifications of tasks, offering more realistic artificial traces of system activity.

Traces are primarily used as input to simulations of storage systems. Simulation is popular because of the flexibility with which data can be collected and operational parameters controlled. This allows “what-if” scenarios to be investigated to examine the effect of specific parameters within the storage system on global system performance, e.g., cache size, seek time, block size, bus speed, etc.

Ruemmler and Wilkes [161] describe a detailed disk simulation, which they apply to the HP C2200A and HP 97560 disk drives. Worthington et al. [207] describe the extraction of disk drive parameters directly by interrogation of SCSI drives, combined with empirical measurements, thereby improving modeling accuracy. Ganger et al. [63] produced a highly-configurable disk simulator called DiskSim. DiskSim can model such storage aspects as device drivers, busses, controllers, adapters, and disk drives. DIXtrac [168] improves upon the extraction capabilities of [207] in that it can extract over 100 disk parameters in a fully automated fashion. DIXtrac can be used to provide input for disk drives simulated by DiskSim. Pantheon [202] is a toolkit that supports the development of storage simulations. It has been successfully used in such storage projects as TickerTAIP [29] and AutoRAID [203].

Although simulation is prevalent in storage research, it has the disadvantage of being costly in terms of the time and resources needed to run simulations. Analytical modeling has the advantage of being a quicker and more direct means of exploring alternative design spaces. The disadvantage of analytical modeling is its inaccuracy compared to simulation.

Uysal et al. [196] present an analytical model to predict the throughput of a modern disk array that captures many of the complex optimisations found in such hardware. They validate their model against a real commercial array and obtain prediction accuracy within 32% in most cases, and within 15% on average. Shriver [177] presents a detailed analytical model of a single disk drive. Barve et al. [15] extend this by analysing the performance of several drives on the same SCSI bus, where bus contention becomes a factor when large request sizes are used. Shriver et al. [178] present an analytical model of disk drives that support read-ahead and request reordering. Their prediction accuracy is to within 17% across a variety of disk drives and workloads. Borowsky et al. [24] present an analytical model of the response time of a device subject to phased and correlated workloads. Their model is designed to determine whether a given quality of service bound is satisfied for the workloads presented.
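
To convey the flavour of such analytical models, the sketch below gives a first-order service-time and throughput estimate from seek, rotation, and transfer terms. The cited models are far richer (adding caching, read-ahead, request reordering, and bus contention), and the drive figures here are assumptions for illustration:

```python
# First-order analytic disk model: service time of a random request is
# average seek + average rotational latency (half a revolution) + transfer.

def service_time(request_bytes, seek_s, rpm, transfer_rate):
    rotational_latency = 0.5 * 60.0 / rpm
    transfer = request_bytes / transfer_rate
    return seek_s + rotational_latency + transfer

req = 8 * 1024  # an 8 KB random read
t = service_time(req, seek_s=0.0049, rpm=10_000, transfer_rate=36 * 2**20)
print(f"service time {1000 * t:.2f} ms, "
      f"throughput {req / t / 2**20:.2f} MB/s for random 8 KB reads")
```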

Because analytical models lend themselves to direct computation, unlike long-running simulations, they are ideal for use in storage optimisation problems. One area that has recently received particular attention is attribute-managed storage [72]. Attribute-managed storage is a constrained optimisation problem in which a set of workload units and devices are presented as input to a solver. The output is a mapping of workload units to devices such that the needs of the workload units are met. Shriver [176] describes the workload and device attributes and their mapping in greater detail. Alvarez et al. [4] describe a suite of tools called Minerva that can be used to design storage systems automatically. In a test, Minerva successfully designed a storage system to handle a decision-support workload that performs as well as a human-designed configuration. The Minerva-generated system used fewer disks than the human-designed system (16 vs. 30), and took only 10–13 minutes to produce the design. Minerva is limited by its analytical models in the types of storage system it can design (currently, disk arrays).
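
As a toy version of the attribute-managed storage problem, the sketch below assigns workload units to devices by first-fit on capacity and bandwidth constraints. Real solvers such as Minerva use far richer workload and device models; all names and figures here are invented:

```python
# Toy attribute-managed storage: first-fit-decreasing assignment of
# workload units to devices under capacity and bandwidth constraints.

workloads = [  # (name, capacity GB, bandwidth MB/s) -- invented figures
    ("db-index", 40, 25), ("db-data", 120, 15),
    ("logs", 60, 30), ("archive", 150, 5),
]
devices = [  # (name, capacity GB, bandwidth MB/s) -- invented figures
    ("array-0", 250, 40), ("array-1", 250, 40),
]

assignment = {}
free = {name: [cap, bw] for name, cap, bw in devices}
for name, cap, bw in sorted(workloads, key=lambda w: -w[2]):
    for dev, remaining in free.items():
        if cap <= remaining[0] and bw <= remaining[1]:
            assignment[name] = dev
            remaining[0] -= cap
            remaining[1] -= bw
            break
    else:
        raise RuntimeError(f"no device can host {name}")

print(assignment)
```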

3.3 Local File System Performance Issues

As mentioned previously, seek times dominate disk access times. A simple “break-even” point to make a seek worthwhile would be to spend as much time actually transferring data as seeking to find it. Obviously, though, the smaller the fraction of a total read or write that is occupied by seeking, the better. Even COTS hard drives have relatively high sustained transfer rates, making the time lost to seeking even worse. For example, a “desktop use” 25 GB EIDE drive with an average seek time of 9 ms supports a sustained transfer rate of 15.5 MB/s [86]. A “server” 36 GB Ultra 160 SCSI drive with an average seek time of 4.9 ms supports a sustained transfer rate of 36 MB/s [87]. For the 25 GB drive, just over 142 KB of data could be transferred in the time spent in an average seek (0.009 s at 15.5 MB/s); for the 36 GB drive, just over 180 KB could be transferred (0.0049 s at 36 MB/s). If only, say, 1 KB were transferred before seeking elsewhere, then less than 1% of the total time would be spent actually transferring useful data using the above-mentioned drives as examples. Even for a larger block size—4 KB—still less than 3% of the time is spent usefully transferring data. Successful use of the available disk bandwidth means making fewer seeks and larger transfers.
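
The break-even arithmetic above is easily reproduced. This sketch uses the two drives quoted in the text (taking MB as 2^20 bytes, which matches the figures above):

```python
# Break-even arithmetic for the two drives quoted above (MB = 2**20 bytes).

def useful_fraction(request_bytes, seek_s, rate):
    """Fraction of total request time spent transferring, not seeking."""
    transfer_s = request_bytes / rate
    return transfer_s / (seek_s + transfer_s)

drives = {
    "25 GB EIDE (9.0 ms, 15.5 MB/s)": (0.0090, 15.5 * 2**20),
    "36 GB SCSI (4.9 ms, 36 MB/s)":   (0.0049, 36.0 * 2**20),
}

for name, (seek_s, rate) in drives.items():
    # Bytes that could have been transferred during one average seek.
    print(f"{name}: an average seek costs {seek_s * rate / 1024:.0f} KB")
    for req in (1 * 1024, 4 * 1024):
        pct = 100 * useful_fraction(req, seek_s, rate)
        print(f"  {req // 1024} KB request: {pct:.1f}% useful transfer time")
```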

3.3.1 Caching

The impact of seeking on disk I/O is often addressed by caching. A large RAM-based cache can absorb the effects of multiple reads of the same disk block over time, and most modern operating systems employ such buffer caches. However, caches depend upon locality of reference to be successful: what happened before will happen again. As mentioned previously, workload studies indicate most reads are whole-file sequential reads. These have a negative impact on cache performance, as, for large enough files, they can effectively stream through and flush the cache if action is not taken to mitigate such behaviour. Baker et al. report cache read miss rates of about 40% in their study, rising to 97% on machines processing large files [12].
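
The scan-flushing effect can be demonstrated in a few lines. Below, a toy LRU block cache holds a working set comfortably until a single sequential pass over a large file evicts it entirely:

```python
from collections import OrderedDict

class LRUCache:
    """A minimal LRU block cache."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)       # refresh recency
        else:
            self.blocks[block] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=100)
hot = list(range(100))                  # a working set that fits the cache
for block in hot * 5:
    cache.access(block)
print("working set cached:", all(b in cache.blocks for b in hot))

for block in range(1000, 2000):         # one sequential pass, large file
    cache.access(block)
print("working set cached:", any(b in cache.blocks for b in hot))
```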

Caches will cache the most commonly accessed disk blocks. In an engineering and office environment, the most commonly-used files will likely be application and system programs: word processors, compilers, editors, and similar. Across the entire user population, individual user data files will be accessed less often relative to the software that operates upon those data. Caching will likely be more successful for programs than data. The exception to this is perhaps the caching of writes.

A disk block that is read, modified, and written back will exhibit strong temporal locality of reference. Unfortunately, out of concern for safety, many applications will not update file data in situ, but, instead, will often make a temporary copy of the old file to which the updates are applied. After processing, the original file is either deleted or renamed as a backup copy, and the copy is renamed to the original. Baker et al. [12] report that only 10% of new data written is deleted or overwritten in the cache rather than actually being written to disk. However, they also report that the 30-second dirty block cleaning rule in their system accounted for over 70% of blocks written from the cache, indicating that cache content safety (flushing dirty blocks to disk to guard against loss due to hardware failure), not cache size, is a major factor in write-back cache performance.

3.3.2 Clustering and Fragmentation

Another method of ameliorating the effects of seeks is to co-locate related data physically close together on the disk. In this way, only small seeks or a rotational delay must be incurred to read successive blocks of a file. Physical clustering is especially useful in the case of file system metadata. The Fast File System (FFS) for Unix greatly improved the performance of the previous file system design by spreading file metadata throughout the disk [126]. The new design divided the disk into cylinder groups, with each cylinder group having its own local metadata. In the previous design, all file system metadata was clustered at the beginning of the disk. Data blocks located on tracks away from the start of the disk thus incurred long seeks every time the metadata for the file stored there was accessed. McKusick [126] estimates that such a segregation of metadata, coupled with the small 512-byte block size, meant that the old Unix file system design was only able to use 3–5% of the disk bandwidth. But, by interspersing (clustering) metadata and data blocks nearby on the disk, and using a larger block size (4K and up), disk bandwidth utilisation increases to 47% in the new FFS design [126]. Clustering in FFS also improves crash recovery, as file system metadata is not physically concentrated in one localised area of the disk, but is spread over it.

3.3.3 Block Sizing and Allocation

A more general application of increasing the block size, to lessen the effects of seeks and improve storage contiguity for the large sequential transfers prevalent in file system traces, is to adopt variable-sized blocks for files. In effect, the disk is treated as an area of storage from which variable-sized allocations are made. Such a problem is not new: efficient memory allocation and garbage collection have long been studied. Disks pose additional performance constraints because fragmentation seriously affects performance by translating into long seeks (excessive head movement), and garbage collection on disk is more costly than in main memory.

Wilson et al. [204] provide a comprehensive survey of the memory allocation literature. Memory allocation allowing variable-sized blocks has the disadvantage of extra metadata overheads for the pointers that link allocated areas, and compaction incurs significant overheads. Knowlton [92] introduced the buddy system of allocation, which is considerably more efficient when freeing blocks.

In buddy systems, only a fixed set of block sizes, b_1 < b_2 < ... < b_k, are available for allocation, and so there is a potential for storage to be wasted if an object is placed in a bigger block than necessary. However, such loss is also a problem for traditional fixed-size allocation schemes (though there it is strictly bounded). A standard block sizing scheme is to use increasing powers of 2 (the exponential buddy system). Hirschberg [78] describes a Fibonacci buddy system in which the block sizes follow the Fibonacci series. The use of the Fibonacci series permits a larger number of different sizes, and hence increases the probability of a good fit, thereby decreasing internal fragmentation.
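
The difference between the two size series is easy to see in code. The sketch below computes the internal fragmentation left by placing a request in the smallest fitting block under each scheme (the request sizes are illustrative):

```python
# Internal fragmentation under exponential vs. Fibonacci buddy block sizes.

def exponential_sizes(limit):
    """Block sizes in a binary (exponential) buddy system: powers of two."""
    s = 1
    while s <= limit:
        yield s
        s *= 2

def fibonacci_sizes(limit):
    """Block sizes in a Fibonacci buddy system."""
    a, b = 1, 2
    while a <= limit:
        yield a
        a, b = b, a + b

def internal_fragmentation(request, sizes):
    """Wasted space when request goes into the smallest fitting block."""
    return min(s for s in sizes if s >= request) - request

for request in (33, 100, 1500):
    exp = internal_fragmentation(request, list(exponential_sizes(4096)))
    fib = internal_fragmentation(request, list(fibonacci_sizes(4096)))
    print(f"request {request:4}: waste exponential={exp:4}, fibonacci={fib:4}")
```

The Fibonacci series offers more distinct sizes over the same range, so it often (though not always) fits a given request more tightly.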

Koch [94] designed a disk allocation scheme based upon the exponential buddy system. Files are stored in a bounded number of contiguous extents, and a reallocation algorithm runs periodically to improve the allocation of poorly arranged files. Koch reports a mean number of extents per file of 1.5 and an average internal fragmentation of less than 4%.

Ghandeharizadeh et al. [65] present and analyse several algorithms for managing files in a hierarchical storage environment, with a focus on the tradeoff between the contiguity of files and the amount of wasted space. (Their work explicitly excludes support for striping, however.) Ghandeharizadeh et al. [64, 66] describe algorithms for the layout of files on disk such that each n-block file is stored in no more than ⌈lg n⌉ separate extents, in order to minimise seeks.
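
One simple way to see why such a bound is attainable: decomposing n into powers of two yields one extent per set bit of n, which for n ≥ 2 never exceeds ⌈lg n⌉. The sketch below is this binary decomposition, not the authors' actual algorithm:

```python
def power_of_two_extents(n_blocks):
    """Split an n-block file into power-of-two extents using the binary
    representation of n; for n >= 2 this never exceeds ceil(lg n) extents."""
    extents, size = [], 1
    while n_blocks:
        if n_blocks & 1:
            extents.append(size)
        n_blocks >>= 1
        size <<= 1
    return extents

for n in (7, 100, 1000):
    print(n, "blocks ->", power_of_two_extents(n))
```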

Seltzer and Stonebraker [173] examine several variable block size file allocation strategies and conclude that such extent-based approaches offer excellent performance in read-mostly environments because of seek time minimisation and large sequential reads.

3.3.4 Log-Structured Approaches

A major trend in local file system design that took hold in the early 1990s was a move towards log-structured file systems to improve write performance [145, 159, 160, 170, 171]. Instead of the fixed layout schemes employed in traditional file systems such as FFS, log-structured file systems treat the blocks of the disk as a log to which data is appended. Unlike FFS-like file systems, which incur seek overheads for each write, log-structured file systems instead batch all writes and then write all the changes to disk in a single large disk transfer operation. Typically, log-structured file systems divide the disk logically into segments, and write an entire segment at a time.

By performing large disk transfers, the impact of a seek is lessened. In addition, crash recovery is greatly assisted, because in the event of a crash, the file system software need only locate the last checkpoint and then roll forward over the writes made after that point. This is in contrast to the time-consuming multi-pass consistency algorithm required in FFS-like file systems [127].

Because disks are finite in size, the log must wrap when it reaches the end of the disk. This necessitates special handling of free space in the file system. The most common approach is to employ a cleaner daemon that runs asynchronously alongside the mounted file system. The cleaner acts like a garbage collector, copying live blocks out of one or more segments containing deleted space, consolidating the live blocks together, and writing them to clean segments in the log. The segments out of which the live blocks were copied are then free to be reused as clean segments for the log. Various policies can be enforced for segment cleaning, to balance the impact on the file system [124, 159, Chapter 5].
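
Cleaning policy can be sketched compactly. The snippet below contrasts a greedy policy (clean the emptiest segment first) with the cost-benefit policy described for Sprite LFS, which weighs the free space gained against the cost of rewriting live bytes and so favours cold segments. The segment figures are invented:

```python
# Choosing which segments the cleaner should process next.

segments = [
    # (segment id, fraction of live bytes u, age since last write)
    (0, 0.10, 5), (1, 0.75, 900), (2, 0.40, 1000), (3, 0.90, 10),
]

def greedy_order(segs):
    """Greedy: clean segments with the least live data first."""
    return sorted(segs, key=lambda s: s[1])

def cost_benefit_order(segs):
    """Sprite LFS cost-benefit: benefit is free space gained (1 - u)
    weighted by age; cost is reading the segment and rewriting its live
    bytes (1 + u)."""
    return sorted(segs, key=lambda s: (1 - s[1]) * s[2] / (1 + s[1]),
                  reverse=True)

print("greedy:      ", [s[0] for s in greedy_order(segments)])
print("cost-benefit:", [s[0] for s in cost_benefit_order(segments)])
```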

Another way to improve write performance, along the lines of log-structured file systems, is to log writes to a smaller, separate logging medium and then asynchronously commit the changes to the actual data disks of the file system. The logging medium could be non-volatile RAM [80, 112], or a dedicated, separate log disk [34]. Non-volatile RAM has also been used to implement reliable buffer and file cache mechanisms to mitigate the impact of the periodic writing of dirty blocks to ensure file system consistency. Prestoserve [137] uses non-volatile RAM to cache NFS writes. Rio [32] and Phoenix [59] implement reliable file caches, making writes to the file cache as permanent and as safe as files on disk, but with much greater speed. Rio, for example, performs 4–22 times as fast as a standard Unix file system that employs write-through caching, or 2–14 times as fast as a standard Unix file system using write-back caching. Even when using delayed metadata writes in a standard Unix file system [62], Rio is 1–3 times as fast, but has the reliability advantage of synchronous write behaviour.

3.4 Disk Arrays

File systems that reside on a single physical disk have an inherent bottleneck by virtue of there being only a single head assembly through which to perform all reads and writes. As we have seen, the time needed to move this head assembly over the disk surface is probably the major limiting factor in terms of disk I/O, and hence file system performance.

One way to mitigate this bottleneck is to increase, in effect, the number of head assemblies. The easiest way to achieve this is to create a virtual disk out of an array of individual physical disks. Not only does this increase the potential parallelism of reading and writing—because different areas of the virtual disk can be read and written simultaneously—but it also increases the potential reliability, as a hardware failure no longer renders the entire virtual disk unusable. Such virtual disk arrays are commonly called disk arrays. An important, and the most prevalent, class of disk arrays, in which performance and reliability are emphasised, is the Redundant Array of Independent/Inexpensive Disks (RAID) [31, 67].

RAID employs data striping to improve performance, and redundancy to improve reliability. Striping is a way to distribute data transparently over the disks in an array. The granularity of the stripe unit determines how many disks will be involved in a given I/O request. The larger the stripe unit, the fewer disks need be involved for small I/O requests, allowing multiple I/O requests to be serviced in parallel.
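
A sketch of how striping maps a logical block to a disk and an offset, and how the stripe unit controls the number of disks a request touches (a plain RAID 0 mapping, with no redundancy):

```python
def locate(logical_block, n_disks, stripe_unit_blocks):
    """Map a logical block to (disk, block-on-disk) under RAID 0 striping."""
    stripe, within = divmod(logical_block, stripe_unit_blocks)
    disk = stripe % n_disks
    block_on_disk = (stripe // n_disks) * stripe_unit_blocks + within
    return disk, block_on_disk

# With a 4-block stripe unit on 3 disks, an 8-block request touches 2
# disks; with a 1-block stripe unit, the same request touches all 3.
for unit in (4, 1):
    disks = {locate(b, 3, unit)[0] for b in range(8)}
    print(f"stripe unit {unit}: request spans disks {sorted(disks)}")
```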

Redundancy in RAID involves using extra storage space to improve the reliability of the data stored in the RAID. The extra space holds error correction information for the actual file data stored on the RAID. The most prevalent error correction scheme used in RAID is XOR parity. In this scheme, a parity stripe unit is computed as the XOR of the corresponding stripe unit from each of the data disks. Such a scheme can tolerate the failure of a single disk, because the parity data can be used with the remaining good data to reconstitute the failed disk's data. There are other error correction schemes. For example, the P+Q Redundancy scheme uses Reed-Solomon codes to protect against the failure of up to two disks.

The major performance impact of using error correcting data in a RAID is that additional reads and writes are required when writing data, particularly for small writes that update only one data disk. This is because the parity data must also be updated on a write. In the case of a small write, a read-modify-write process must be used: the old data and old parity are read before the new data is written, to determine how the new data differs from the old, and the difference is then applied to the parity block. Thus, a small write actually requires two reads and two writes. Because parity is always written on a data write, if parity is kept on a single disk, that disk can easily become a bottleneck for the entire RAID.

The interleaving, or striping, of data and redundancy information over the disks of a RAID has a big impact on the performance and storage utilisation of the disks participating in the aggregate storage. Several characteristic organisations, or levels, of RAID have emerged. These are briefly described below:

RAID 0: Data are striped across the disks comprising the RAID unit, with no redundancy employed whatsoever. This gives maximal use of available disk space, and the best write performance, but no fault tolerance. If a disk in the RAID fails, then the whole RAID fails (for all practical purposes).

RAID 1: Twice as many disks are employed in this RAID level as in level 0. Data are striped across the data disks, but are also mirrored on a corresponding redundant disk. So, for each data disk, there is another mirror disk. When data are written, both the data disk and its mirror are updated. When data are read, either the data disk or the mirror can be read, as both contain identical information. Load balancing can be used to select the best disk to which to direct the read request, e.g., the disk with the shortest request queue, or the one whose head is currently closest to the data. If a disk in a RAID 1 fails, all reads and writes will be to and from its mirror. RAID 1 has very high read performance.

RAID 2: This RAID level uses bit-level striping and memory-style error-correcting codes to detect and correct disk failures. So, a RAID with n data disks using Hamming codes for redundancy will use log n additional disks to store parity. If a disk fails, the parity disks can, in combination, detect which disk must have failed, and correct the failed data. As the RAID set grows larger, the number of parity disks grows more slowly. However, because disk failures are self-identifying, the extra parity encoding needed to identify which disk failed is usually redundant, making this scheme little-used.

RAID 3: Another bit-level striping scheme, RAID 3 uses the fact that disk failures are self-identifying to eliminate the O(log n) parity disks, replacing them with a single disk per RAID set. When a disk fails, the parity disk can be used in conjunction with the remaining good disks to reconstruct the failed disk's data. In this bit-interleaved scheme, every disk participates in each read or write, and, in effect, all disks operate identically (the heads are synchronised). This simplifies implementation, and enables a high bandwidth to be delivered, making RAID 3 attractive for multimedia applications where quality of service bounds must be observed.

RAID 4: Similar to RAID 3, RAID 4 is a block-interleaved scheme that uses a single, dedicated parity disk. Read requests smaller than the striping unit need only access a single disk, increasing the I/O parallelism of this scheme. The drawback, however, is that all parity updates on writes go to a single disk, which can quickly become a bottleneck.

RAID 5: This organisation is similar to RAID 4, but improves upon that scheme by distributing the parity information across the data disks of the RAID, instead of having it reside on a separate disk. This distributed parity means that all disks in the RAID are used to store data, but also that some blocks of each disk are used to store parity. The equivalent of a single disk's worth of storage is still consumed by parity, but it is spread across all the disks in the RAID. Although the layout is more complicated, no single disk becomes a bottleneck for parity writes, and there is the added advantage that all disks can participate in reads. The precise distribution of parity can affect performance [108]. RAID 5 generally has the best performance for small and large reads and large writes of any redundant disk array, but its biggest drawback is the read-modify-write penalty for small writes.

Those are the major RAID categories, but by no means all. For example, the aforementioned P+Q redundancy scheme, which can protect against up to two disk failures, is often classified as RAID level 6. In practice, the most commonly used RAID organisations are RAID levels 0, 1, 3, and 5.
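
As a sketch of how RAID 5 spreads parity, the mapping below rotates the parity disk one position per stripe. This is one common rotation; real implementations vary in exactly how data is placed around the parity:

```python
def raid5_parity_disk(stripe, n_disks):
    """One common RAID 5 rotation: parity moves one disk left per stripe."""
    return (n_disks - 1) - (stripe % n_disks)

n = 5
for stripe in range(n):
    p = raid5_parity_disk(stripe, n)
    row = ["P" if d == p else "D" for d in range(n)]
    print(f"stripe {stripe}: {' '.join(row)}")
```

Printed stripe by stripe, the "P" column shifts across the array, so parity writes (and the small-write penalty that accompanies them) are spread over all disks rather than concentrated on one.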

Although excellent performance can be obtained from a RAID with all disks operating, performance can be severely degraded if a disk fails. When a disk fails, the RAID is said to be operating in degraded mode. In this mode, all disks must participate in each read and write (except for RAID level 1), removing any I/O parallelism.

To counter this, a RAID can have online or hot spare disks—disks that are part of the RAID, but that are not used to store either data or parity in the course of normal operation. When a disk fails, its data is reconstructed onto a spare disk. This is done transparently, to ensure the RAID does not go offline. If hot spares are not used, data reconstruction takes place when the failed disk is eventually replaced with a working unit. If hot-swappable drives are used in the RAID, this can be accomplished without shutting down the entire RAID. Distributed sparing [130] and parity sparing [30] are two techniques that take advantage of online spare disks to improve the normal performance of the RAID, whilst still allowing reconstruction to begin quickly.

Even in non-degraded mode, care must be taken to safeguard the integrity of the RAID against system crashes. In particular, the RAID must keep track of not only which disks have failed, but also which logical sectors of a failed disk have been reconstructed, and which logical sectors are currently being updated. This is necessary to avoid reading stale data from a RAID. This bookkeeping information is referred to as the metastate of the RAID.

In addition, it is important to keep track of which parity sectors are consistent and which are inconsistent in the event of a system crash. Customarily, this means that before each write, the parity sector must be marked as inconsistent until new, valid parity is written. Upon a system crash, recovery mandates that all inconsistent parity must be regenerated. Inconsistent parity in the presence of a disk failure means that the data cannot be reconstructed correctly. However, because regenerating an already-consistent parity sector leaves it consistent, it is permissible to trigger the regeneration of all parity sectors upon recovery from a system crash, instead of maintaining stable parity metastate information.

As well as individual disk failures, it is also the case that multiple disks can fail systematically. Typically, this is because the controller or bus to which they are connected fails. Thus, instead of a single disk failing, a whole string of disks fails. One way to combat this problem is to arrange error correction groups orthogonal to the actual hardware arrangement. Such a strategy is broadly called orthogonal RAID [139, 169].

RAID organisations are the subject of intense research, particularly in improving write efficiency and reconstruction performance. Floating parity [131] and parity logging [183] can improve small write performance. Declustered parity [79, 132] can improve reconstruction performance by distributing the parity groups such that the load is balanced more evenly across the disk array in degraded mode.

Research has been undertaken into employing adaptive RAID organisations, to match the workload with the best layout. The HP AutoRAID system [203] uses two levels of RAID between which files may migrate automatically, depending upon how often they are updated. RAID 1 (mirroring) is used for frequently updated data, and RAID 5 for infrequently updated data. Initially, all but 10% of data is stored as RAID 5. In the course of normal operation, frequently-updated data in RAID 5 is promoted to mirroring.

One of the major problems with RAID schemes is that they are effectively bus-based solutions, and so have restricted scalability. Typically, the disks of a RAID array are connected directly to a RAID controller that mediates all disk I/O. The RAID controller may be a custom hardware controller or, increasingly, a “software RAID” in which a RAID disk driver is available as part of the operating system. The RAID driver can then act as a “higher level” layer upon the native disk device drivers to create arbitrary RAID configurations using those disks (or partitions thereof).

The RAIDframe [38] software toolkit is a research vehicle for experimenting with different software RAID organisations. It has been ported to several variants of Unix, offering various levels of functionality: as a RAID simulator and as a RAID device driver. For example, NetBSD 1.4 onwards includes a port of RAIDframe that enables, through a kernel configuration option, a RAID pseudo-device driver. This driver allows RAIDs of levels 0, 1, 4, and 5 to be created using block devices, or even using other RAIDs. So, it is possible to create a RAID 0 array out of several RAID 5 arrays, or to construct other arbitrarily complex hierarchies. The driver also supports hot spares, various parity declustering options, and so on.

Widely-available software RAID makes it fairly straightforward to build large disk arrays using COTS hardware. For example, a single PC clone with four wide SCSI host adapter cards, each driving fifteen 40 GB SCSI hard drives, could provide over 2 terabytes of RAID storage. However, in such a setup, the limiting factor is likely to be the memory bus bandwidth.

3.5 Networked Storage

There is a vast literature on networked storage. Put simply, networked storage is storage available via a local-area or wide-area network. Typically, a client/server model is employed, in which the actual disk storage is provided by servers, and accessed over the network by clients. The division of labour between client and server can vary greatly. At one extreme, the server can act as a “black box,” providing all functionality of authentication, name lookup, consistency, and data delivery (e.g., the Network File System (NFS) [162]); at the other, it can provide simply a low-level storage abstraction that clients may read and write (e.g., the Swarm storage server [75]).

From a hardware perspective, networked storage can be divided into the following broad categories:

Server Attached Disks: Storage made available by the server consists of disks that are directly attached to the server host. An archetypal example is one or more hard disks connected to the server via a SCSI host adapter. Only the server host computer has direct access to the disks. All other access—by networked clients—must be via the server itself.

Network Attached Disks: This organisation has disks directly attached to the network, able to send data to clients via some high-level networking protocol such as TCP/IP. Usually, network attached disks require some lightweight aid in the form of file manager servers that mediate client requests to the disks themselves. The file manager interacts with the network attached disks via a private network to ensure the integrity of data on the disks is maintained. Once a request is authorised, network attached disks deliver data directly to the client making the request.

Storage Area Networks: Here, a high-speed, private network sits between storage devices (disk, tape, etc.) and storage servers. The storage servers act as a front-end to the storage pool, and any storage server can communicate with any storage device on the storage area network (SAN). Clients make storage requests via any storage server. Data is transferred between clients and storage devices via the servers.

From a server/file manager point of view, networked storage can be divided into the following broad categories:

Central: All client requests are served by a single server, which carries the full burden of file management functionality. NFS [162] is the archetypal centralised server. Central servers are relatively straightforward to design and implement, but lack scalability.

Distributed: Distributed servers cooperate with peer servers in order to satisfy client requests. Typically, distributed servers will partition the name space, and each will handle some portion of it. Examples of this category are the Andrew File System (AFS) [167] and the Sprite file system [143].

Merged client/server: In this category, hosts can act as both clients and servers, acting as a server for locally-attached storage, and as a client when accessing non-local storage. Most of the file management burden is placed on the clients. Network of Workstations (NOW) [6] and xFS [7, 201] are typical examples of this category.

There is a lack of standardised terminology in the area of large-scale, reliable, scalable storage. Devlin et al. [48] offer a taxonomy describing typical enterprise-scale storage organisations. Much terminology, however, is industry-driven [56, 185, 199].

Current emphasis is moving away from SCSI-based disk systems towards Fibre Channel-capable drives, which support the notion of network-attached storage. Fibre Channel [17] is a high-speed serial interface offering speeds upwards of 100 MB/sec. It is more flexible than SCSI in that it supports three basic interconnection topologies that may be scaled up: point-to-point; fabric (switch); and arbitrated loop (ring). An arbitrated loop can support up to 126 devices without the need for a switch. This is more scalable than SCSI, which can support only up to 15 devices per bus. In addition, Fibre Channel can be driven over much larger distances—1 to 10 km, depending upon wiring—as opposed to a maximum of 25 metres with SCSI. Fibre Channel integrates both storage and networking protocols, and is implemented using six protocol layers.

3.5.1 Intelligent Disks

Storage research continues the trend of migrating functionality into the disk itself. Instead of moving data to code, one approach is to move the code to the data and execute high-level functions on the drive itself [1, 91, 158]. In effect, the disk becomes an object-oriented device that responds to high-level method invocations from clients. In such a model, files become objects, the management of which becomes largely opaque to the outside world.

Keeton et al. [91] at Berkeley put forward a case for Intelligent Disks, or IDISKs, which provide higher-level application functionality on the disk itself. Gibson et al. [68, 69] describe a disk-centric storage architecture called Network Attached Secure Disks (NASD) that also concentrates functionality on the disk itself. An extension of this work is that of Active Disks, undertaken at Carnegie Mellon and at the University of California, Santa Barbara.

Riedel and Gibson [158] at Carnegie Mellon describe the issues and potential benefits of being able to execute code directly on NASD. Applications studied include database select and parallel sort. Their proposal of active disks localises activity on the disk, with relatively modest processing requirements and no communication between disks.

Acharya et al. [1] at UCSB also propose an active disk architecture that is a little more powerful in its scope. In it, disklets run on disks, dispatched and coordinated by a host. They outline several applications, including: SQL select and group-by; external sort; relational database datacube computation; image convolution; and generation of composite satellite images. Their simulation results indicate active disks outperform conventional disk architectures.


The IDISK architecture described in Keeton et al. [91] is probably the most powerful, in that it posits higher-bandwidth disk-to-disk communication, allowing for more general-purpose parallelism.

All of the active disk architectures are predicated on increased technological sophistication of the disk drive itself. The processors used in current drives are typically one generation behind those used in the host computers driving them. Making a transition from a 0.68 micron to a 0.35 micron process would, for example, enable the integration of a 200 MHz StrongARM RISC core into the current ASIC die space, plus an additional 100,000 gates for specialised processing, in addition to incorporating all existing formatting, servo, ECC, and SCSI functions [158]. Experts at Seagate anticipated that by 2001, drives would provide upwards of 100 MIPS in processing power, plus up to 64 MB of RAM and 2–4 MB of flash memory on the drive [5].

3.5.2 Parallel File Systems

In terms of software organisation, many networked storage architectures have been proposed. One important class is that of parallel file systems. Initially, these were relatively simple transpositions of the traditional Unix file system semantics, which treated files as linear sequences of bytes [152]. However, one distinguishing feature of parallel file systems is that disk I/O is tightly coupled with the parallel computation. Thus, parallel I/O began to assume more importance in parallel computing, especially in those applications that are data-driven. The Message Passing Interface (MPI) working group is evolving standards for parallel I/O [35], as is the Scalable I/O Initiative [37].

Some parallel file systems recognise that files can be structured to mirror the application and workload under which they will be used. The Vesta Parallel File System [36] treats files as multiple disjoint partitions that can be accessed in parallel, and allows layout to be tuned to anticipated access patterns. The Hurricane File System [97] provides hierarchical building blocks that can be composed to custom-build access methods for files in a shared-memory multiprocessor environment. Application data files can thus be associated with precisely-tuned semantics appropriate to their chosen application and access patterns.

Disk-Directed I/O [96] is a scalable I/O approach for MIMD multiprocessor systems. Instead of compute nodes sending requests to I/O nodes independently, they collectively send a single request to all I/O nodes. The I/O nodes then control the flow of data back to the compute nodes so as to maximise disk bandwidth. The Galley Parallel File System [140] implements a version of disk-directed I/O, although its I/O requests are not explicitly collective. The PARADISE system [26] integrates many theoretical parallel I/O techniques into an experimental file system, including disk-directed I/O; cooperative caching; flexible file structuring via metadata; and selectable file access methods. The TickerTAIP [28] parallel RAID architecture bridges the parallel I/O and RAID models. River [9] is another approach to parallel I/O management. It provides a data-flow programming model and I/O layer for cluster environments. It uses a distributed queue to balance the workload amongst the data consumers in the system, and graduated declustering to balance the workload amongst the data producers. The goal is to provide persistent peak performance across workload perturbations.

The IEEE Storage System Standards Working Group has been working on a reference model for a large-scale, wide-area, scalable storage system designed for petabyte-scale repositories of potentially billions of datasets, each of which could be terabytes in size and spread across heterogeneous secondary and tertiary storage [39, 82]. The model is designed to be widely dispersed geographically, yet support an I/O throughput of gigabytes per second. The High Performance Storage System (HPSS) project is an implementation based upon the IEEE Mass Storage System Reference Model. The HPSS concentrates on highly-parallel, supercomputer, and cluster environments, and seeks to deliver data at upwards of 100 MB/sec [40]. One of the design criteria of the HPSS is to utilise existing and emerging standards for increased deployability. Another design goal is to provide API access at appropriate levels for use by separate digital library, object storage, and data management systems [40].

3.5.3 Distributed File Systems

Distributed file systems are networked file systems that spread file system contents over several machines. In the simplest case, servers handle all file system responsibilities, including serving data; maintaining consistency; and handling locking. In more advanced cases, this workload is shared between clients and servers, and in some systems, clients may act as both clients and servers.

Although NFS may be used to mimic a distributed file system, it is not really considered one. Sprite [143], developed at Berkeley, is a distributed file system in which the global file system name space is partitioned across multiple domains. A file server may handle one or more domains. Sprite uses aggressive caching to improve performance. Clients cache files locally, to lessen server workloads. However, when a file becomes write-shared, the server becomes involved, sending call-back messages to all clients with the file open to disable caching for that file. The AFS [167] is similar to Sprite in that the file system is distributed across servers. Like Sprite, AFS maintains state, and performs callbacks to clients when cached shared file data is modified by another client. Consistency in AFS is at the whole-file level, with the last file written being the version maintained by the server.

The biggest problem with AFS and Sprite is that they do not scale well to large numbers of clients. Availability is also a problem, should a given server become disconnected. The consistency granularity of AFS is also weaker than traditional Unix semantics. AFS was commercialised and developed as the Distributed File System (DFS). The DFS is the basis of the Open Software Foundation (OSF) Distributed Computing Environment (DCE) [141], and provides stronger consistency semantics than AFS. Coda (“COnstant Data Availability”) [166] builds further upon AFS, and is intended to provide improved availability and even support disconnected operation for mobile users. In Coda, the global name space is partitioned into volumes consisting of files and directories. A volume is the unit of replication. It may be explicitly replicated when it is created, and a volume replication database is used to keep track of all volumes. The volume replication database is itself replicated at every server.

Coda clients use more aggressive caching than in AFS, in that they can proactively cache entire files for future disconnected operation. Disconnected operation may be the result of network inaccessibility to any Coda server replicating a volume used by the client, or a voluntary disconnection, such as a laptop computer being removed physically from the network. The local disk of a Coda client is treated as a file cache, and, during disconnected operation, the client becomes a server, handling all requests out of its local cache. Upon reconnection, modifications are consolidated with the servers of the volumes used by the disconnected client. A repair tool is provided to allow the user to resolve inconsistencies that cannot be resolved by Coda automatically.

Berkeley’s NOW project produced xFS, a Serverless Network File System [7]. In xFS, clients are both clients and servers: there is no centralised server (or small subset of machines designated as servers) in xFS. This greatly improves scalability. Metadata management is distributed across all nodes, instead of being partitioned onto canonical servers.

The Global File System (GFS) is a serverless file system in which a cluster of clients connects to network-attached storage devices arranged in a network storage pool connected via a storage area network [182]. GFS uses a distributed lock manager controlled by the storage devices. Devices in the network storage pool are arranged into subpools, grouping devices with similar characteristics. These subpools are then exploited for efficiency. For example, file metadata is placed on low-latency subpools, whereas file data is placed on high-bandwidth subpools. Although GFS is designed for clusters, it can be extended to LAN and WAN environments by GFS clients exporting file systems via other protocols such as NFS and the Hypertext Transfer Protocol (HTTP) [21].

Clusters are receiving a lot of attention as a way of producing low-cost, powerful, scalable computing facilities. Correspondingly, there has been significant cluster development, resulting in mature products. Clusters are attractive because their administrative domain is smaller than that of wide-area distributed file systems. This allows for lower latency, higher stable bandwidth, and simpler security considerations. A distributed lock manager, for example, is easier to deploy with high performance in a SAN or cluster environment than over a wide-area network.

VAXClusters and Tru64 Unix clusters provide high-performance, fault-tolerant cluster-level file systems [99, 125]. Cluster members can share locally attached storage, or use NASD over a variety of interconnection fabrics. A hierarchical lock structure and distributed lock manager control file sharing. The lock manager can detect deadlocks, and the system features failure recovery. The Lustre cluster file system is similar in scope [122].

The Berkeley NOW work has sparked much cluster-level file system work. The later development of Sprite [143] gave rise to Zebra [76], which provided striping across disks. Instead of striping files, Zebra uses a per-client log which is striped across the disks, rather like a log-structured file system. The Swarm storage system, developed at the University of Arizona, is similar in that it provides a striped log abstraction on cluster storage nodes for clients [75]. DEC’s Frangipani [192] is a cluster-level storage system based upon their Petal [109] virtual disks. Petal provides virtual disks, each of which has a 64-bit address space. A virtual disk is made up of a pool of physical disks that may be spread across multiple servers. Petal is a software RAID in many respects. It features component failure tolerance and chained-declustered data access. Copy-on-write techniques enable Petal virtual disks to support efficient snapshots, for online backup purposes. Frangipani builds on Petal to provide a consistent view of the same set of files to all clients, and uses a distributed lock manager to ensure coherence.

4 Digital Libraries Object Repositories

Although there has been extensive and intense research in the area of file and disk storage, there has been comparatively little undertaken in the area of storage for digital libraries. Most of what has been done is at the application level: data structures and file layouts for indices; protocols for search and data interoperation; file formats for archival storage; etc. Digital library metrics studies have focused on user interface and behavioural concerns, along with search session and query characterisation [110].

Although digital libraries are voracious consumers of storage, a tacit assumption in most cases is that the underlying operating system I/O substrate will be used for storage. An application-layer shim is applied to provide the more sophisticated I/O that is characteristic of digital libraries [135, 136]. This may lead to a severe “impedance mismatch” between the lower and upper layers.

One notable exception, in which storage and access are given a great deal of attention, is the area of digital multimedia. There is a large literature on the design and implementation of architectures for digital multimedia, with a particular focus on video servers. Although this is but a subset of the digital library universe, it indicates that attention to low-level detail is warranted where performance is necessary.

Traditional file systems treat files abstractly as simple linear sequences of bytes, with little or no additional semantics beyond that. Traditional file system files have little in the way of metadata attached to them. What metadata exists is limited to information about the low-level storage of the file, and the ownership and access dispositions of the file. Almost no semantic information is associated with the file, reinforcing the “neutral linear byte stream” semantics prevalent in traditional file systems. There, the mantra is that applications provide semantics. Whereas that is true to an extent, as we have seen, great performance improvements can result from matching a file’s storage and access at the low level to its intended application (e.g., Hurricane, PARADISE, et al.). File system studies that have tracked file types consistently report a good correlation between file size and file type (semantics) [49].

In addition, a file’s type may limit its access semantics. For example, compressed files are often not amenable to random access, because the decompression state of the current block depends upon that of all the previous blocks in the file. Such files, therefore, will almost certainly be accessed in a linear, sequential fashion. (There are compressed file formats that support nonlinear access, most notably many multimedia file formats.)

All this semantic information may be used to improve storage layout and efficiency. A file system that naively assumes all files will be accessed in the same way misses an opportunity to tune performance to file semantics.

4.1 Digital Objects vs. Files

A major differentiation between file systems and digital libraries is that file systems operate on files, and digital libraries operate on digital objects. Digital objects are more sophisticated than files, and more abstract. Files, by comparison, are concrete and simple.

Digital objects have an underlying data representation, and they also have associated metadata that contains information about the digital object. At a minimum, the metadata includes a unique handle [186] by which the digital object is known.

Files reside in file systems, whereas digital objects reside in repositories [88]. Repositories are responsible for storing digital objects and securing them according to rights associated with the digital object. These rights may include terms and conditions under which the digital object may be disseminated by the repository. A repository access protocol is used to effect appropriate access to digital objects stored in a repository, and to undertake the deposit of digital objects into a repository.

A digital object repository is not a digital library. A digital library is much more than a digital object repository, and includes functionality to facilitate user interaction; searching; browsing; work-flow and content management; and more. A digital object repository is one element of a digital library, much in the same way a file system is part of a larger operating system infrastructure. A digital object repository can provide a functional substrate upon which digital libraries may be built.

Another characteristic difference between digital objects and files is that digital objects are typically longer-lived. An important facet of digital libraries, like that of conventional libraries, is content management. Digital libraries usually host collections, the contents of which are chosen carefully. There is cataloguing effort associated with introducing an item into a collection, and so, to maximise the return on such investment, items are usually introduced with a long-term archival storage goal in mind.

Although collection items do degrade, go missing, or are otherwise deleted from conventional library collections, the frequency of deletion is small compared with the frequency of access in the collection as a whole. Unlike conventional file systems, which may contain very short-lived “temporary files,” or files whose contents are updated frequently, digital objects are relatively static, or read-mostly in file system terms. Even then, updating a digital object is usually considered a significant event, and may result in different explicit versions of the digital object.

Digital objects also may be complex compound objects. For example, a digital object could be the result of running a given program on a given input file with given parameters.

Finally, a digital object may actually contain no digital data at all: it may consist entirely of metadata that refers to some non-digital archival object, such as an objet d’art owned by a museum. Unlike file system metadata, which is relatively simple and focused on low-level storage concerns, digital object metadata is very rich, in that it may not be uniform from digital object to digital object. In fact, a given digital object may have several encodings of the same underlying metadata, to make it interoperable between several different cataloguing conventions.

4.2 Naming and Location

Naming is very important for digital libraries. Unlike names in local file systems, names for digital objects are usually intended to be globally unique. Part of this stems from conventional libraries, which deal with authoritative versions of objects, and part stems from the desire for federation.

Names in file systems denote specific bitstreams as they exist at the current time. Names for digital objects denote the underlying intellectual property, or digital object, and not specific bitstreams or particular renditions of that digital object. A digital object may contain or support several possible disseminations of that digital object, but these are not generally considered to be separate digital objects in themselves. In file system terms, a digital object is ultimately more like a directory, in that it is a container for all the digital artifacts (metadata and data) directly pertaining to that digital object. One could represent a digital object using a directory in which reside individual files containing the digital content (metadata and data), or even symbolic links to other directories representing digital objects. The directory would represent the digital object, and the directory pathname would be its name.

The important point to note about digital objects is that they are containers rather than simply individual flat entities. So, whereas a name refers to the digital object container, accessing the innards of that container is a more complicated operation, one akin to accessing an object through its methods in the object-oriented paradigm. In this respect, digital objects are closer to the object-based storage paradigm of IDISK, Active Disks, and NASD (see the discussion in Section 3.5.1). So, to access the content of a digital object might require qualifying the name with additional data (e.g., the “method”). The result of an access might be actual data, or other names to dereference.

4.2.1 Uniqueness and Location Dependence

There are two major classes of names: locally unique and globally unique. Locally unique names are unique to a particular server; that is, the same name on two different servers may refer to different objects. (For example, the file /etc/motd may contain two entirely different message-of-the-day contents on two different servers at the same instant in time, even though they have the same name: /etc/motd.)

File system file names are typically locally unique, unless the file system is a wide-area distributed file system, in which case they are usually globally unique (often made so by qualifying them with the global server name).

In addition, a name can be either location dependent or location independent. Names that are location dependent are tied to a particular server or server cluster, and incorporate the server name as some facet of the name itself. Location independent names incorporate no reference to any particular server holding the named object, and part of the name resolution process involves identifying a server storing the object to which the name refers.

Location dependent names are undesirable for at least three reasons: 1) they are tied to a “physical location,” in that if an object is moved (not copied) to another server, the original name becomes invalid; 2) it is not immediately discernible whether two differently named objects are in fact the same thing; and 3) a set of location dependent names is harder to scale, because the set is tied to a particular cluster.

A ubiquitous example of a location dependent, globally unique name is the Uniform Resource Locator (URL) [22] used in the WWW. URLs explicitly encode the host, and the location within that host, of a given resource. When accessing an object named by a URL, the host named in the URL is contacted. For popular objects, this can lead to severe load problems on the server host so named. There are techniques for alleviating such congestion. HTTP redirects [21], Active Names [197], Persistent Uniform Resource Locators (PURLs) [174], the Internet2 Distributed Storage Initiative [16], etc., seek to make location dependent names less so by interjecting a layer of indirection into the resolution process.

4.2.2 Uniform Resource Names

A URL is a form of Uniform Resource Identifier (URI) [20]. A Uniform Resource Name (URN) [181] is also a form of URI, but one that is location independent. Unlike URLs, which have precise syntax and semantics, URNs have a precise syntax [134] but currently rather loose semantics [180].

An overview of work on URNs is given in [195]. Paskin [150] delineates many of the functional requirements and progress towards implementing URNs. The fundamental requirements of URNs include: global scope; global uniqueness; persistence; scalability; legacy support; extensibility; and independence [181].

The basic syntactic form of a URN is urn:nid:nss, where nid is the namespace identifier, which identifies the particular type of URN, and nss is the namespace-specific string under the auspices of the nid. The semantic interpretation and syntactic format of the nss are local to, and defined by, the scheme implemented by the naming authority controlling the nid namespace.
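
To make the syntax concrete, the following is a minimal Python sketch of URN parsing. The regular expression is a simplified approximation of the RFC 2141 grammar (the full grammar also restricts the characters permitted in the nss, which this sketch does not check), and the example URN is illustrative only:

```python
import re

# Simplified approximation of the RFC 2141 URN grammar: "urn:" nid ":" nss,
# where the nid is 1-32 letters, digits, or hyphens.
URN_RE = re.compile(r"^urn:([a-z0-9][a-z0-9-]{0,31}):(.+)$", re.IGNORECASE)

def parse_urn(urn: str):
    m = URN_RE.match(urn)
    if m is None:
        raise ValueError(f"not a valid URN: {urn!r}")
    nid, nss = m.groups()
    return nid.lower(), nss   # the nid is case-insensitive; the nss may not be

# A hypothetical handle-scheme URN, for illustration only:
print(parse_urn("urn:hdl:ncstrl.vatech_cs/TR-95-01"))
# -> ('hdl', 'ncstrl.vatech_cs/TR-95-01')
```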

There are several efforts currently underway to introduce a working URN scheme. The World Wide Web Consortium (W3C) has a working group looking at URNs [83]. Their effort concentrates on resolver discovery, and they propose adaptations to the existing Domain Name System (DNS) to enable a resolver to be located for a given URN [128]. This approach adds a new type of record to DNS databases: the Naming Authority Pointer (NAPTR) [129]. The NAPTR includes rewriting rules to enable a URI to locate an appropriate resolver for resolution [129]. The W3C has proposed a simple mechanism for resolving URNs via HTTP [44]. Powell [154] describes a simple technique that enables the resolution of URNs using the Squid WWW proxy cache.

Two major URN initiatives are the CNRI Handle system [186] and the Digital Object Identifier (DOI) [149]. The DOI is a publisher-led effort to develop a URN scheme for persistent identification of intellectual property that includes a mechanism for resolving the DOI to some useful resource or service associated with that intellectual property. As well as identifying an item of intellectual property, a DOI includes basic metadata describing the target of the DOI [149].

The DOI actually uses the Handle system as its resolution mechanism. The Handle system is a global, decentralised, replicated URN system. It comprises a Global Handle Registry and many Local Handle Services. The general syntax of a handle is urn:hdl:naming-authority/unique-identifier. The naming-authority is the organisation that has authority over that particular portion of the global handle namespace. In DNS terms, it is akin to being authoritative for a domain. The naming-authority has administrative control over the unique-identifiers.
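
Under that syntax, splitting a handle into its naming authority and unique identifier is a one-line operation. The helper below is hypothetical, for illustration only; a real application would use the CNRI Handle client library rather than parsing strings itself:

```python
def parse_handle(handle: str):
    """Split urn:hdl:naming-authority/unique-identifier into its two parts.

    Hypothetical helper for illustration; production code would use the
    CNRI Handle system client library.
    """
    prefix = "urn:hdl:"
    if handle.startswith(prefix):
        handle = handle[len(prefix):]
    authority, _, identifier = handle.partition("/")
    return authority, identifier

print(parse_handle("urn:hdl:cnri.dlib/december95"))
# -> ('cnri.dlib', 'december95')
```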

The Global Handle Registry is akin to the root nameservers in the DNS. Local Handle Services manage actual handles for one or more naming authorities. A naming authority may be homed at either the Global Handle Registry or at a Local Handle Service. A naming authority can be homed at only one Local Handle Service (or, alternatively, at the Global Handle Registry). Both the Global Handle Registry and a Local Handle Service may actually be comprised of several servers, for the purposes of fault tolerance and load balancing. The service at which a naming authority is homed has administrative control over the handles within that naming authority.

A client library is required for applications wishing to resolve handles. Handle servers feature caching, and handle resolution is somewhat akin to that of the DNS, in that first the authoritative naming authority is sought, and then the handle is resolved via a handle server of that naming authority. Handles may actually be comprised of a set of values. Each value has its own information, including a type entry. Handle resolution queries can thus specify that only values of a certain type, e.g., URL, be returned by a server. This allows a measure of content negotiation by the client.

The Handle system is described in more detail in [187]. The chief disadvantage of using handles as a ubiquitous naming scheme is the overhead of registering and resolving them in the global system. For short-lived, temporary objects, this is a significant additional overhead. However, handles are designed primarily to designate significant archival content: digital objects.

4.2.3 Scalable Object Location

Name space operations are characteristically lookup operations. An abstract way to think of name space operations is as a hash table in which the key is the name, and the value is the set of operations supported by the underlying object.

One way to scalably distribute such a hash table across the nodes of a cluster is to use a scalable distributed data structure (SDDS) [47, 98, 118–120, 198]. Born out of dynamic hashing [105, 106], linear hashing [107, 113] is one of the most common families of SDDS. Litwin, in particular, has proposed many extensions to the basic linear hashing approach, including variants featuring grouping [115], mirroring [116], security [117], scalable availability [114], Reed-Solomon codes [121], and a variant tuned for switched multicomputers [90].
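
As a concrete illustration of the mechanism, below is a minimal, single-process Python sketch of linear hashing. It captures only the addressing and incremental-split logic; an LH*-style SDDS additionally places each bucket on a server node and lets clients compute addresses from possibly stale images of the (level, split) state, neither of which is modelled here. The class and parameter names are assumptions for illustration:

```python
class LinearHashTable:
    """Minimal sketch of linear hashing: growth happens one bucket
    split at a time, with no global rehash of the whole table."""

    def __init__(self, n0=4, max_load=0.75):
        self.n0 = n0                # buckets at the start of round 0
        self.level = 0              # current round number
        self.split = 0              # next bucket due to be split
        self.buckets = [[] for _ in range(n0)]
        self.count = 0
        self.max_load = max_load

    def _addr(self, key):
        # Standard linear hashing address computation.
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.split:          # bucket already split this round
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def insert(self, key, value):
        self.buckets[self._addr(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_next()

    def _split_next(self):
        self.buckets.append([])
        victims, self.buckets[self.split] = self.buckets[self.split], []
        mod = self.n0 * 2 ** (self.level + 1)
        for k, v in victims:        # redistribute with the next-round hash
            self.buckets[hash(k) % mod].append((k, v))
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.split, self.level = 0, self.level + 1

    def lookup(self, key):
        return next((v for k, v in self.buckets[self._addr(key)]
                     if k == key), None)
```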

SDDS have three basic design properties: 1) there is no central directory through which clients must go, avoiding “hot spots;” 2) expansion to new servers is gradual, and according to load; and 3) the access and maintenance primitives do not require atomic updates to multiple clients. These properties provide excellent incremental scalability.

SDDS hash tables are accessed via a primary key. For a digital library object repository, we will use the object’s URN as a primary key. Because a URN is variable in length, we can canonicalise its format by hashing the URN string to a fixed signature, for example the MD5 hash of the string. The MD5 hash will canonicalise any URN string to a 128-bit “object ID” with low probability of collision (2^-128). It will also pseudo-randomise any URN, due to the one-way nature of the MD5 hash function. Other suitable candidate one-way hash functions include the Secure Hash Algorithm (SHA) [57], which hashes an arbitrary-length bit stream to a 160-bit string. SHA-1 is used in cryptographic applications for digital signatures.
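
A minimal sketch of this canonicalisation, using Python’s standard hashlib (the example URN is illustrative):

```python
import hashlib

def object_id(urn: str) -> bytes:
    """Hash a variable-length URN string to a fixed 128-bit object ID.

    MD5 also pseudo-randomises the key, which spreads object IDs evenly
    across the hash space; substitute hashlib.sha1 for a 160-bit ID.
    """
    return hashlib.md5(urn.encode("utf-8")).digest()   # 16 bytes = 128 bits

oid = object_id("urn:hdl:cnri.dlib/december95")        # illustrative URN
print(oid.hex())
```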

Randomised placement approaches have been gaining favour in storage area networks because they offer statistical guarantees for access and provide good incremental scalability. The problem with RAID and similar striping approaches is that they do not scale well with the addition of new disks. With high-efficiency parity layouts, often the entire layout must be recomputed upon the addition of new disks, necessitating a large movement of data. Randomised placement also has the advantage that it performs well for non-predictive access patterns.

The SICMA [25] and RIO [23] projects use randomised placement for multimedia servers. Both are resource efficient, and ensure an even distribution of requests amongst the disks in the system, even in the presence of non-predictive access patterns. RIO also employs partial replication of disk blocks to distribute the request load even more evenly across the disks in the system [163]. It has been shown that even for predictive access patterns, RIO performs equivalently to normal striping approaches [164].

The PRESTO [19] project improves upon SICMA and RIO. PRESTO uses a parity approach rather than straightforward replication to improve space efficiency. PRESTO also has an efficient re-mapping technique that requires only 1/(n + 1) of the virtual data blocks to be moved when adding an additional disk [19].

With the increasing popularity of peer-to-peer networks for cooperative wide-area storage and serving has come a focus on algorithms for scalable object location. The explosive growth of the WWW has created a need for scalability, to alleviate “hot spots” and ensure consistent and timely access to highly popular content. In particular, Karger et al. [89] introduced the concept of consistent hashing for locating trees of caches within a dynamically changing set of hosts. Consistent hashing has the desirable property that each node in the distributed hash table stores approximately the same number of keys, yet there is relatively little movement of keys as nodes enter or leave the system. Consistent hashing also has the desirable property that not all clients need know about all the nodes participating in the distributed hash table, removing the need for global consistency and synchronisation.
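
A minimal sketch of a consistent hash ring follows. It illustrates the idea rather than the exact construction of Karger et al.; the names (ConsistentHashRing, replicas) are assumptions, and the multiple virtual points per node smooth out the key distribution:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas    # virtual points per node
        self.ring = []              # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def _point(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._point(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def locate(self, key: str):
        # First ring point clockwise from the key's hash position.
        if not self.ring:
            raise KeyError("empty ring")
        i = bisect.bisect(self.ring, (self._point(key),)) % len(self.ring)
        return self.ring[i][1]
```

Adding or removing a node moves only the keys that fall between that node’s points and their predecessors on the ring; all other keys stay put.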

Chord [184] uses a variant of consistent hashing to effect distributed lookup in a peer-to-peer environment. It is used as the distributed hash layer of the Cooperative File System (CFS) [43], a peer-to-peer wide-area read-only storage system. Chord uses O(log N) messages to look up a key in a network of N nodes. The caching scheme of Karger et al. [89] has O(1) lookup, but with more routing information stored at each node than Chord.

Plaxton et al. [153] describe a distributed data location technique designed to locate the copy of a data item closest to a client. Like Chord, it also uses O(log N) messages in the lookup, but has stronger guarantees about how far those messages travel within the network. The OceanStore [100] distributed storage system uses a variant of the algorithm of Plaxton et al.

Ratnasamy et al. use a d-dimensional Cartesian coordinate space for their Content-Addressable Network (CAN) [157]. The CAN provides O(dN^{1/d}) message key lookup, and requires O(d) state to be maintained at each node. This keeps the per-node state fixed, but at an asymptotically worse lookup cost. Past [50] uses a protocol similar to Chord, but one based on prefix routing.

The proliferation of recent schemes illustrates the necessity of a scalable lookup system for quickly and efficiently locating resources in a networked environment. The peer-to-peer schemes described above also tackle wider problems, such as highly dynamic node participation in the network; inconsistent views; system security; anonymity; administration; and fault tolerance. By contrast, the volatility of both the network membership and its topology is expected to be significantly lower in a storage cluster: nodes will join the network when new disks are added, and leave when they are removed for replacement.

4.3 Redundant Encoding for Reliability

Many schemes have been proposed for redundantly encoding data to improve reliability and tolerate faults. Several schemes have been proposed for the LH∗ family of SDDS, including mirroring [116], bit-level striping with parity [117], data with scalable parity [114], and Reed-Solomon encoding [121].

Other approaches are available for increasing reliability. Upfal and Wigderson [194] introduce a technique for replicating k copies of a datum across nodes in a distributed system to improve reliability in the presence of failures. Their technique requires that only a majority of the k copies be read or written in any given read or write of the datum.

Rabin [155] proposes a technique for “efficient dispersal of information” in which data can be encoded into n pieces such that only m out of n need be read to recreate the data. Alon and Luby [3] present a linear-time algorithm for an (n, c, l, r)-erasure-resilient code that can be used for encoding and decoding an n-bit string into a set of l-bit pieces such that, when decoding, any subset of the l-bit pieces whose combined length is r is sufficient to recreate the original string. The encoded string is of size cn, and r need only be slightly larger than n.

The PRESTO project [19] employs a simple parity-based redundancy scheme. Each logical block is decomposed into k equal-sized sub-blocks (k > 1), plus an extra parity sub-block the same size as a sub-block. When reading a logical block, any k of the k + 1 sub-blocks can be read to reconstruct the original logical block. When writing, all k + 1 blocks (sub-blocks plus parity sub-block) are updated. The sub-blocks are pseudo-randomly distributed throughout the system. The system can tolerate, with high probability, the failure of any single storage server. In PRESTO, k = 4 for read-mostly objects, giving a 25% storage redundancy overhead [19].
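
The following Python sketch shows the essence of such a k-of-(k+1) scheme using XOR parity. It is an assumption for illustration: PRESTO’s actual block layout, padding, and placement machinery are more involved.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block: bytes, k: int = 4):
    """Split a logical block into k data sub-blocks plus one XOR parity
    sub-block; any k of the k + 1 pieces suffice to reconstruct."""
    assert len(block) % k == 0, "pad the block to a multiple of k first"
    size = len(block) // k
    subs = [block[i * size:(i + 1) * size] for i in range(k)]
    return subs + [reduce(xor_bytes, subs)]

def decode(pieces, missing: int, k: int = 4) -> bytes:
    """Rebuild the logical block when pieces[missing] has been lost."""
    if missing == k:                    # only the parity was lost
        return b"".join(pieces[:k])
    survivors = [p for i, p in enumerate(pieces) if i != missing]
    data = list(pieces[:k])
    data[missing] = reduce(xor_bytes, survivors)   # XOR recovers the gap
    return b"".join(data)

pieces = encode(b"A" * 4096, k=4)       # k = 4: 25% redundancy overhead
pieces[2] = None                        # lose one sub-block
assert decode(pieces, missing=2, k=4) == b"A" * 4096
```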

4.4 Metadata

Metadata is, broadly speaking, “data about data.” However, that simplification hides the rich complexity inherent in metadata. Metadata has widely varying uses, and that use may be subjective, meaning that metadata is interpreted in a specific context. A digital object may be put to different uses by different entities, such as publishers, archivists, and end-users. Such entities will often create and interpret metadata specific to their own intended purposes. Rather than requiring a global union of all possible metadata, it is more flexible to be able to aggregate diverse sets of metadata for a given object, to enable interpretation of that object by a diverse set of end users.

One approach that has garnered widespread support is to consider a digital object as a container into which is placed digital content and associated metadata. The Warwick Framework [102], which arose out of a metadata conference, describes this approach. It seeks to address some of the limitations of the Dublin Core [51, 193] metadata set, the chief one being that the Dublin Core seeks to be a global metadata set, and such generality is undesirable in some instances.

In the Warwick Framework, the metadata for an object is actually a container holding packages of metadata. Each package can either be, itself, a metadata container; a metadata set containing actual metadata; or an indirect reference to an external metadata object, referenced via a URN. A client may access the contents of a metadata container via an operation that returns a sequence of packages inside the container. The client may then skip over unknown packages (metadata that the client does not know how to deal with, or cannot access) and use only metadata that it “understands” for the application at hand [102].

4.4.1 Buckets

The theoretical framework for the Warwick Framework was laid down by Kahn and Wilensky [88]. The Warwick Framework was subsequently generalised by recognising that there is no real distinction between metadata and data, and so containers should store both. Daniel and Lagoze [45] describe such a generalisation, known as Distributed Active Relationships, later more commonly known as buckets [138]. An architecture based around this approach is FEDORA [151].

The bucket paradigm treats digital objects as abstract data types whose contents are accessible through defined methods that allow a user to interrogate the contents of the digital object. In this way, more “intelligence” is invested in the digital object itself, and this makes the digital object less reliant on functionality built into the repository in which it is stored. This philosophy is termed Smart Objects, Dumb Archives (SODA) by Maly et al. [123]. There are interesting parallels between the philosophy of SODA and the disk-centric approach of IDISK and NASD, most notably that all emphasise self-management of objects.

Each bucket in the SODA model has a handle, and contains one or more typed elements and zero or more packages. Packages are used to aggregate elements. Typically, several different renditions of the same document, e.g., PostScript, PDF, and TeX, will form a single package, with each different format of the same document being a distinct element of that package. A package can also contain metadata, and terms and conditions for accessing the bucket, and can even be a pointer to a remote bucket, element, or package. Packages or elements in a bucket can, themselves, be buckets. The entire bucket has several primitive methods to allow controlled access to its contents. The NCSTRL+ project [138] seeks to expand upon the NCSTRL distributed digital library [111] by using the bucket approach, amongst other things.
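
As a data-structure sketch (not the actual SODA API; the class and field names are assumptions for illustration), the containment hierarchy looks roughly like this:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Element:
    """A typed element, e.g. one rendition (PostScript, PDF) of a document."""
    mime_type: str
    data: bytes

@dataclass
class Package:
    """Aggregates elements, e.g. all renditions of the same document."""
    elements: Dict[str, Element] = field(default_factory=dict)

@dataclass
class Bucket:
    """A SODA-style bucket: a handle plus typed elements in packages."""
    handle: str
    packages: Dict[str, Package] = field(default_factory=dict)

    def list_packages(self):
        # One of the bucket's primitive access methods.
        return list(self.packages)

    def retrieve(self, package: str, element: str) -> bytes:
        return self.packages[package].elements[element].data
```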


The FEDORA model is very similar, except that packages are, instead, typed byte streams called DataStreams, which are accessed via disseminators. A Primitive Disseminator allows bucket-level access, whilst individual Content-Type Disseminators allow structured access to the individual DataStreams.

4.4.2 Terms and Conditions

An advantage of making digital objects self-contained, as in the bucket approach, is that it ties them less to a particular repository, and so makes replication and distribution easier in a heterogeneous architectural environment. In particular, terms and conditions can be implanted within the digital object, instead of being wholly the responsibility of the repository. So, the move is to secure digital objects, rather than communication channels, which simplifies the superdistribution of digital objects.

Cryptolopes [71] adhere to this philosophy by encrypting the contents of a digital object, and including encrypted versions of the keys used to do so, along with digital certificates and checksums that may be used to verify the integrity and authenticity of the content. A third-party clearinghouse is used to enforce the terms and conditions of the bucket, facilitating the buying or obtaining of the appropriate keys to decrypt the content [95]. A similar technique is used in the DigiBox architecture [179].

Finally, the Resource Description Framework (RDF) is gaining popularity as a model for encoding metadata for digital objects (resources) [133]. The RDF uses XML as its encoding system, and provides a means for associating properties with resources. The properties may be assigned their own XML namespace, to disambiguate the notion of, say, “creator” in one scheme from “creator” in another scheme. In this way, the RDF supports the ability to describe the same resource (digital object) with heterogeneous metadata sets, as in the Warwick Framework.

The major disadvantage of many metadata encoding formats, from a low-level storage viewpoint, is that they are often free-format, and so relatively uneconomical and unwieldy in terms of storage management. Importantly, too, they are relatively expensive to parse when compared to typical file system metadata, which, for efficiency reasons, is necessarily compact and fixed in format. It is a significant challenge to extend the richness of digital object metadata to low-level repository storage.

4.5 Digital Object Repositories

Digital object repositories are akin to the file system layer of traditional operating systems. They are responsible for the storage, management, and access of digital objects. They provide low-level functionality upon which digital library functionality may be built.

Traditionally, a digital object repository is seen as a form of “middleware” that provides a shim to the operating system file system and network layers to allow low-level access to structured digital object content. This is a utilitarian stance, to leverage existing storage system functionality in a platform-independent fashion. It does, however, have the potential to divorce digital object storage from the actual low-level storage subsystem, enough to cause a significant loss of efficiency.

There are two competing goals when designing a digital object repository: preservation versus efficiency. A repository designed with preservation as the primary goal seeks to make the storage as simple and transparent as possible, so that recovery of data can be done as easily as possible. This involves the use of self-describing files and data structures; no deletions; and use of only disposable auxiliary structures (ones that can be faithfully recreated from scratch, if necessary). The major impact of such an approach is a significant detriment to performance, and so little work has been done in this area. Crespo and García-Molina [42] propose a layered digital archive architecture with long-term data preservation as its primary goal.

Instead of designing preservation into a digital object repository from the ground up, it is usually assumed that the low-level storage will be supported by a disaster-recovery plan that allows it to be recovered after significant organisational loss. Allied to that is a copy-forward preservation approach, which migrates the low-level storage and organisation of digital objects into new, appropriate formats as system software undergoes generational change.

4.5.1 Kahn-Wilensky and its Extensions

A significant abstract, semi-formal description of a digital object repository was introduced by Kahn and Wilensky [88]. They introduce many key concepts that have been built upon subsequently by researchers. Important amongst these concepts are the notion of a digital object and its handle, as discussed previously. They also describe the notion of a repository, and a repository access protocol (RAP) which may be used to control the deposit of, and access to, digital objects in the repository.

A repository is a logical entity, in that the name by which a repository is identified need not correspond to a unique server. Rather, it may correspond to a list of servers, each of which is responsible for the stewardship of digital objects stored in that repository. This necessarily makes the concept of a repository a distributed one.

In the Kahn-Wilensky approach, digital objects are accessed via their associated handle. Handle servers are used to locate a repository at which a digital object is stored. This repository can then be used to access the digital object by means of the RAP. Access to a digital object at a repository is by means of the digital object’s handle, a service type, and possible additional parameters. Such an access is termed a dissemination.

Kahn and Wilensky [88] define only three service types for their RAP: ACCESS_DO, DEPOSIT_DO, and ACCESS_REF. The first two are for accessing and depositing a digital object, respectively. The third is to access the reference services of the repository. This allows a client to determine the contents of a repository.
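
Rendered as an abstract interface (a sketch only; Kahn and Wilensky define the RAP abstractly, not as a programming API, and the method signatures here are assumptions), the three service types might look like this:

```python
from abc import ABC, abstractmethod

class Repository(ABC):
    """Sketch of the three Kahn-Wilensky RAP service types."""

    @abstractmethod
    def access_do(self, handle: str, service_type: str, **params) -> bytes:
        """Disseminate the digital object named by handle."""

    @abstractmethod
    def deposit_do(self, handle: str, digital_object: bytes) -> None:
        """Deposit a digital object into the repository."""

    @abstractmethod
    def access_ref(self):
        """Access the repository's reference services, e.g. list contents."""
```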

Lagoze [101] describes an implementation of the Kahn-Wilensky repository approach called Inter-operable Secure Object Stores (ISOS). ISOS uses the Common Object Request Broker Architecture (CORBA) distributed object system as its underlying implementation mechanism.

Dienst [46, 104] is another implementation of the Kahn-Wilensky approach. Dienst uses HTTP as its basic transport protocol, and partitions digital library functionality into several basic services. These services may be further refined by arguments to the service type when accessing digital objects within Dienst, analogous to the service type and parameter mode of repository access in the Kahn-Wilensky model. Dienst uses handles to name its objects.

The Networked Computer Science Technical Report Library (NCSTRL) [111] previously used Dienst for its digital library services. NCSTRL has been subject to active development, primarily to extend to it the notion of multiple collections or genres [138].

The Cornell Digital Library Research Group designed the Cornell Reference Architecture for Distributed Digital Libraries (CRADDL) infrastructure to support the collection concept [103]. CRADDL is a layered model, and uses CORBA for its implementation. The repository service layer of CRADDL is called the Flexible and Extensible Digital Object Repository Architecture (FEDORA) [151]. FEDORA is designed to support the bucket model of digital objects and metadata, as described in Section 4.4. The FEDORA approach invests much of the intelligence within the objects themselves. Each object is opaquely packaged and contains a primitive disseminator through which the actual contents of the object may be inspected and retrieved. The primitive disseminator can be thought of as a set of basic bootstrapping primitive methods that allow a remote entity to access the particular content of a given object. The primitive disseminator guarantees a minimal set of services for handling an object. Handling, in this context, also means incorporating new content into a digital object [151].

Because much of the functionality resides within the objects themselves, the RAP in FEDORA is relatively simple, and involves mainly the creation and deletion of digital objects, and the association of URNs with them. The repository also provides an environment in which content access and authorisation methods may be executed [151].

4.5.2 Other Repository Approaches

An alternative approach to digital object repositories is to view digital object stores as semi-structured databases. The metadata of a digital object is stored in fields in a database table, whilst the data of a digital object are treated as binary large objects [77] stored in the database. A database provides a flexible method of searching and storing metadata, and the advances in distributed databases provide a ready platform upon which distributed digital object repositories may be deployed.

IBM Digital Library [85] employs the database approach. Its architecture consists of a library server and one or more object servers. The library server handles the metadata and searching, whilst the object servers handle storage and delivery of the actual digital content.

A client accessing the digital library gains access to a digital object via the library server. Once the library server has performed authentication and located the object server storing the digital object the client wishes to access, the digital object is transferred directly between the object server in question and the client, requiring no further participation from the library server.

Although there may be multiple object servers, and those object servers may be attuned to the content they deliver, there is only a single library server per digital library. This creates a bottleneck in the architecture, especially in metadata-intensive applications. Because the library server is a database, though, it should be possible to parallelise it to allow it to scale up.

The Oracle Internet File System (iFS) is another database-driven file system [142]. The underlying database engine allows more flexible and extensible metadata to be attached to files than is supported by traditional file systems. It also supports flexible “views” of files within the file system. The iFS also supports versioning of files, and check-in and check-out access for file update. The protection model is based on access control lists, rather than on permissions as in many file systems.

Files in iFS are parsed and stored in a canonical internal format. Renderers are used to reconstruct parsed files in an application-specific way. One of the features of iFS is that it aims to be “protocol independent” when accessing files. The iFS can be accessed via SMB, HTTP, FTP, IMAP4, and SMTP. XML and Java are the technologies through which the iFS API is used.

Like the IBM Digital Library, iFS is a step towards content management rather than file management. Content management is similar in most respects to the area of digital libraries, in that they share common goals and services. There is perhaps more emphasis on work-flow, though, in the area of content management.

The SDSC Storage Resource Broker (SRB) architecture [14] is an API for managing access to widely distributed, heterogeneous data objects. One notable feature of SRB is that it allows access to objects stored in a variety of storage systems, such as databases, file systems, and tertiary storage, providing a simple, clean interface to the client. The Storage Resource Broker uses a metadata catalogue (MCAT) to keep track of resources belonging to collections managed by SRB.

The Stanford Infobus [146] is a CORBA-based digital library object repository mechanism for mediating access to distributed digital objects. The Infobus uses library services to perform various functions on objects stored within the Infobus. Library service proxies mediate client access to the Infobus and its library services (and hence to the objects therein). The University of Michigan Digital Library [53] uses an agent-based approach to broker access to data stored within it.

Many of these systems are designed to handle widely distributed collections, though, and do not focus on local, cluster-based repositories. This complicates the design requirements, placing stricter emphasis on authentication, heterogeneity, and network infrastructure. This, in turn, makes the opportunities for optimisation scarcer.


5 Proposed Research

Several important themes have emerged in the preceding discussion. Chief amongst these are the following: there is a great demand for scalable storage; a storage system’s workload has significant impact on its performance, and on what design helps optimise that performance; different types of files have different access semantics and requirements (e.g., high bandwidth vs. low latency); a storage system’s performance will significantly improve if it is tuned to its workload; there is a trend towards serverless, intelligent disks and self-management of data; and digital library object repositories are characteristically different from ordinary general-purpose file systems, primarily in that they are object-based, have richer metadata, and are read-mostly.

Previous work on storage systems has been almost exclusively in the context of operating system services, in which the policies and mechanisms have been limited by the demands of overall system performance. For example, the need to provide CPU time to running jobs, and to maintain a good response time to users, means that time spent, say, in the block allocation algorithm for newly-written blocks must be relatively limited, placing greater limits on the amount of layout optimisation that can be achieved. It is only with the recent advent of intelligent disks and network-attached storage that dedicated processing power and autonomy for file system related activities has the potential to improve storage efficiency and I/O performance significantly. Advances in hardware, and the increased emphasis on storage, mean that it is now cost-effective to devote more substantial computing resources to supporting storage than was possible before.

File type (and, by extension, file semantics) has enormous impact on file storage and access performance. Yet, aside from the application-driven file I/O prominent only in parallel computing, file type has not played a role in file system design. Part of this stems from the need to provide a system-wide compromise solution within the available system resources. However, a larger problem lies with the difficulty of identifying file types, especially as storage systems often deal with a lower-level abstraction than files, namely that of blocks. At the block level, it is harder to optimise macro-scale performance. Yet it is that macro-scale behaviour that can seriously affect performance. Looking only at the smaller scale can miss potential optimisations.

With digital libraries, digital objects carry much richer explicit metadata than do conventional files in file systems, allowing for closer “typing” of files via explicit cataloguing, rather than relying on guessing via signature “magic numbers.”

In this research, we propose to investigate the use of metadata and workload information to improve cluster-based scalable storage system performance. In doing so, we will address several of the themes mentioned at the beginning of this section.

5.1 Measuring Storage Systems

For the purposes of this research, we must identify the aspects of storage performance in which we are interested. We must then devise some way to measure those aspects, to provide an objective means of observing the performance of the different schemes employed.

There are several aspects of storage system performance worthy of investigation. These aspects include the following: 1) access speed; 2) storage utilisation; 3) cohesion; and 4) fault tolerance. In cluster or networked storage, we are also interested in network overheads.

For access speed, two time measures are of interest: access latency and sustained transfer speed. The former is the delay between requesting an object stored in the system and receiving the first data. The latter is the average rate of data transferred during the actual data transfer period. Sustained transfer speed may be calculated both with and without the initial latency included.
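
Concretely, given the timestamps of the request, the first byte received, and the last byte received, the measures can be computed as below (a sketch; the function and variable names are assumptions):

```python
def access_speed(t_request, t_first_byte, t_last_byte, nbytes):
    """Access latency, plus sustained transfer speed computed both
    without and with the initial latency included."""
    latency = t_first_byte - t_request
    sustained = nbytes / (t_last_byte - t_first_byte)   # latency excluded
    effective = nbytes / (t_last_byte - t_request)      # latency included
    return latency, sustained, effective
```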

Storage utilisation usually measures the ratio of the amount of data stored in the system against the amount of storage allocated for storage. This may include all ancillary metadata structures. Almost always, it measures the percentage of space “lost” due to blocking, i.e., the number of bytes unused in an allocated block.

Cohesion is sometimes referred to as fragmentation, and usually is a measure of the spatial locality of the blocks comprising a file. Cohesion could also measure temporal locality of reference. For the purposes of our research, the amount of gross head movement is a good measure of overall file system cohesion.
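For instance, given the sequence of block positions touched while servicing requests, gross head movement can be computed as the total seek distance. This is a minimal sketch; the block positions are illustrative:

    def gross_head_movement(block_positions):
        """Total seek distance (in blocks) implied by servicing accesses in order.

        A perfectly cohesive (contiguous, sequential) layout minimises this sum.
        """
        return sum(abs(b - a) for a, b in zip(block_positions, block_positions[1:]))

    # A fragmented file scattered over the disk seeks far more than a contiguous one.
    print(gross_head_movement([10, 11, 12, 13]))       # contiguous: 3
    print(gross_head_movement([10, 900, 40, 7000]))    # fragmented: 8710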

Fault tolerance is the ability to operate in the presence of, and recover from, faults. Faults may include the loss of hardware components such as disk blocks, entire disks, disk controllers, network adapters, or cluster hosts. Faults may also include software failures due to application or operating system bugs. Fault tolerance may complicate storage system design and lower performance. However, some fault-tolerant techniques, e.g., replication, may actually improve performance by increasing the opportunities for parallelism in data access. Measures of fault tolerance are usually in terms of system performance in the presence of defined faults. For example, the access latency and sustained transfer speed could be reported for increasing numbers of disk failures.

Finally, network overheads measure the amount of network traffic needed in addition to that needed actually to transfer data. Such overheads may include the control and coordination messages needed to effect a transfer, especially if data are striped across more than one storage unit.

5.2 Online workload characterisation

If the type and past access history of a file provide useful data as to its efficient storage and access, it would be useful to have an accurate snapshot of the digital object repository contents available. Most file system work either makes static assumptions about its workload or seeks to exploit temporal locality of reference. But the handling of a large multimedia file might benefit from a radically different approach to that of a small text file. Even within the same type of file, some files might be more active than others. There may be characteristic access patterns common to certain file types, and individual files of those types may exhibit discernible access patterns of their own.

Although we believe the digital object repository workload has significant impact on its performance, there is currently little research into the actual workload characteristics of digital libraries. Much of the digital library metrics work has focused on user interface concerns, and on low-level information retrieval concerns such as query lengths and term frequencies.

We believe an “instrumented” storage system would not only provide useful research data on digital library content usage, but would also be a useful input into storage system placement and access algorithms during its actual operation.

There are two levels of access information in which we are interested. The first is object-level access data. This charts access on an individual digital object basis. We can think of this as a metadata-level trace. The other level is that of the individual disk storage units comprising the storage cluster. This is more of a block-storage-level trace, and measures in part the overall utilisation of an individual disk node in the system.

For the metadata-level trace, we are particularly interested in correlating file attributes and access statistics with the type of digital object. In particular, we are interested in whether access for a given file type is typically read or write, random or sequential. For file attributes, we are especially interested in file size. Combined with characteristic access mode, file size is an important component in determining an effective block size for a given object, and whether or not caching will be effective.
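A sketch of the kind of per-object record such a metadata-level trace might maintain is shown below; the field names and the 80% sequentiality cut-off are illustrative assumptions, not a fixed design:

    from dataclasses import dataclass

    @dataclass
    class ObjectTraceRecord:
        """Per-object access statistics gathered by the metadata-level trace."""
        object_id: str
        object_type: str           # e.g. "application/pdf", "video/mpeg"
        size_bytes: int
        reads: int = 0
        writes: int = 0
        sequential_accesses: int = 0
        random_accesses: int = 0

        def read_write_ratio(self):
            return self.reads / max(self.writes, 1)

        def mostly_sequential(self):
            total = self.sequential_accesses + self.random_accesses
            return total > 0 and self.sequential_accesses / total >= 0.8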

For the block-level trace, we are interested in disk node utilisation, both in terms of idleness and node storage utilisation. This will give data on load balancing (in terms of workload and storage allocation) across the disk nodes.


5.3 Specialisation

The central question in this research is whether the special characteristics of digital libraries object repositories in a cluster-based storage environment can be used to improve performance. Specifically, can type-specific information and a read-mostly collection be exploited?

One important way in which type-specific information for an object can be exploited is to provide different storage and handling methods according to object type. So, for example, a different block size, file layout, caching policy, and striping approach may be adopted for MPEG multimedia files than for text documents.

One way to cater to object specialisation is at the software level. A repertoire of policies and mechanisms may be known to the storage system, and these employed according to an object’s determined characteristics (e.g., object type).

Another form of specialisation is at the hardware level. Storage devices have different characteristics: a device with low latency and low bandwidth may benefit certain types of files more than a device with high latency but high bandwidth. In addition, because certain storage organisations benefit certain workloads to the detriment of others, storage devices also should be allowed to specialise for certain tasks. Rather than being a jack of all trades and master of none, a storage device should be allowed to excel at the tasks best attuned to its hardware.

One possible disadvantage of such specialisation is that it may be done badly. For example, too many devices may specialise in catering to write-intensive objects when the global workload of the storage system has few writes. It is hoped that the ongoing online workload characterisation mentioned previously would provide feedback in that regard, allowing for dynamic reconfiguration of device roles or collection apportionment. If we think of a storage system as a kind of market economy, device specialisation would be the supply, and the workload statistics an indication of demand. Attuning the two would be a long-term goal of the storage system.

5.4 Exploiting Low Volatility

It is contended here that digital libraries object repositories exhibit less volatility than general-purpose file systems. In particular, they tend to be “read mostly,” in that the ratio of reads to writes is very high. Part of this stems from the generally long-term archival nature of digital libraries object repositories. Another contributing factor is the overt cataloguing effort that takes place in digital libraries compared to ordinary file systems. A result of this is that such objects tend to be longer-lived, and longer-lived objects make long-term optimisation more worthwhile.

It is anticipated that most explicit write activity in a digital libraries object repository will take place when new objects are deposited in the system. To isolate the effects of this activity from normal read activity, one approach would be to adopt a “staging area” technique in which designated write-optimised storage units accept new content. After a new object is written and analysed, it can be relocated to a more permanent home on another storage device (or across multiple devices).

A big advantage of using a staging area in a read-mostly environment is that the staging area can be optimised for writing and recovery. In particular, log-based storage techniques and expensive persistent caches can be employed in staging area devices, thereby improving and speeding up crash recovery. In addition, file system studies have found a general trend that files tend to be either very ephemeral or else relatively long-lived. Like generational garbage collection, a staging area allows longer-lived objects to be promoted to longer-term or deeper storage within the system.

If we accept the thesis that in a read-mostly environment most files are typically written once and then read thereafter, the overhead of moving an object out of the staging area is a small cost easily amortised over the lifetime of the file. Yet the staging area provides an opportunity for excellent write and recovery performance, and is an effective way to “weed out” those objects which are indeed ephemeral.

Objects in the staging area could be “flushed” to more permanent storage according to various policies. For example, objects could be flushed after attaining a certain age (similar to the asynchronous buffer cache flushing daemon implemented in many Unix systems); when the staging area attains a designated “fill level” (i.e., like a cache replacement policy); or when an optimal location in long-term storage has been determined for the object.
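The following sketch combines the first two flush policies just mentioned; the age and fill-level parameters are hypothetical tuning constants:

    import time

    MAX_AGE_SECONDS = 24 * 3600   # flush objects older than one day (assumed)
    FILL_THRESHOLD = 0.85         # flush when staging area is 85% full (assumed)

    def select_for_flush(staged, capacity_bytes, now=None):
        """Return staged objects to migrate to long-term storage.

        `staged` maps object_id -> (deposit_timestamp, size_bytes).  Objects are
        flushed once they age sufficiently; if the staging area exceeds its fill
        threshold, the oldest objects are also flushed until it drops below it.
        """
        now = now or time.time()
        victims = {oid for oid, (t, _) in staged.items() if now - t > MAX_AGE_SECONDS}
        used = sum(size for _, size in staged.values())
        # Evict oldest-first until below the fill threshold (cache-style policy).
        for oid, (t, size) in sorted(staged.items(), key=lambda kv: kv[1][0]):
            if used <= FILL_THRESHOLD * capacity_bytes:
                break
            victims.add(oid)
            used -= size
        return victims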

Note that the staging area is simply an extreme form of the specialisation approach described in Section 5.3, except that instead of specialising according to certain file types, the specialisation is more towards a particular function (i.e., writing). Exploiting such specialisation is an interesting topic for future research.

5.5 Decentralising the Name Space

A common theme emerging from workload studies is the following phenomenon: pick a byte at random, and it is likely to belong to a large file; pick a file name at random, and it is likely to name a small file. So, although the name space is dominated by small files, the transfer bandwidth is dominated by large files. Furthermore, name space operations are characteristically small transfers favouring low latency, and often are bursty in nature. This favours a decoupling of the control and data pathways of the storage system, allowing optimisation of network traffic on each.

As we have seen previously, digital libraries object repositories are object-based rather than block-based. Content is accessed by name, or more abstractly, by invoking a method at a name. This resolution mechanism is a significant workload for the system to bear. Clearly, handling such a task at a single “gateway” node to the storage cluster is not scalable. Serverless file systems indicate clearly that scalable performance is attainable by distributing the server burden across all nodes in the system: in effect, every node becomes both client and server.

In a cluster-based/SAN environment there is the complicating notion of the storage nodes inhabiting a “private” network, whilst requests enter the system via some portal connecting to the outside “public” network. Necessarily, there is some finite entry window. However, this entry load can be mitigated by minimising the task performed at such a gateway. Typically, the job of gateway machines is to perform some kind of efficient packet routing or request hand-off, using techniques such as distributed packet rewriting [11, 81] or locality-aware request distribution [147], for example.

The approach to be employed in this research is to pseudo-randomise the name space across the nodes in the storage cluster by means of a cryptographic one-way hash function, such as MD5 or SHA, applied to the URN. The MD5 hash function hashes an arbitrary string to a 128-bit quantity with low probability of collision. This 128-bit quantity will represent the object ID of the digital object named by the URN, and can be considered a primary key for all access to the object. We can see MD5 as a way of randomising a potentially highly-correlated initial key space (URNs with likely highly clustered prefixes) to a uniformly distributed one.
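A minimal sketch of this name-space randomisation follows. The final modulo step is a simplification for illustration only: the actual mapping of object IDs to servers would be performed by the LH*-style addressing discussed below, which adapts as buckets split and merge.

    import hashlib

    def object_id(urn: str) -> int:
        """Map a URN to a 128-bit object ID via MD5 (the primary key for access)."""
        return int.from_bytes(hashlib.md5(urn.encode("utf-8")).digest(), "big")

    def home_server(urn: str, num_buckets: int) -> int:
        """Pick a home server for an object (plain modulo, for illustration)."""
        return object_id(urn) % num_buckets

    # Hypothetical URN for an electronic thesis.
    print(home_server("urn:etd:vt-2001-12345", 1000))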

SDDS research has usually focused on minimising the number of servers required for a given file, utilising new servers only when load dictates. In our case, our servers are dedicated to storage, so not to utilise all of those available would be inefficient. In practical terms, this means setting the initial number of buckets to the number of servers in the storage cluster (or to a subset earmarked for metadata storage, should we wish not to use all of the disks in the cluster for metadata storage). The problem, then, involves what happens when new disks are added to the system, or existing disks are removed. This question is addressed by the splitting and merging policy employed.

SDDSs are designed to grow and shrink the file according to load. Growing involves splitting an existing bucket, transferring a portion of the existing data to the newly created bucket. The split policy has significant impact on the performance of the SDDS. Existing split criteria revolve around either a bucket overflowing (this presumes a fixed-size bucket), or the bucket’s utilisation exceeding some split threshold (e.g., between 70% and 100%). Splitting can be undertaken concurrently, and with or without the need for a designated split coordinator [120].

Because our storage servers are almost certainly heterogeneous, a fixed split criterion is undesirable for all. The load split threshold should be computed as a function of total space utilisation in the server. (This implies that each server manages logically variable-sized buckets.) Vingralek et al. [198] examine an SDDS with emphasis on server load management. They also use an interesting indirection between bucket numbers and server numbers. Not only does this allow multiple logical buckets to be handled by the same server, but it also allows load to be managed by bucket migration as well as bucket splitting.
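A sketch of such a utilisation-driven split criterion appears below; the 80% base threshold is an illustrative assumption:

    def should_split(bucket_bytes, server_capacity_bytes, base_threshold=0.8):
        """Decide whether a server's bucket should split.

        Instead of a fixed bucket size, the split criterion is driven by the
        server's total space utilisation, so heterogeneous servers each split
        at a point appropriate to their own capacity.
        """
        utilisation = bucket_bytes / server_capacity_bytes
        return utilisation >= base_threshold

    # A 40 GB server splits later (in absolute bytes) than a 10 GB one.
    GB = 2**30
    print(should_split(9 * GB, 10 * GB))   # True:  90% full
    print(should_split(9 * GB, 40 * GB))   # False: 22.5% full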

When a new disk is added to the cluster, we can choose to split one or more existing buckets to distribute load onto the new disk, or simply migrate existing buckets from multiple-bucket servers based upon current individual server load. When a disk is explicitly removed from the cluster, we can either migrate its buckets onto the remaining servers, or else trigger a merge (the opposite of splitting) of its buckets.

It is important to note that this distribution of the name space across the servers in a cluster implies the hashing of an object (based upon its URN) to a “home server” for that object. The home server of an object is responsible for maintaining the metadata associated with that object. Depending upon the object, its home server may also manage the storage associated with the object. We decouple metadata storage from object storage. The primary reasons for this are that the metadata can be used to give valuable hints as to the most efficient location and layout for an object, and that some objects may be unable to reside within a single storage server.

As well as managing the metadata of an object, we also assign to its home server the important task of scheduling the I/O transfer of the object. In this way, as well as distributing the name space over the storage nodes, we also distribute the request and buffering workload. It is our thesis that the uniform randomisation of the URN name space will also uniformly randomise the workload, especially as load increases.

For the purposes of this research, we will not address encoding for fault tolerance, and assume a reliable communications and server infrastructure.

5.6 Exploiting Object Metadata

We described in the previous section an approach to distributing the name space over the storage cluster in a scalable fashion. In fact, this actually distributes the object metadata and object retrieval workload over the storage cluster. It is another of our central contentions that object metadata can be exploited to improve the layout and performance of digital objects.

In our scheme, we decouple metadata handling and data storage. That is, the layout and I/O of an object can be wholly independent of how its metadata is handled, and the metadata contains information as to the disposition and retrieval method of the object to which it applies. We can see this as an explicit separation of policy and mechanism for object storage. In effect, we view storage I/O methods as ones that can be overloaded on a per-object basis.

A storage system performs best when it is tuned to its workload. However, file systems usually are built under static assumptions and do not adapt well to the objects they hold. Furthermore, much file system design has been undertaken under the assumptions of small disks, small memories, and limited processing power. This has led to an emphasis on maximising space utilisation and has limited file allocation strategies (which must execute under limited CPU resources). The much larger disks of today, and the high collective processing power of a cluster of “active disks,” allow us to be much more adventurous in storage management.

File system traces over the years point to the fact that most files are read, not written, and that most I/O tends to be whole-file I/O. Furthermore, many new files are deleted shortly after creation, and most long-lived files tend to lie dormant. The “read-mostly” aspect is particularly true of digital libraries, in which new object creation is a considered act due to the explicit cataloguing that accompanies it. There are few available usage traces of extant digital libraries, though indicators of digital library workloads may be inferred from the behaviour of large-scale distributed data repositories such as the World Wide Web, and various audio and generic peer-to-peer file-sharing protocols. These systems are read far more than they are written, in that the ratio of object reads to writes is very high. This asymmetry is recognised to the point of the introduction of communications systems such as Asymmetric Digital Subscriber Line (ADSL), which model the observed phenomenon that most clients download far more data than they upload.

Another significant impact on performance comes from tuning the block size to the object being stored. Larger block sizes give better performance for large files because they mitigate the tremendous impact that head seek movement has upon overall transfer time. Traditionally, smaller block sizes have been used out of a desire to minimise disk space wasted due to partial block usage. With larger disks and higher-bandwidth controller interfaces, head movement rather than disk space is likely to become the more precious commodity. Clearly, the trend is towards larger block sizes.

Although file type is not often tracked, the file system traces that have attempted to track the type (semantics) of a file and the way it is accessed have found a strong correlation between the two. Except for specialised types of file servers, such as multimedia servers, file systems often cater to a “one size fits all” model. We recognise that individual file semantics can have enormous performance implications, and that knowing the semantics and past history of a file can offer valuable hints as to its performance optimisation.

We will use the information provided by digital object metadata not only to track its high-level semantic information, but also to store and help infer its low-level storage disposition. We will also consider access history and performance statistics to be part of a digital object’s metadata, even though normally only a fraction of this information is explicitly catalogued (typically the object’s creation date). Hence, in the “bucket” view of metadata we adopt here, as discussed in Section 4.4, the object layout and performance information become two elements of the bucket maintained by the system.

Buckets are accessed abstractly in an object-oriented fashion, in that all access to the bucket contents is via interface methods provided by the bucket. The same given method may behave differently according to the contents of the bucket, i.e., the methods are inherently overloaded by the elements of the bucket, and a general mechanism is provided to enable the exported methods to be enumerated and hence used.

The storage system can take advantage of this abstract model by implicitly overloading the storage and retrieval of each element of the bucket. If not explicitly overloaded, a de facto global method can be used. Alternatively, the element’s type can be used to obtain a storage and retrieval method for that type of object. Finally, for elements that do not conform to a type, or that require explicit handling, the element’s storage and retrieval method can be overloaded individually.

We can thus view the access methods as belonging to an abstract class hierarchy according to the elements, and the storage metadata as providing input data to those access methods.

In this research, we will focus on the following tunable storage parameters: block size and striping threshold. We will be particularly interested in block size and in when it is worthwhile to stripe a file across more than one storage server.

Historical studies have shown that most requests are whole-file sequential reads of small objects. Ideally, then, the goal should be to service an entire request within a single node. This maximises the parallelism of the entire storage cluster, because a request for a given object does not tie up (involve the use of) more than one disk head assembly, and disk head movement is the major limiting factor on disk performance. Historical studies also show, however, that most bytes stored are concentrated in relatively few objects. A way of achieving more equitable node utilisation, therefore, would be to stripe large objects over larger (less utilised) storage nodes in preference to smaller ones. This prevents larger objects from monopolising smaller storage nodes, but implies a more careful placement strategy when an object breaches the “striping threshold,” the point at which it should be stored across more than one node.

Fortunately, the aggregate computing power available in a storage cluster, combined with the self-management of the active disk approach, assists greatly in the placement of large objects. Several strategies are possible. One, pursued here, is for each storage node to maintain statistics on its own utilisation, as well as a rough image of the utilisation of the rest of the storage cluster. By “rough image” we mean that the utilisation need not be accurate, only approximately so. Nor do we imply that it necessarily be complete: it may consist only of knowledge of a node’s “neighbourhood” (as determined by network topology, say), or of a cache of recent responses to utilisation queries. Utilisation need not be solely in terms of storage (space) utilisation. It may also factor in the hardware characteristics of the storage node, e.g., the disk and CPU speed, amount of RAM, and so on. Abstractly, utilisation is a measure of load, weighted towards storage space, and each node is free to advertise its own utilisation according to its own local policy.

Based upon utilisation, a storage node needing to stripe an object across several nodes will solicit the assistance of sufficient extra nodes to store the striped portions, depending upon the chosen stripe size (decided based upon the object type) and the utilisation knowledge this node has of the other nodes in the cluster. Such dispersal of an object is partly a contract between the “home node” of an object and the other nodes across which it is striped. Incomplete knowledge of the true utilisation of other nodes means each node will operate an admission policy for striped foreign objects, and, necessarily, there will be a negotiation phase, as each node proffered a portion of an object being striped will decide whether or not to accept it. (Such a solicitation/negotiation phase is ripe for investigation of possible strategies, but we will leave this largely for future research.)

Utilisation measures advertised by a node will be normalised by the node to a number between 0 and 1. This makes it easier for a node to compare a set of returned utilisation figures. However, it is also useful for probabilistic load balancing. By that, we mean a node can use its utilisation as the probability of rejecting a request to store foreign data on that node. It can also use the utilisation as the probability of ignoring a request to broadcast its current load (and hence lower the probability of being asked to store further foreign objects). Unfortunately, using the utilisation directly as the probability of ignoring broadcasts has the disadvantage that most nodes will respond when the storage cluster is largely empty. We can combat this somewhat, for example, by thresholding the probability of ignoring a request to some minimum value, so as not to burden the network. So, the ignore probability could be set at max(t, u), 0 ≤ t, u ≤ 1, where t is the floor threshold just mentioned and u is the node utilisation. A refinement would be to have t decay from an initial elevated threshold value based upon the elapsed time since the last reply to a broadcast load query.
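The ignore probability just described can be implemented directly. This is a minimal sketch; the floor threshold value is illustrative:

    import random

    def ignore_probability(utilisation, floor_threshold):
        """Probability of ignoring a broadcast load query: max(t, u), 0 <= t, u <= 1."""
        return max(floor_threshold, utilisation)

    def respond_to_query(utilisation, floor_threshold, rng=random):
        """A node replies to a load broadcast only if it 'wins' the coin flip."""
        return rng.random() >= ignore_probability(utilisation, floor_threshold)

    # A lightly loaded node with a 0.3 floor still ignores ~30% of queries,
    # preventing every near-empty node from flooding the network with replies.
    replies = sum(respond_to_query(0.1, 0.3) for _ in range(10_000))
    print(replies / 10_000)   # roughly 0.7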

We will develop a metadata set for storage layout and retrieval purposes that encompasses both disposition tracking and performance history.

5.7 Improving Performance in a Read-Mostly Repository

Objects in a read-mostly environment are read much more often than they are written. Typical engineering/office environments have measured an 80:20 ratio of reads to writes for normal file systems [144]. (Even then, a good percentage of writes are accounted for by short-lived objects.) It is highly likely that digital library object repositories will exhibit an even higher ratio of reads to writes, and that objects deposited will tend to be longer-lived.

When an object is written in a file system, a decision about its placement must be made under relatively tight time and resource constraints. Rarely is the object’s disposition subsequently reassessed, except at the explicit direction of performance improvement tools run by system administrators. Most file system reorganisation is directed towards reducing internal fragmentation: some file systems provide user-level defragmentation tools; for others, the recourse to fragmentation is to restore the file system’s contents anew from backup. Other file systems (notably the family of log-structured file systems) feature an asynchronous “cleaner daemon” that garbage-collects the file system to ensure free segments are available for writes.

With long-lived, read-mostly objects in a storage environment with relatively large amounts of computing resources available, it seems worthwhile to take greater effort in determining a good storage disposition for a newly deposited object. It is also worthwhile to use the available aggregate computing resources to monitor and optimise the storage of the repository contents.

The major question regarding asynchronous performance optimisation involves its impact on the I/O performance of the storage system. Seltzer et al. [172] report the deleterious performance effect of the LFS cleaner daemon under high utilisation. It is important, therefore, that any background optimisation process operate only during periods of low load, and be able to checkpoint and restart its work.

Because our namespace approach fosters the notion of a “home” server for an object, and because our storage approach is object-based rather than block-based, we are led towards local, decentralised, object-based optimisation rather than a central global optimisation strategy. Active disks and NASD promote local management of storage objects, and are naturally decentralised at the object level.

Some objects, however, are spread over more than one storage server, so there is a need for some cross-server optimisation. Such optimisation requires some knowledge of state (load) at other servers. The namespace SDDS estimates global state in terms of server load, so that offers some input. In addition, there are efficient distributed global state techniques for optimising within a distributed environment. Also, Arpaci-Dusseau [8] discusses implicitly controlled systems in which the cooperating entities do not explicitly exchange control or state information but infer it from various sources. Her particular focus is on implicit job scheduling within a distributed system.

Our read-mostly assumption offers several possibilities for performance optimisation. Firstly, we can treat writes (object deposit) as a special case and forward such objects to a staging area optimised for writing (as discussed previously). The servers comprising the staging area would migrate out those objects that age sufficiently, or which consume sufficiently large space resources to impact the performance of the staging area.

Secondly, we can use ongoing access statistics for objects to determine whether a given object is a good candidate for optimisation. A successful predictive measure of the usage of a file object is its “temperature,” a measure of how often it has been accessed over a given time frame. The more accesses, the “hotter” an object becomes. Object temperature can thus be used as a threshold trigger to attempt to optimise a given object (e.g., by defragmenting it), or the median temperature of all the objects on a storage server may be used to trigger optimisation of the storage server itself.
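A sliding-window implementation of object temperature might look as follows; the window length is an illustrative parameter:

    from collections import deque

    class ObjectTemperature:
        """Sliding-window access counter: the more recent accesses, the 'hotter'."""

        def __init__(self, window_seconds=3600):
            self.window = window_seconds
            self.accesses = deque()

        def record_access(self, timestamp):
            self.accesses.append(timestamp)

        def temperature(self, now):
            # Drop accesses that have fallen out of the window, then count the rest.
            while self.accesses and now - self.accesses[0] > self.window:
                self.accesses.popleft()
            return len(self.accesses)

    t = ObjectTemperature(window_seconds=60)
    for ts in (0, 10, 20, 50, 55):
        t.record_access(ts)
    print(t.temperature(now=70))   # 4: the access at t=0 has aged out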

As discussed previously, the namespace is dominated by small files. There is an incentive, therefore, to keep the initial portion of a given object local to its home storage server (especially as most reads are sequential whole-file reads). This approach is often termed “inode stuffing,” because the inode size is increased to be larger than needed (e.g., to the size of a standard data block), and the tail portion is used to hold the initial data of the file. This is a big performance win for small files because the file metadata and data can be transferred in a single transfer.

Inode stuffing can be used analogously in a cluster-based storage system, in that it can provide a threshold below which data remain local to a node as opposed to being striped over other servers. The higher this threshold, the more local data will be. This will incur less network usage (fetching striped portions from remote servers), but will involve more local disk activity.

Given that an object’s metadata is maintained by its home node, there is good reason to store the initial data at that same node, too. In the case of small objects, this will improve performance, as both can be blocked together, as in inode stuffing. However, we will investigate how the amount of initial data stored at a node (the off-node threshold) affects performance. Modern high-speed networking versus slow disks has made networking overheads perhaps less of an issue. Indeed, network RAM [58] is predicated on the notion that it is now quicker to page data across a network to another system than to page it to local disk, due to the orders-of-magnitude difference in the operational speed of the two media.
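The off-node threshold decision can be sketched as below; the per-type thresholds and the stripe unit are illustrative assumptions rather than proposed values:

    # Hypothetical per-type striping thresholds (bytes); objects smaller than the
    # threshold stay wholly on their home node, like the tail data in inode stuffing.
    STRIPING_THRESHOLD = {
        "video/mpeg": 1 * 2**20,          # stripe multimedia aggressively
        "application/pdf": 16 * 2**20,    # documents usually stay local
    }
    DEFAULT_THRESHOLD = 8 * 2**20

    def placement(object_type, size_bytes, stripe_unit=4 * 2**20):
        """Return ('local', 1) or ('striped', n) for an object of the given size."""
        threshold = STRIPING_THRESHOLD.get(object_type, DEFAULT_THRESHOLD)
        if size_bytes <= threshold:
            return ("local", 1)
        # Number of stripe units, and hence the number of nodes to solicit.
        nodes = -(-size_bytes // stripe_unit)   # ceiling division
        return ("striped", nodes)

    print(placement("video/mpeg", 700 * 2**20))   # ('striped', 175)
    print(placement("application/pdf", 2**20))    # ('local', 1)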

Another issue greatly affecting performance, according to file system trace studies, is the effect of block size on I/O performance. As mentioned previously, larger block sizes favour better performance, the main disadvantage being storage space lost to fragmentation. Historic studies show that most files are small, but most bytes belong to large files. This suggests several strategies for choosing and adapting block sizes.

Firstly, we can favour large files by choosing a large block size. This has the disadvantage of wasting space for small files, and also of being a static scheme. The impact on small files could be mitigated by batching smaller files into the larger blocks, similar to the approach of file systems such as ReiserFS [191] that are optimised for small files yet still perform well for large files.

Secondly, we could choose a simple strategy of having two block sizes: one for objects wholly local to the storage node, and the other, much larger, for objects that have been striped across several servers. Because big objects will be striped, the larger block size will benefit their performance. Similarly, the smaller block size for the smaller, local objects will benefit performance by decreasing space lost to internal fragmentation.

The disadvantage of this approach is chiefly that of the previous technique: it is essentially static. However, it is also more difficult to implement in storage servers consisting of a single disk (e.g., an “intelligent disk”). If the larger block size is not some exact multiple of the smaller one, it becomes difficult to manage allocation within the same disk real estate. Generally, the disk must be partitioned into multiple areas employing a different block size in each area (e.g., two partitions: one using the smaller and one using the larger block size). Choosing how much to allocate to each partition is difficult, as partitions are not easily resized once created.

Storage servers that consist of multiple disks per server node fare better, however. Each disk can be used with a different block size. Once again, the ratio of local (small) to remote (large) objects is not easy to control, but in terms of global cluster usage the effect is not so serious, as nodes can “lease” the extra remote space for use by other servers. If most bytes reside in large files, space in the large-block-sized area will be used effectively by other nodes in the cluster. The question becomes one of sizing the small-block-sized storage, which is used by small objects and metadata. However, the SDDS used to allocate the namespace (metadata) will tend to compensate for undersizing or oversizing any particular storage node by factoring it into the splitting or merging of the items stored at that node.

The previous two approaches have the advantage of being relatively simple to implement and requiring relatively few resources. The greater dedicated processing power available in the storage node suggests something more ambitious should be attempted, and points to a third, more general approach: use variable-sized blocks.

In an object-based, read-mostly system where I/O tends to be whole-file I/O, using large, contiguously allocated disk blocks is clearly advantageous (see Section 3.3.3). We shall thus adopt an extent-based layout approach. We shall use contiguity and semantic file type as the criteria for choosing the block size. As stated before, a threshold must be chosen at which to begin striping an object so as not to monopolise local storage space. Once again, this threshold can be decided according to object file type, or may be determined according to local server needs.

5.8 Summary

Analysis of digital libraries object repository requirements and storage system research has led us to develop an architecture which meets the challenges and requirements of the former by applying the best appropriate practices of the latter.

Digital objects are naturally object-based, not block-based, and have rich hints available via explicitly catalogued metadata. By augmenting this explicit metadata with workload statistics, we can add a new, semantics-driven input to storage access and layout policies and mechanisms. This is not present in conventional storage systems, which, aside from some application-specific domains, treat files as neutral linear byte streams and thus miss a rich potential source of performance optimisations.

Our target architecture is a cluster of intelligent disk nodes that supports a repository access protocol. Higher-level digital libraries functions are assumed to be provided elsewhere.

We propose to use scalable hashing with randomisation as a means of handling both the name space and workload distribution across the nodes of the cluster. Scalable hashing addresses the problem of quickly and scalably locating an object within the cluster, while allowing for the growth or shrinking of the cluster with minimal re-mapping of existing objects to other disks.

Disk seeks are an expensive commodity in any disk access, because they are limited by physical laws that severely constrain their ongoing performance improvement. However, most objects are small in relation to the collection, and most bytes are concentrated in relatively few objects. We thus propose to manage the tradeoff of minimising seeks as a threshold decision of when to stripe an object across more than one disk node. This serves two purposes: it allows even use of space across all disk nodes (and prevents large objects monopolising space on individual disk nodes), and it permits high-bandwidth transfers of large objects by involving more disk head assemblies. The size threshold at which to go “off node” is a matter for investigation. In a simple implementation, it could be a fixed size for all objects. In a more complex implementation, it could be based upon the semantic type of the object, as identified by the object’s metadata or access history.

Archival digital library collections are predominantly read-mostly. Read-mostly objects benefit from large transfer sizes, as they are overwhelmingly accessed in a whole-object, sequential fashion. We propose to use a variable-block-size, extent-based allocation scheme to improve I/O performance by making objects more sequential on disk. This will minimise seeking and improve performance.

Our proposed architecture directly targets the needs of digital libraries object repositories. Beyond all else, it offers a platform for future experimentation tied directly to that important domain.

6 Research Plan and Expected Contributions

The proposed research centres around the theme of designing a scalable, cluster-based digital library object repository based upon digital library collection characteristics. The emphasis is on long-term scalability in an implicitly heterogeneous storage environment. Unlike conventional approaches that are block-based, our emphasis is on object-based storage.

No attempt will be made to implement a working production prototype. Instead, the emphasis will be on the simulation of the proposed concepts and techniques. Simulation enables the investigation of a much wider range of sizes and array of hardware and networking topologies than is possible using actual hardware. This is especially true when investigating scalability, which involves large amounts of hardware resources. Tools such as SynRGen [54] and DiskSim [63] can greatly assist in the simulation process.

First, several real digital library collections will be analysed to obtain collection statistics to use when generating synthetic workloads. One collection will be the repository of electronic theses and dissertations at Virginia Tech. Although this collection does not have a large amount of multimedia associated with it, it does have good metadata, which is important to our studies. It is also essentially a bucket-based repository, in that each thesis or dissertation equates to a bucket.

Another good digital library to be examined is the Perseus digital library at Tufts [41]. This is a digital library centred around the humanities that has been in development for over ten years. It contains eight main collections covering a wide range of material both geographically and in time, including collections on the Classics; papyri; the English Renaissance; London; California; the Upper Midwest; the Chesapeake; and Tufts history. The collections include both texts and images, as well as geographical information in some cases. Some documents have both text transcriptions and page images. The content of the digital library is both large and diverse, though, by nature, somewhat lacking in digital video and audio.

The VARIATIONS digital library at Indiana University [52] contains over 7,000 audio titles, and is a good example of a large collection of digital audio. Its successor project, the Indiana University Digital Music Library, was recently launched. It seeks to synthesise many different types of media related to music, including digital audio, music notation files, score images, and MIDI files. This presents the possibility of a more bucket-based approach than the current VARIATIONS system, but lacks an extant mature collection due to the early stage of the project.

The available digital library collections will be analysed with respect to the criteria expounded in Section 5.1. It is important to ascertain the distribution of object sizes and accesses, both globally and according to semantic file type. Summary statistics will provide input as to the choice of striping threshold for various semantic file types.

Using the collection analysis, we will then generate synthetic collections in order to “scale up” the problem. The synthetic collections will mimic the characteristics of the existing ones, but be orders of magnitude larger. We intend to simulate collections of these sizes: 100 GB, 1 TB, 100 TB, and 1 PB. Each collection will be simulated on clusters of varying sizes: 100 nodes, 1000 nodes, and 10000 nodes. In addition, each cluster configuration will employ disk sizes drawn randomly from these intervals at each node: 5–15 GB, 20–50 GB, and 50–150 GB. (Random disk sizes per node are intended to simulate the heterogeneity that comes over time. Note that some of the synthetic collection sizes will not fit in some of the cluster configurations; e.g., 100 TB will not fit in a 100-node cluster whose maximum node capacity is 15 GB.) We will also simulate two speeds of interconnection fabric: “slow” 10 Mbps and “fast” 1 Gbps. Experimental results will be analysed using R [84].

When generating synthetic collections, we will use the characterisation of existing file types and sizes as a basis, and then scale up the numbers of those files and objects (with random perturbation) to achieve larger collection sizes within the desired simulated ranges. It is the collection size that will be scaled up, not the sizes of individual object types, nor the percentage that each object type makes up of the collection as a whole. We will randomly sample objects from existing collections until the desired collection size is attained in each case, e.g., randomly sample until 1 TB is amassed when simulating a 1 TB collection.
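The sampling procedure can be sketched as follows; the jitter factor and the fixed seed are illustrative assumptions:

    import random

    def synthesise_collection(real_objects, target_bytes, jitter=0.1, rng=None):
        """Sample (with replacement) from a real collection's (type, size) pairs
        until the target size is amassed, randomly perturbing each size by
        +/- `jitter` so the synthetic collection is not an exact replica.
        """
        rng = rng or random.Random(42)   # fixed seed for reproducible experiments
        synthetic, total = [], 0
        while total < target_bytes:
            obj_type, size = rng.choice(real_objects)
            size = max(1, int(size * rng.uniform(1 - jitter, 1 + jitter)))
            synthetic.append((obj_type, size))
            total += size
        return synthetic

    sample = [("text/html", 12_000), ("application/pdf", 400_000), ("video/mpeg", 90_000_000)]
    collection = synthesise_collection(sample, target_bytes=10**9)   # ~1 GB
    print(len(collection), sum(s for _, s in collection))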

Using the synthetic collections as our repository contents, we will simulate a digital library object repository storage cluster using the ideas described previously. Figure 1 shows the general architecture. In the figure, “D” represents disk nodes within the storage cluster, and “G” represents gateway nodes that handle client requests from external networks. The disk storage nodes exist on a private network, not addressable by external networks. The gateway nodes are thus needed to facilitate a “hand-off” of requests to the disk nodes within the cluster. Their functionality is lightweight, and analogous to that of front-end request processors in scalable WWW server clusters.

[Figure 1: Architecture of the digital library object repository storage cluster. Gateway nodes (G) connect the external network to disk nodes (D) on the cluster’s private network.]

External clients accessing the digital library object repository are assumed to support a repository access protocol of the type proposed by Kahn and Wilensky [88]. Global URN resolution is performed at the client by some external service, to locate the actual repository at which the object to which the URN refers is stored. All objects in the repository are accessed via their URN handle. A typical access of a digital object from the repository by an oblivious client proceeds as follows:

1. The client issues an ACCESS_DO request to the repository, with the digital object’s URN handle as argument;

2. If the handle is resolvable to the repository, the repository replies with a reference to the primitive disseminator/bucket associated with the object;

3. The client requests a list of packages/DataStreams contained in the digital object via the reference just returned;

4. The repository replies with a list of packages;

5. The client requests a list of elements contained in a given package from the repository;

6. The repository responds with a list of elements contained in the package;

7. The client requests a list of supported methods of a given element desired to be accessed;

8. The repository responds with the supported methods of the element;

9. The client invokes the desired access method of the element at the repository;

10. The repository returns a byte stream to the client, produced by invoking the access method.

Note that the access sequence just described can be collapsed tremendously depending upon the knowledge the client possesses about the content of the digital object or the collection. For example, if a client knows a priori that a certain access method is supported on an element of a package, then the client can invoke that method on that element of that package of the object directly. In this way, much of the worst-case “discovery phase” of accessing a digital object can be eliminated. We will assume our clients are non-oblivious, and are driven by a higher-level digital library application layer that is somewhat cognisant of the collections it serves. Also, although the bucket-based approach allows for programmatic disseminators, we shall restrict our simulation to the access of static content (i.e., disseminators that return byte streams of stored data, not of data generated algorithmically on the fly).
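The contrast between the oblivious, full-discovery sequence and the collapsed access is sketched below using a toy in-memory repository; the package, element, and method names are invented for illustration and are not part of the Kahn-Wilensky protocol itself:

    # A toy repository keyed by URN; the nesting mirrors bucket -> package ->
    # element -> method.  All names below are hypothetical.
    REPOSITORY = {
        "urn:etd:vt-2001-12345": {
            "thesis-package": {
                "body": {"GET_BYTES": lambda: b"...PDF bytes..."},
            },
        },
    }

    def oblivious_access(urn):
        """Full worst-case discovery: enumerate packages, elements, and methods."""
        bucket = REPOSITORY[urn]                        # steps 1-2: ACCESS_DO -> bucket
        package = next(iter(bucket))                    # steps 3-4: list packages
        element = next(iter(bucket[package]))           # steps 5-6: list elements
        method = next(iter(bucket[package][element]))   # steps 7-8: list methods
        return bucket[package][element][method]()       # steps 9-10: invoke method

    def knowing_access(urn):
        """A non-oblivious client invokes the known method directly."""
        return REPOSITORY[urn]["thesis-package"]["body"]["GET_BYTES"]()

    assert oblivious_access("urn:etd:vt-2001-12345") == knowing_access("urn:etd:vt-2001-12345")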


The first, and most important, aspect to simulate is the effectiveness of the namespace and workload distribution hashing scheme. It is important to verify that this balances load adequately in the limit. We will measure the effectiveness according to three criteria: uniformity of name distribution; equitability of byte distribution; and equitability of workload distribution.

The uniformity of name distribution is assessed by measuring the distribution of the names of all stored objects across all nodes in the storage cluster. Ideally, for a storage cluster of k nodes storing a total of n objects, each node should handle the stewardship of n/k objects. The pseudo-random hashing of object names should ensure a good uniform distribution across the total cluster. This is reinforced by the load-balancing nature of the LH* hashing scheme.
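One simple check of this uniformity is the maximum relative deviation of each node’s object count from the ideal n/k share (a minimal sketch):

    def distribution_imbalance(objects_per_node):
        """Compare each node's object count with the ideal n/k share.

        Returns the maximum relative deviation from the ideal; 0 means perfectly
        uniform stewardship of objects across the cluster.
        """
        n, k = sum(objects_per_node), len(objects_per_node)
        ideal = n / k
        return max(abs(c - ideal) / ideal for c in objects_per_node)

    print(distribution_imbalance([250, 248, 252, 250]))   # 0.008: near uniform
    print(distribution_imbalance([100, 400, 250, 250]))   # 0.6:   a hot node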

The equitability of byte distribution is analogous to that of name distribution. Again, balancing is determined based upon node utilisation relative to the entire cluster. In terms of byte storage, two possible measures manifest immediately: for a system of k nodes, each node could store 1/k of the total bytes, or operate at p% of its total storage capacity. In a storage cluster in which each storage node has the same capacity, both measures are in fact the same. In a heterogeneous environment, though, they are not. The former measure will have smaller disks operating closer to being full, and larger ones closer to being empty. This is not necessarily a bad thing in terms of access performance, because, naively speaking, each disk has only a single head assembly, and head movement is the most precious commodity for disk performance. In a uniformly distributed workload, 1/k byte utilisation at each node equates to 1/k of the object space at that node. A p% utilisation strategy is good only if proportionally bigger objects are being stored at storage nodes with higher capacity.

Equitability of workload distribution is measured by the proportion of time a disk node spends servicing requests, i.e., the amount of time it is “busy.” A well-balanced cluster should have each storage node active the same percentage of time.

To test load handling under the various cluster configurations described above, we also will need several synthetic workloads. As mentioned, some of the collections have available access logs that can be used as a basis for generating synthetic workloads. The important characteristics we want to scale up from these are the proportion of objects within the collection that are accessed frequently, the distribution of accesses by object type, and the distribution of times between successive accesses of the same object.

As well as using synthetic trace workloads to simulate real-world conditions, we will also use SynRGen [54] to generate user-driven workloads. SynRGen can be used to model classes of users, and can algorithmically generate workload traces. A particular user or access behaviour can be modelled in SynRGen, and the output from the SynRGen run used as an input workload to the simulation. In this way, the number of users of the repository can be scaled easily, and different access workloads investigated.

The next task will be to simulate the content management and dispersal aspect. The particular focus here is on the effect of the striping (local versus network) policy. A major goal is to identify whether the bottleneck lies within the nodes or in the interconnection network. Although we will not implement the on-disk layout mechanism in detail in the cluster-level simulation, we will simulate its worst-case and average-case guarantees of contiguity to estimate seek and transfer times.

As mentioned previously, we will not address the issues of reliability, fault tolerance, authentication, and security in this research. They are left as topics for future examination. However, the “closeted” nature of a private storage area network cluster does somewhat mitigate the ill effects of discounting security and authentication.

Table 1 shows the independent variables varied and the dependent variables measured in our simulation experiments. The access latency is the amount of time between a client issuing a request and receiving the first result byte. Disk bandwidth usage is the percentage of time spent transferring actual data against the total amount of time spent on a disk operation. Hence, if a disk transfer operation spends 9 ms seeking and 1 ms transferring data, the operation is utilising only 10% disk bandwidth; if it spends 9 ms seeking and 81 ms transferring data in fulfilling a disk transfer, it is utilising 90% disk bandwidth.

Independent variables (varied)

Name                       Levels
Collection size (TB)       0.1, 1, 100, 1000
Disk nodes                 100, 1000, 10000, 1000000
Disk size (mean GB)        10, 40, 100
Network speed              10 Mbps, 1 Gbps
Striping threshold (KB)    TBD
Workload                   TBD

Dependent variables (measured)

Access latency
Disk bandwidth usage
Network bandwidth usage
Node CPU utilisation
Node disk utilisation
Number of network messages

Table 1: Simulation variables

Network bandwidth usage is the percentage of available network capacity used, analogous to disk bandwidth usage. Node CPU usage is the percentage of time a node spends actively processing requests (from clients or other cluster nodes) versus the time it spends idle, awaiting work to do. Node disk utilisation is the proportion of disk space allocated in a node against the total available disk space in that node. (So, if a node currently uses 5 GB of a 20 GB disk, it is at 25% node disk utilisation.) The number of network messages is the number of control and transfer messages sent over the network (internal and external) in the fulfilment of a client request.
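As a worked illustration of these measures, using the figures from the text (a minimal sketch):

    def disk_bandwidth_usage(seek_ms, transfer_ms):
        """Fraction of a disk operation spent actually transferring data."""
        return transfer_ms / (seek_ms + transfer_ms)

    def node_disk_utilisation(used_gb, total_gb):
        """Proportion of a node's disk space currently allocated."""
        return used_gb / total_gb

    print(disk_bandwidth_usage(9, 1))     # 0.10: 10% disk bandwidth
    print(disk_bandwidth_usage(9, 81))    # 0.90: 90% disk bandwidth
    print(node_disk_utilisation(5, 20))   # 0.25: 25% node disk utilisation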

Our experiments will revolve around a simulation of a digital libraries object repository storage cluster based upon the techniques and principles described in Section 5. The simulation will be instrumented so that its performance and functionality may be measured and modified.

Series 1: Object/Workload Distribution

Vary: Collection size; number of nodes

Measure: Node CPU utilisation; node disk utilisation

Objective: Test equitable distribution of object metadata and client workload across the cluster

Series 2: Striping Policy

Vary: Collection size; number of nodes; striping threshold/policy

Measure: Node CPU utilisation; node disk utilisation; network bandwidth utilisation

Objective: Measure the effect of striping threshold on disk and network usage

Series 3: Metadata-driven Layout

Vary: Collection size; number of nodes; metadata layout policy

Measure: Node CPU utilisation; node disk utilisation; network bandwidth utilisation; access latency; disk bandwidth

Objective: Determine the effects of metadata-driven layout specialisation

Table 2: Series of experiments for simulation

The simulation is intended as a “test bed” for the ideas expressed herein.

We intend to run three main series of experiments to assess the impact of various key ideas put forth in this document. These series are listed in Table 2. The first two series of experiments are largely static tests of storage distribution, as described previously. The third series is intended to explore the effects of using metadata hints on layout policy. Here, we will vary the striping threshold on an object-type basis. For example, a different striping threshold will be used for MPEG files than for PDF files, instead of using the same threshold for both.

Assessing the relative performance of the system is somewhat difficult, given the scaling and domain-specific nature of the area. When varying the number of nodes, it is important to ensure that the system scales smoothly and is free of “hot spots” or overloaded nodes. Performance will be assessed relative to the theoretical maximum, to make it somewhat independent of the actual parameter values given. Chen and Patterson [33] offer possible guidance on designing self-scaling benchmarks that are meaningful across platforms.

There are several expected contributions of this research. Firstly, it will provide important metric data in a field that, to date, has largely ignored it. (Digital libraries “live and breathe” on storage, yet extant metrics work has focused on user interfaces, queries, relevance, recall, and precision. Collection storage analysis is conspicuously missing.)

Secondly, and importantly, it will address storage from a digital libraries point of view. It has long been recognised that digital libraries are not file systems, yet at a low level they reside on them. Not only does this rob them of performance opportunities, but it forces digital library functionality through the needle’s eye of a file system API. This research will provide a digital object, repository-based approach to digital library storage.

Thirdly, the research addresses the critical need for scalability and evolution in digital library storage. With content growing at an alarming rate, storage architectures must meet that growth. This necessitates a heterogeneous, incrementally growing storage architecture: a demand this research addresses.

Finally, another significant contribution of this research is flexible handling of metadata at the storage level. In addition, it contributes the incorporation and use of active storage performance data in ongoing archival operations, along with the exploitation of semantic content in improving storage performance.

References

[1] ACHARYA, A., UYSAL, M., AND SALTZ, J. Active disks: Programming model, algorithms and evaluation. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) VIII (Oct. 1998), ACM, pp. 81–91.

[2] ADAMIC, L. A., AND HUBERMAN, B. A. The Web’s hidden order. Commun. ACM 44, 9 (Sept. 2001), 55–60.

[3] ALON, N., AND LUBY, M. A linear time erasure-resilient code with nearly optimal recovery. IEEE Transactions on Information Theory 42, 6 (Nov. 1996), 1732–1736.

[4] ALVAREZ, G. A., BOROWSKY, E., GO, S., ROMER, T. H., BECKER-SZENDY, R., GOLDING, R., MERCHANT, A., SPASOJEVIC, M., VEITCH, A., AND WILKES, J. Minerva: an automated resource provisioning tool for large-scale storage systems. ACM Trans. Comput. Syst. (2001). To appear.

[5] ANDERSON, D. Consideration for smarter storage devices. Presentation given at the National Storage Industry Consortium (NSIC) Network-Attached Storage Device working group meeting, June 1998. Available from http://www.nsic.org/nasd/.

[6] ANDERSON, T. E., CULLER, D. E., PATTERSON, D. A., AND THE NOW TEAM. A case for NOW (Network Of Workstations). Tech. rep., University of California at Berkeley, 1994.

[7] ANDERSON, T. E., DAHLIN, M. D., NEEFE, J. M., PATTERSON, D. A., ROSELLI, D. S., AND WANG, R. Y. Serverless network file systems. Tech. rep., University of California at Berkeley, 1995.

[8] ARPACI-DUSSEAU, A. C. Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems. PhD thesis, University of California at Berkeley, 1998.

[9] ARPACI-DUSSEAU, R. H., ANDERSON, E., TREUHAFT, N., CULLER, D. E., HELLERSTEIN, J. M., PATTERSON, D., AND YELICK, K. Cluster I/O with River: Making the fast case common. In Proceedings of IOPADS ’99: Input/Output for Parallel and Distributed Systems (Atlanta, Georgia, May 1999).

[10] ASAMI, S., TALAGALA, N., AND PATTERSON, D. A. Designing a self-maintaining storage system. In Proceedings of the 1999 IEEE Symposium on Mass Storage Systems (1999).

[11] AVERSA, L., AND BESTAVROS, A. Load balancing a cluster of web servers using distributed packet rewriting. Tech. Rep. 1999-001, Boston University, 6 Jan. 1999. urn:hdl:ncstrl.bu_cs/1999-001.

[12] BAKER, M. G., HARTMAN, J. H., KUPFER, M. D., SHIRRIFF, K. W., AND OUSTERHOUT, J. K. Measurements of a distributed file system. In Proceedings of Symposium on Operating Systems Principles (SOSP) 13 (Oct. 1991), pp. 198–212.

[13] BARCLAY, T., GRAY, J., AND SLUTZ, D. Microsoft TerraServer: A spatial data warehouse. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (Austin, Texas, USA, May 2000).

[14] BARU, C., MOORE, R., RAJASEKAR, A., AND WAN, M. The SDSC Storage Resource Broker. In Proceedings of CASCON ’98 (Toronto, Canada, 30 Nov.–3 Dec. 1998), IBM Conference. Available via http://npaci.edu/DICE/Pubs/srb.ps.

[15] BARVE, R., SHRIVER, E., GIBBONS, P. B., HILLYER, B. K., MATIAS, Y., AND VITTER, J. S. Modeling and optimizing I/O throughput of multiple disks on a bus. In Proceedings of the 1999 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) (Atlanta, Georgia, USA, May 1999), pp. 83–92.

[16] BECK, M., AND MOORE, T. The Internet2 Distributed Storage Infrastructure project: An architecture for Internet content channels. Computer Networking and ISDN Systems 30, 22-23 (1998), 2141–2148.

[17] BENNER, A. Fibre Channel: Gigabit I/O and Communications for Computer Networks. McGraw-Hill, New York, NY, 1996.

[18] BENNETT, J. M., BAUER, M. A., AND KINCHLEA, D. Characteristics of files in NFS environments. In ACM Symposium on Small Systems (1991), pp. 33–40.

[19] BERENBRINK, P., BRINKMANN, A., AND SCHEIDELER, C. Design of the PRESTO multimedia storage network. In Proceedings of the Workshop on Communication and Data Management in Large Networks (INFORMATIK 99) (Oct. 1999).

[20] BERNERS-LEE, T., FIELDING, R. T., AND MASINTER, L. Uniform resource identifiers (URI): Generic syntax. RFC 2396, Aug. 1998. http://www.ietf.org/rfc/rfc2396.txt.

[21] BERNERS-LEE, T., FIELDING, R. T., AND NIELSEN, H. F. Hypertext Transfer Protocol – HTTP/1.0. RFC 1945, May 1996. http://www.ietf.org/rfc/rfc1945.txt.

[22] BERNERS-LEE, T., MASINTER, L., AND MCCAHILL, M. Uniform resource locators (URL). RFC 1738, Dec. 1994. http://www.ietf.org/rfc/rfc1738.txt.

[23] BERSON, S., MUNTZ, R., AND WONG, W. Randomized data allocation for real-time disks. In COMPCON ’96 (1996), pp. 286–290.

[24] BOROWSKY, E., GOLDING, R., JACOBSON, P., MERCHANT, A., SCHREIER, L., SPASOJEVIC, M., AND WILKES, J. Capacity planning with phased workloads. In WOSP ’98 (Santa Fe, New Mexico, USA, Oct. 1998).

[25] BRANDT, C., KYRIAKAKI, G., LAMOTTE, W., LULING, R., MARAGOUDAKIS, Y., MAVRAGANIS, Y., MEYER, K., AND PAPPAS, N. The SICMA multimedia server and virtual museum application. In Proceedings of the Third European Conference on Multimedia Applications, Services and Techniques (1998), pp. 83–96.

[26] BRODOWICZ, M., AND JOHNSON, O. PARADISE: An advanced featured parallel file system. Tech. rep., University of Houston, 1998.

[27] CALZAROSSA, M., AND SERAZZI, G. Workload characterization: A survey. Proceedings of the IEEE 81, 8 (Aug. 1993), 1136–1150.

[28] CAO, P., LIM, S., VENKATARAMAN, S., AND WILKES, J. The TickerTAIP parallel RAID architecture. In Proceedings of the 20th Symposium on Computer Architecture (May 1993), pp. 52–63.

[29] CAO, P., LIM, S. B., VENKATARAMAN, S., AND WILKES, J. The TickerTAIP parallel RAID architecture. ACM Trans. Comput. Syst. 12, 3 (Aug. 1994), 236–269.

[30] CHANDY, J., AND REDDY, A. L. N. Failure evaluation of disk array organizations. In Proceedings of the International Conference on Distributed Computing Systems (1993).

[31] CHEN, P., LEE, E., GIBSON, G., KATZ, R., AND PATTERSON, D. RAID: High-performance, reliable secondary storage. ACM Computing Surveys 26, 2 (June 1994), 145–188.

[32] CHEN, P. M., NG, W. T., CHANDRA, S., AYCOCK, C., RAJAMANI, G., AND LOWELL, D. The Rio file cache: Surviving operating system crashes. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) VII (Oct. 1996), pp. 74–83.

[33] CHEN, P. M., AND PATTERSON, D. A. A new approach to I/O performance evaluation—self-scaling I/O benchmarks, predicted I/O performance. ACM Trans. Comput. Syst. 12, 4 (Nov. 1994), 308–339.

[34] CHIUEH, T. Trail: a track-based logging disk architecture for zero-overhead writes. In Proceedings of the 1993 IEEE International Conference on Computer Design (ICCD ’93) (Cambridge, MA, USA, 3–6 Oct. 1993), pp. 339–343.

[35] CORBETT, P., FEITELSON, D., FINEBERG, S., HSU, Y., NITZBERG, B., PROST, J.-P., SNIR, M., TRAVERSAT, B., AND WONG, P. Overview of the MPI-IO parallel I/O interface. In Input/Output in Parallel and Distributed Computer Systems, R. Jain, J. Werth, and J. C. Browne, Eds., vol. 362 of The Kluwer International Series in Engineering and Computer Science. Kluwer, Amsterdam, 1996.

[36] CORBETT, P. F., AND FEITELSON, D. G. The Vesta parallel file system. ACM Transactions on Computer Systems 14, 4 (Aug. 1996), 225–264.

[37] CORBETT, P. F., PROST, J.-P., DEMETRIOU, C., GIBSON, G., RIEDEL, E., ZELENKA, J., CHEN, Y., FELTEN, E., LI, K., HARTMAN, J., PETERSON, L., BERSHAD, B., WOLMAN, A., AND AYDT, R. Proposal for a common parallel file system programming interface. Tech. rep., Department of Computer Science, University of Arizona, Tucson, Arizona, USA, 1996. Available via the WWW at http://www.cs.arizona.edu/sio/api1.0.ps.

[38] COURTRIGHT, W. V., GIBSON, G., HOLLAND, M., AND ZELENKA, J. RAIDframe: Rapid prototyping for disk arrays. In SIGMETRICS ’96 (May 1996), pp. 268–269.

[39] COYNE, R. A., AND HULEN, H. An introduction to the Storage System Reference Model, version 5. In Proceedings of the Twelfth IEEE Symposium on Mass Storage (Apr. 1993).

[40] COYNE, R. A., HULEN, H., AND WATSON, R. The High Performance Storage System. In Proceedings of the Conference on Supercomputing ’93 (Portland, Oregon, USA, 15–19 Nov. 1993), Association for Computing Machinery, pp. 83–92.

[41] CRANE, G., CHAVEZ, R. F., MAHONEY, A., MILBANK, T. L., RYDBERG-COX, J. A., AND SMITH, D. A. Drudgery and deep thought: Designing digital libraries for the humanities. Commun. ACM 44, 5 (May 2001), 34–40.

[42] CRESPO, A., AND GARCIA-MOLINA, H. Archival storage for digital libraries. In Proceedings of Digital Libraries 98 (Pittsburgh, PA, 1998), ACM, pp. 69–78.

[43] DABEK, F., KAASHOEK, M. F., KARGER, D., MORRIS, R., AND STOICA, I. Wide-area cooperative storage with CFS. In Proceedings of the 2001 Symposium on Operating Systems Principles (SOSP) (Banff, Canada, 21–24 Oct. 2001), Association for Computing Machinery.

[44] DANIEL, R. A trivial convention for using HTTP in URN resolution. RFC 2169, June 1997. http://www.ietf.org/rfc/rfc2169.txt.

[45] DANIEL, JR., R., AND LAGOZE, C. Distributed Active Relationship in the Warwick Framework. In Proceedings of the Second IEEE Metadata Conference (Silver Spring, Maryland, USA, 16–17 Sept. 1997).

[46] DAVIS, J. R., AND LAGOZE, C. A protocol and server for a distributed digital technical report library. Tech. rep., Cornell University, June 1994. urn:hdl:ncstrl.cornell/TR94-1418.

[47] DEVINE, R. Design and implementation of DDH: A distributed dynamic hashing algorithm. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO) (1993).

[48] DEVLIN, B., GRAY, J., LAING, B., AND SPIX, G. Scalability terminology: Farms, clones, partitions and packs: RACS and RAPS. Tech. Rep. MS-TR-99-85, Microsoft Research, Dec. 1999.

[49] DOUCEUR, J. R., AND BOLOSKY, W. J. A large-scale study of file-system contents. In SIGMETRICS ’99 (Atlanta, Georgia, USA, 29 Apr.–6 May 1999), pp. 59–70.

[50] DRUSCHEL, P., AND ROWSTRON, A. PAST: Persistent and anonymous storage in a peer-to-peer networking environment. In Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems (HotOS 2001) (Elmau/Oberbayern, Germany, May 2001), pp. 65–70.

[51] DUBLIN CORE DIRECTORATE. Dublin Core metadata initiative. WWW page, 2000. http://purl.org/dc.

[52] DUNN, J. W., AND MAYER, C. A. VARIATIONS: A digital music library system at Indiana University. In Proceedings of the Fourth ACM Conference on Digital Libraries (Berkeley, California, USA, Aug. 1999).

[53] DURFEE, E., KISKIS, D., AND BIRMINGHAM, W. The agent architecture of the University of Michigan Digital Library. IEE/British Computer Society Proceedings on Software Engineering 144, 1 (Feb. 1997).

[54] EBLING, M. R., AND SATYANARAYANAN, M. SynRGen: An extensible file reference generator. In SIGMETRICS ’94 (Santa Clara, CA, May 1994), pp. 108–117.

[55] ECS DATA MANAGEMENT ORGANISATION. ECS Data Handling System. World Wide Web site, Nov. 1999. http://edhs1.gsfc.nasa.gov/.

[56] FIBRE CHANNEL INDUSTRY ASSOCIATION. Fibre Channel Industry Association. World Wide Web page, 2000. http://www.fibrechannel.org/.

[57] FIPS 180-1. Secure Hash Standard. U.S. Department of Commerce/NIST, Springfield, Virginia, USA, Apr. 1995.

[58] FLOURIS, M. D., AND MARKATOS, E. P. Network RAM. In High Performance Cluster Computing, B. Rajkumar, Ed. Prentice Hall, 1999, pp. 383–408.

[59] GAIT, J. Phoenix: A safe in-memory file system. Commun. ACM 33, 1 (Jan. 1990), 81–86.

[60] GANGER, G. R. Generating representative synthetic workloads: An unsolved problem. In Proceedings of the Computer Measurement Group (CMG) Conference (Dec. 1995), pp. 1263–1269.

[61] GANGER, G. R., AND PATT, Y. N. The process-flow model: Examining I/O performance from the system’s point of view. In Proceedings of the 1993 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) (May 1993), pp. 86–97.

[62] GANGER, G. R., AND PATT, Y. N. Soft updates: A solution to the metadata update problem in file systems. Tech. Rep. CSE-TR-254-95, University of Michigan, 1995.

[63] GANGER, G. R., WORTHINGTON, B. L., AND PATT, Y. N. The DiskSim simulation environment version 1.0 reference manual. Tech. Rep. CSE-TR-358-98, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, 27 Feb. 1998.

[64] GHANDEHARIZADEH, S., AND IERARDI, D. An algorithm for disk space management to minimize seeks. Tech. Rep. 95–612, Department of Computer Science, University of Southern California, Los Angeles, California, 1995.

[65] GHANDEHARIZADEH, S., IERARDI, D., AND ZIMMERMANN, R. Management of space in hierarchical storage systems. Tech. Rep. 94–598, Department of Computer Science, University of Southern California, Los Angeles, California, 1994.

[66] GHANDEHARIZADEH, S., IERARDI, D., AND ZIMMERMANN, R. An online algorithm to optimize file layout in a dynamic environment. Tech. rep., Department of Computer Science, University of Southern California, Los Angeles, California, 1995.

[67] GIBSON, G. A. Redundant Disk Arrays: Reliable Parallel Secondary Storage. PhD thesis, University of California at Berkeley, 1990.

[68] GIBSON, G. A., NAGLE, D. F., AMIRI, K., BUTLER, J., CHANG, F. W., GOBIOFF, H., HARDIN, C., RIEDEL, E., ROCHBERG, D., AND ZELENKA, J. A cost-effective, high-bandwidth storage architecture. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) VIII (Oct. 1998), ACM, pp. 92–103.

[69] GIBSON, G. A., NAGLE, D. F., COURTRIGHT II, W., LANZA, N., MAZAITIS, P., UNANGST, M., AND ZELENKA, J. NASD scalable storage systems. In Proceedings of USENIX 1999 (Monterey, California, USA, 9–11 June 1999).

[70] GIBSON, T. J., AND MILLER, E. L. Long term file activity patterns in a UNIX workstation environment. In Proceedings of the Fifteenth IEEE Symposium on Mass Storage Systems (College Park, Maryland, USA, 23–26 Mar. 1998), pp. 355–371.

[71] GLADNEY, H. M., AND LOTSPIECH, J. B. Safeguarding digital library contents and users: Assuring convenient security and data quality. D-Lib Magazine (May 1997). urn:hdl:cnri.dlib/may97-gladney.


[72] GOLDING, R., SHRIVER, E., SULLIVAN, T., AND WILKES, J. Attribute-managed storage. In Workshop on Modeling and Specification of I/O (MSIO) (San Antonio, Texas, USA, 26 Oct. 1995).

[73] GRAY, J., AND SHENOY, P. Rules of thumb in data engineering. Tech. Rep. MS-TR-99-100, Microsoft Research, Advanced Technology Division, Dec. 1999.

[74] GROCHOWSKI, E., AND HOYT, R. F. Future trends in hard disk drives. IEEE Transactions on Magnetics 32, 3 (May 1996), 1850–1854.

[75] HARTMAN, J. H., MURDOCK, I., AND SPALINK, T. The Swarm scalable storage system. In Proceedings of the 19th IEEE International Conference on Distributed Computing Systems (Austin, Texas, USA, 31 May–4 June 1999), IEEE Computer Society, pp. 74–81.

[76] HARTMAN, J. H., AND OUSTERHOUT, J. K. The Zebra striped network file system. ACM Trans. Comput. Syst. 13, 3 (1995), 274–310.

[77] HASKIN, R. L., AND LORIE, R. A. On extending the functions of a relational database system. In Proceedings of the 1982 ACM SIGMOD International Conference on Management of Data (Orlando, Florida, USA, 2–4 June 1982), M. Schkolnick, Ed., ACM Press, pp. 207–212.

[78] HIRSCHBERG, D. S. A class of dynamic memory allocation algorithms. Commun. ACM 16, 10 (Oct. 1973), 615–618.

[79] HOLLAND, M., AND GIBSON, G. A. Parity declustering for continuous operation in redundant disk arrays. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) V (1992), pp. 23–35.

[80] HU, Y., AND YANG, Q. DCD—Disk Caching Disk: A new approach for boosting I/O performance. In Proceedings of the 23rd International Symposium on Computer Architecture (Philadelphia, PA, USA, May 1996), pp. 169–178.

[81] HUNT, G., NAHUM, E., AND TRACEY, J. Enabling content-based load distribution for scalable services. Tech. rep., IBM TJ Watson Research Center, Yorktown Heights, NY 10598, 1998.

[82] IEEE STORAGE SYSTEM STANDARDS WORKING GROUP. Mass storage reference model version 5. Tech. rep., IEEE Computer Society Mass Storage Systems and Technology Committee, 5 Aug. 1993. Unapproved draft.

[83] IETF SECRETARIAT. Uniform Resource Names (URN) charter. World Wide Web page, Sept. 2000. http://www.ietf.org/html.charters/urn-charter.html.

[84] IHAKA, R., AND GENTLEMAN, R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 3 (1996), 299–314.

[85] INTERNATIONAL BUSINESS MACHINES. IBM Digital Library, 1998. http://www.software.ibm.com/is/dig-lib/.

[86] INTERNATIONAL BUSINESS MACHINES. IBM Deskstar 25GP and Deskstar 22GXP hard disk drives. Product Data Sheet, Apr. 1999. TECHFAX 7103.

[87] INTERNATIONAL BUSINESS MACHINES. IBM Ultrastar 36LZX hard disk drives. Product Data Sheet, Mar. 1999. TECHFAX 7111.

[88] KAHN, R., AND WILENSKY, R. A framework for distributed digital object services. urn:hdl:cnri.dlib/tn95-01, 13 May 1995.

[89] KARGER, D. R., LEHMAN, E., LEIGHTON, F. T., LEVINE, M. S., LEWIN, D., AND PANIGRAHY, R. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (El Paso, Texas, USA, 4–6 May 1997), pp. 654–663.

[90] KARLSSON, J. S., LITWIN, W., AND RISCH, T. LH*lh: A scalable high performance data structure for switched multicomputers. IDA Technical Report LiTH-IDA-R-95-25, Department of Computer and Information Science, Linköping University, S-581 83 Linköping, Sweden, 1995. ISSN-0281-4250.

[91] KEETON, K., PATTERSON, D. A., AND HELLERSTEIN, J. M. A case for intelligent disks (IDISKs). SIGMOD Record 27, 3 (Aug. 1998), 42–52.

[92] KNOWLTON, K. C. A fast storage allocator. Commun. ACM 8, 10 (Oct. 1965), 623–625.

[93] KOBLER, B., AND BERBERT, J. NASA Earth Observing System Data Information System (EOSDIS). In Proceedings of the Eleventh IEEE Symposium on Mass Storage Systems (Monterey, California, USA, 7–10 Oct. 1991), pp. 18–19.

[94] KOCH, P. D. L. Disk file allocation based on the buddy system. ACM Trans. Comput. Syst. 5, 4 (Nov. 1987), 352–370.

[95] KOHL, U., LOTSPIECH, J. B., AND KAPLAN, M. A. Safeguarding digital library contents and users: Protecting documents rather than channels. D-Lib Magazine (Sept. 1997). urn:hdl:cnri.dlib/september97-lotspiech.

[96] KOTZ, D. Disk-directed I/O for MIMD multiprocessors. ACM Trans. Comput. Syst. 15, 1 (Feb. 1997), 41–74.

[97] KRIEGER, O., AND STUMM, M. HFS: A performance-oriented flexible file system based on building-block compositions. ACM Trans. Comput. Syst. 15, 3 (Aug. 1997), 286–321.

[98] KROLL, B., AND WIDMAYER, P. Distributing a search tree among a growing number of processors. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (Minneapolis, Minnesota, USA, 24–27 May 1994), pp. 265–276.

[99] KRONENBERG, N. P., LEVY, H. M., AND STRECKER, W. D. VAXcluster: a closely-coupled distributed system. ACM Trans. Comput. Syst. 4, 2 (May 1986), 130–146.

[100] KUBIATOWICZ, J., BINDEL, D., CHEN, Y., EATON, P., GEELS, D., GUMMADI, R., RHEA, S., WEATHERSPOON, H., WEIMER, W., WELLS, C., AND ZHAO, B. OceanStore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS (Cambridge, Massachusetts, USA, Nov. 2000).

[101] LAGOZE, C. A secure repository design for digital libraries. D-Lib Magazine (Dec. 1995). urn:hdl:cnri.dlib/december95-lagoze.

[102] LAGOZE, C., LYNCH, C. A., AND DANIEL, JR., R. The Warwick framework: A container architecture for aggregating sets of metadata. Computer Science Technical Report TR96-1593, Cornell University, June 1996.

[103] LAGOZE, C., AND PAYETTE, S. An infrastructure for open-architecture digital libraries. Tech. Rep. TR98-1690, Department of Computer Science, Cornell University, 1998. urn:hdl:ncstrl.cornell/TR98-1690.

[104] LAGOZE, C., SHAW, E., DAVIS, J. R., AND KRAFFT, D. B. Dienst: Implementation reference manual. Tech. rep., Cornell University, May 1995. urn:hdl:ncstrl.cornell/TR95-1514.


[105] LARSON, P. Dynamic hashing. BIT (1978), 184–201.

[106] LARSON, P. Dynamic hash tables. Communications of the ACM 31, 4 (Apr. 1988), 446–457.

[107] LARSON, P. Linear hashing with separators—a dynamic hashing scheme achieving one-access retrieval. ACM Trans. Database Syst. 13, 3 (Sept. 1988), 366–388.

[108] LEE, E. K., AND KATZ, R. H. Performance consequences of parity placement in disk arrays. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) IV (1991), pp. 190–199.

[109] LEE, E. K., AND THEKKATH, C. A. Petal: Distributed virtual disks. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) VII (Oct. 1996), pp. 84–92.

[110] LEINER, B. M. Metrics and the digital library. urn:hdl:cnri.dlib/july98-editorial, July 1998. Guest editorial.

[111] LEINER, B. M. The NCSTRL approach to open architecture for the confederated digital library. D-Lib Magazine (Dec. 1998). urn:hdl:cnri.dlib/december98-leiner.

[112] LISKOV, B., GHEMAWAT, S., GRUBER, R., JOHNSON, P., SHRIRA, L., AND WILLIAMS, M. Replication in the Harp file system. In Proceedings of the 1991 Symposium on Operating Systems Principles (SOSP) (Oct. 1991), pp. 226–238.

[113] LITWIN, W. Linear hashing: A new tool for file and table addressing. In Proceedings of Very Large Databases (Montreal, Canada, 1980).

[114] LITWIN, W., MENON, J., AND RISCH, T. LH* schemes with scalable availability. Research Report RJ 1021 (91937), IBM Almaden, May 1998.

[115] LITWIN, W., MENON, J., RISCH, T., AND SCHWARZ, T. J. E. Design issues for scalable availability LH* schemes with record grouping. In DIMACS Workshop on Distributed Data and Structures (Princeton University, May 1999), Carleton Scientific.

[116] LITWIN, W., AND NEIMAT, M.-A. High-availability LH* schemes with mirroring. In Proceedings of the Conference on Cooperative Information Systems (COOPIS ’96) (Jan. 1996).

[117] LITWIN, W., NEIMAT, M.-A., LEVY, G., NDIAYE, S., AND SECK, T. LH*s: A high-availability and high-security scalable distributed data structure. In Proceedings of IEEE RIDE ’97 (1997).

[118] LITWIN, W., NEIMAT, M.-A., AND SCHNEIDER, D. A. LH*—linear hashing for distributed files. In Proceedings of SIGMOD ’93 (May 1993), ACM, pp. 327–336.

[119] LITWIN, W., NEIMAT, M.-A., AND SCHNEIDER, D. A. RP*: A family of order-preserving scalable distributed data structures. In Proceedings of Very Large Databases (Sept. 1994).

[120] LITWIN, W., NEIMAT, M.-A., AND SCHNEIDER, D. A. LH*—a scalable, distributed data structure. ACM Transactions on Database Systems 21, 4 (Dec. 1996), 480–525.

[121] LITWIN, W., AND SCHWARZ, T. J. E. LH*rs: A high-availability scalable distributed data structure using Reed-Solomon codes. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (Dallas, Texas, USA, 2000).

[122] LUSTRE DEVELOPERS. Lustre home. WWW page, 2000. http://www.lustre.org.


[123] MALY, K., NELSON, M. L., AND ZUBAIR, M. Smart Objects, Dumb Archives: A user-centric, layered digital library framework. D-Lib Magazine 5, 3 (Mar. 1999). urn:doi:10.1045/march99-maly.

[124] MATTHEWS, J. N., ROSELLI, D., COSTELLO, A. M., WANG, R. Y., AND ANDERSON, T. E. Improving the performance of log-structured file systems with adaptive methods. In Proceedings of Symposium on Operating Systems Principles (SOSP) 16 (Saint-Malo, France, Oct. 1997), pp. 238–251.

[125] MCCOY, K. VMS File System Internals. Digital Press, 1990.

[126] MCKUSICK, M. K. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3 (1984), 181–197.

[127] MCKUSICK, M. K., AND KOWALSKI, T. J. Fsck ffs—the UNIX file system check program. In BSD System Manager’s Manual. Computer Systems Research Group, University of California, Berkeley, July 1985, ch. 3.

[128] MEALLING, M. URI resolution services necessary for URN resolution. RFC 2483, Jan. 1999. http://www.ietf.org/rfc/rfc2483.txt.

[129] MEALLING, M., AND DANIEL, R. The Naming Authority Pointer (NAPTR) DNS resource record. RFC 2915, Sept. 2000. http://www.ietf.org/rfc/rfc2915.txt.

[130] MENON, J., MATTSON, D., AND NG, S. Distributed sparing for improved performance of disk arrays. Tech. Rep. RJ 7943, IBM Almaden Research Center, 1991.

[131] MENON, J., ROCHE, J., AND KASSON, J. Floating parity and data disk arrays. Journal of Parallel and Distributed Computing 17 (1993), 129–139.

[132] MERCHANT, A., AND YU, P. Design and modeling of clustered RAID. In Proceedings of the International Symposium on Fault Tolerant Computing (1992).

[133] MILLER, E. An introduction to the Resource Description Framework. D-Lib Magazine (May 1998). urn:hdl:cnri.dlib/may98-miller.

[134] MOATS, R. URN syntax. RFC 2141, May 1997. http://www.ietf.org/rfc/rfc2141.txt.

[135] MOORE, R., BARU, C., RAJASEKAR, A., LUDAESCHER, B., MARCIANO, R., WAN, M., SCHROEDER, W., AND GUPTA, A. Collection-based persistent digital archives - part 1. D-Lib Magazine 6, 3 (Mar. 2000). urn:doi:10.1045/march2000-moore-pt1.

[136] MOORE, R., BARU, C., RAJASEKAR, A., LUDAESCHER, B., MARCIANO, R., WAN, M., SCHROEDER, W., AND GUPTA, A. Collection-based persistent digital archives - part 2. D-Lib Magazine 6, 4 (Apr. 2000). urn:doi:10.1045/april2000-moore-pt2.

[137] MORAN, J., SANDBERG, R., COLEMAN, D., KEPECS, J., AND LYON, B. Breaking through the NFS performance barrier. In Proceedings of the EUUG Spring 1990 (Apr. 1990).

[138] NELSON, M. L., MALY, K., SHEN, S. N. T., AND ZUBAIR, M. NCSTRL+: Adding multi-discipline and multi-genre support to the Dienst protocol using clusters and buckets. In Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, IEEE ADL ’98 (Santa Barbara, California, USA, 22–24 Apr. 1998), IEEE Computer Society, pp. 128–136. ISBN 0-8186-8464-X.

[139] NG, S. W. Crosshatch disk array for improved reliability and performance. In Proceedings of the 21st International Symposium on Computer Architecture (1994), pp. 255–264.

[140] NIEUWEJAAR, N., AND KOTZ, D. Performance of the Galley parallel file system. In Proceedings of IOPADS ’96: Input/Output for Parallel and Distributed Systems (Philadelphia, Pennsylvania, USA, 1996), pp. 83–94.

[141] OPEN SOFTWARE FOUNDATION. File systems in a distributed computing environment: A white paper. Tech. rep., Open Software Foundation, Cambridge, MA, 1991.

[142] ORACLE CORPORATION. Oracle Internet File System (iFS). Product Overview, 2000. http://www.oracle.com/database/options/ifs/iFSFO.html.

[143] OUSTERHOUT, J. K., CHERENSON, A., DOUGLIS, F., NELSON, M., AND WELCH, B. The Sprite network operating system. IEEE Computer (Feb. 1988), 23–36.

[144] OUSTERHOUT, J. K., DA COSTA, H., HARRISON, D., KUNZE, J. A., KUPFER, M., AND THOMPSON, J. G. A trace-driven analysis of the UNIX 4.2 BSD file system. Operating System Review 19, 4 (1985), 15–24.

[145] OUSTERHOUT, J. K., AND DOUGLIS, F. Beating the I/O bottleneck: A case for log-structured file systems. Operating Systems Review 23, 1 (1989), 11–27.

[146] PAEPCKE, A., BALDONADO, M. Q. W., CHANG, C.-C. K., COUSINS, S., AND GARCIA-MOLINA, H. Using distributed objects to build the Stanford digital library Infobus. IEEE Computer 32, 2 (Feb. 1999), 80–87.

[147] PAI, V. S., ARON, M., BANGA, G., SVENDSEN, M., DRUSCHEL, P., ZWAENEPOEL, W., AND NAHUM, E. Locality-aware request distribution in cluster-based network servers. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) VIII (San Jose, California, USA, 2–7 Oct. 1998), pp. 205–216.

[148] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. IO-Lite: A unified I/O buffering and caching system. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (OSDI ’99) (New Orleans, Louisiana, USA, 22–25 May 1999), pp. 15–28.

[149] PASKIN, N. DOI: Current status and outlook. D-Lib Magazine (May 1999). urn:doi:10.1045/may99-paskin.

[150] PASKIN, N. Toward unique identifiers. Proceedings of the IEEE 87, 7 (July 1999).

[151] PAYETTE, S., AND LAGOZE, C. Flexible and Extensible Digital Object and Repository Architecture (FEDORA). In Second European Conference on Research and Advanced Technology for Digital Libraries (Heraklion, Crete, 21–23 Sept. 1998), vol. 1513 of Lecture Notes in Computer Science, Springer. http://www2.cs.cornell.edu/payette/papers/ECDL98/FEDORA.html.

[152] PIERCE, P. A concurrent file system for a highly parallel mass storage system. In Fourth Conference on Hypercube Concurrent Computers and Applications (1989), pp. 155–160.

[153] PLAXTON, C. G., RAJARAMAN, R., AND RICHA, A. W. Accessing nearby copies of replicated objects in a distributed environment. In ACM Symposium on Parallel Algorithms and Architectures (June 1997), pp. 311–320.

[154] POWELL, A. Resolving DOI based URNs using Squid: An experimental system at UKOLN. D-Lib Magazine (June 1998). urn:hdl:cnri.dlib/june98-powell.


[155] RABIN, M. O. Efficient dispersal of information for security, load balancing, and fault tolerance. J. ACM 36, 2 (Apr. 1989), 335–348.

[156] RAMAKRISHNAN, K. K., BISWAS, P., AND KAREDLA, R. Analysis of file I/O traces in commercial computing environments. Performance Evaluation Review 20, 1 (June 1992), 78–90.

[157] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network. In SIGCOMM ’01 (San Diego, California, USA, 27–31 Aug. 2001), Association for Computing Machinery, pp. 161–172.

[158] RIEDEL, E., AND GIBSON, G. A. Active disks - remote execution for network-attached storage. Tech. Rep. CMU-CS-97-198, Carnegie Mellon University, Dec. 1997.

[159] ROSENBLUM, M. The Design and Implementation of a Log-Structured File System. Kluwer, 1995.

[160] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. In Proceedings of the Thirteenth Symposium on Operating Systems Principles (SOSP) (Pacific Grove, California, USA, 13–16 Oct. 1991), pp. 1–15.

[161] RUEMMLER, C., AND WILKES, J. An introduction to disk drive modeling. IEEE Computer 27, 3 (Mar. 1994), 17–29.

[162] SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implementation of the Sun Network File System. In Proceedings of the Summer USENIX Conference (1985), pp. 119–130.

[163] SANTOS, J. R., AND MUNTZ, R. Performance analysis of the RIO multimedia storage system with heterogeneous disk configurations. In Proceedings of the Sixth ACM International Conference on Multimedia (Bristol, United Kingdom, 13–16 Sept. 1998), Association for Computing Machinery, pp. 303–308.

[164] SANTOS, J. R., MUNTZ, R., AND BENSON, S. A parallel disk storage system for real-time multimedia applications. In Special Issue on Multimedia Computing Systems of the International Journal of Intelligent Systems (1998).

[165] SATYANARAYANAN, M. A study of file sizes and functional lifetimes. In Proceedings of Symposium on Operating Systems Principles (SOSP) 8 (Dec. 1981), Association for Computing Machinery, pp. 96–108.

[166] SATYANARAYANAN, M. Coda: A highly available file system for a distributed workstation environment. In Proceedings of the Second IEEE Workshop on Workstation Operating Systems (Sept. 1989).

[167] SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Computer (May 1990), 9–20.

[168] SCHINDLER, J., AND GANGER, G. R. Automated disk drive characterization. In Proceedings of the 2000 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) (Santa Clara, California, USA, June 2000), pp. 112–113.

[169] SCHULZE, M., GIBSON, G., KATZ, R., AND PATTERSON, D. How reliable is a RAID? In Proceedings of the 34th IEEE Computer Society International Conference (1989), pp. 118–123.

[170] SELTZER, M. File System Performance and Transaction Support. PhD thesis, University of California, Berkeley, 1992.

[171] SELTZER, M., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. An implementation of a log-structured file system for UNIX. In 1993 Winter USENIX Conference (San Diego, CA, 25–29 Jan. 1993).


[172] SELTZER, M., SMITH, K. A., BALAKRISHNAN, H., CHANG, J., MCMAINS, S., AND PADMANABHAN, V. File system logging versus clustering: A performance comparison.

[173] SELTZER, M., AND STONEBRAKER, M. Read optimized file system designs: A performance evaluation. In Proceedings of the 7th International Conference on Data Engineering (Kobe, Japan, 8–12 Apr. 1991).

[174] SHAFER, K., WEIBEL, S., JUL, E., AND FAUSEY, J. Introduction to Persistent Uniform Resource Locators. WWW page, 1996. http://purl.oclc.org/OCLC/PURL/INET96.

[175] SHIERS, J. Building a database for the LHC: The exabyte challenge. In Proceedings of the Fifteenth IEEE Symposium on Mass Storage Systems (College Park, Maryland, USA, 23–26 Mar. 1998), pp. 385–395.

[176] SHRIVER, E. A formalization of the attribute mapping problem. Tech. Rep. HPL-SSP-95-10 Rev D, Hewlett-Packard Laboratories, Palo Alto, California, USA, 15 July 1996.

[177] SHRIVER, E. Performance Modeling for Realistic Storage Devices. PhD thesis, Department of Computer Science, New York University, New York, New York, USA, May 1997.

[178] SHRIVER, E., MERCHANT, A., AND WILKES, J. An analytical behaviour model for disk drives with readahead caches and request reordering. In Proceedings of the 1998 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) (June 1998), pp. 182–191.

[179] SIBERT, O., BERNSTEIN, D., AND VAN WIE, D. Securing the content, not the wire, for information commerce. Tech. rep., InterTrust Technologies Corp., 1996. http://www.intertrust.com/architecture/stc.html.

[180] SOLLINS, K. Architectural principles of uniform resource name resolution. RFC 2276, Jan. 1998. http://www.ietf.org/rfc/rfc2276.txt.

[181] SOLLINS, K., AND MASINTER, L. Functional requirements for uniform resource names. RFC 1737, Dec. 1994. http://www.ietf.org/rfc/rfc1737.txt.

[182] SOLTIS, S. R. The Design and Implementation of a Distributed File System based on Shared Network Storage. PhD thesis, University of Minnesota, Aug. 1997.

[183] STODOLSKY, D., AND GIBSON, G. A. Parity logging: overcoming the small write problem in redundant disk arrays. In Proceedings of the 1993 Symposium on Operating Systems Principles (SOSP) (1993).

[184] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In SIGCOMM ’01 (San Diego, California, USA, 27–31 Aug. 2001), Association for Computing Machinery.

[185] STORAGE NETWORKING INDUSTRY ASSOCIATION. A dictionary of storage networking terminology. World Wide Web page, 2000. http://www.snia.org/English/Resources/Dictionary.html.

[186] SUN, S. X., AND LANNOM, L. Handle system overview. IETF Draft, Aug. 2000. http://www.ietf.org/internet-drafts/draft-sun-handle-system-05.txt.

[187] SUN, S. X., REILLY, S., AND LANNOM, L. Handle system namespace and service definition. IETF Draft, Aug. 2000. http://www.ietf.org/internet-drafts/draft-sun-handle-system-def-03.txt.


[188] SZALAY, A. S., KUNSZT, P., THAKAR, A., GRAY, J., SLUTZ, D., AND BRUNNER, R. J. Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (Austin, Texas, USA, May 2000).

[189] TALAGALA, N., ASAMI, S., ANDERSON, T., AND PATTERSON, D. Tertiary disk: Large scale distributed storage. Tech. rep., University of California at Berkeley, 1996.

[190] TALAGALA, N., ASAMI, S., PATTERSON, D. A., HART, D., AND FUTERNICK, B. The art of massive storage: A case study of a web archive. IEEE Computer (Feb. 1999). Submitted for publication.

[191] THE NAMING SYSTEM VENTURE. ReiserFS. World Wide Web, Jan. 2001. http://www.reiserfs.org/.

[192] THEKKATH, C. A., MANN, T., AND LEE, E. K. Frangipani: A scalable distributed file system. In Proceedings of Symposium on Operating Systems Principles (SOSP) 16 (Oct. 1997), pp. 224–237.

[193] THIELE, H. The Dublin Core and Warwick Framework. D-Lib Magazine (Jan. 1998). urn:hdl:cnri.dlib/january98-thiele.

[194] UPFAL, E., AND WIGDERSON, A. How to share memory in a distributed system. J. ACM 34, 1 (Jan. 1987), 116–127.

[195] URN IMPLEMENTORS. Uniform resource names: A progress report. D-Lib Magazine (Feb. 1996). urn:hdl:cnri.dlib/february96-urn_implementors.

[196] UYSAL, M., ALVAREZ, G. A., AND MERCHANT, A. A modular, analytical throughput model for modern disk arrays. In Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2001) (Cincinnati, Ohio, USA, 15–18 Aug. 2001).

[197] VAHDAT, A., ANDERSON, T., AND DAHLIN, M. Active Names: Programmable location and transport of wide-area resources. Tech. rep., University of California at Berkeley, 1998.

[198] VINGRALEK, R., BREITBART, Y., AND WEIKUM, G. Distributed file organization with scalable cost/performance. In Proceedings of ACM-SIGMOD (May 1994), pp. 253–264.

[199] VIRTUAL INTERFACE ARCHITECTURE CONSORTIUM. Virtual Interface Architecture. World Wide Web page, 2000. http://www.viarch.org/.

[200] VOGELS, W. File system usage in Windows NT 4.0. In Proceedings of the 17th Symposium on Operating Systems Principles (SOSP) (Charleston, USA, 12–15 Dec. 1999), Association for Computing Machinery, pp. 93–109.

[201] WANG, R. Y., AND ANDERSON, T. E. xFS: A wide area mass storage file system. Tech. rep., University of California at Berkeley, 1993.

[202] WILKES, J. The Pantheon storage-system simulator. Tech. Rep. HPL-SSP-95-14, Hewlett-Packard Laboratories, Palo Alto, California, USA, May 1996.

[203] WILKES, J., GOLDING, R., STAELIN, C., AND SULLIVAN, T. The HP AutoRAID hierarchical storage system. In Proceedings of Symposium on Operating Systems Principles (SOSP) 15 (Dec. 1995), pp. 96–108.

[204] WILSON, P. R., JOHNSTONE, M. S., NEELY, M., AND BOLES, D. Dynamic storage allocation: A survey and critical review. In Proceedings of the 1995 International Workshop on Memory Management (Kinross, Scotland, United Kingdom, 27–29 Sept. 1995).


[205] WINTER, R., AND AUERBACH, K. Giants walk the earth: The 1997 VLDB survey. Database Programming and Design 10, 9 (Sept. 1997).

[206] WINTER, R., AND AUERBACH, K. The big time: The 1998 VLDB survey. Database Programming and Design 11, 8 (Aug. 1998).

[207] WORTHINGTON, B. L., GANGER, G. R., PATT, Y. N., AND WILKES, J. On-line extraction of SCSI disk drive parameters. In Proceedings of the 1995 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) (Ottawa, Canada, May 1995), pp. 146–156.
