Pairtrees for object storage


Upload: john-kunze

Post on 25-May-2015


John Kunze and Stephen Abrams, California Digital Library (CDL)

A pairtree maps ids to paths, two characters at a time

A pairtree is a filesystem hierarchy that uses an identifier string to derive an object directory (or folder) location

• The derivation takes successive pairs of characters and creates a succession of directories, called a pairpath
    ab2def3 ⇒ ab/2d/ef/3/
• A pairpath ends at the directory containing an object’s files; most systems do a variation of this (is variation needed?)
• Reverse the mapping to find all ids/objects in a pairtree; pairpath termination rules permit variable-length ids
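The pair-splitting step described above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function name to_pairpath is hypothetical, and the sketch assumes the identifier has already had any problematic characters converted.

```python
def to_pairpath(identifier: str) -> str:
    """Split an identifier into successive two-character directory names.

    Hypothetical helper for illustration; a trailing odd character
    becomes a one-character directory at the end of the path.
    """
    if not identifier:
        raise ValueError("identifier must not be empty")
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs) + "/"


print(to_pairpath("ab2def3"))  # ab/2d/ef/3/
```

Because the last directory may hold one or two characters, identifiers of any length map to a path without padding.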

Pre-converting problematic characters

Some identifier characters are inconvenient or illegal in filenames and must be hex-encoded (e.g., * → ^2a)
    id: what-the-*@?#! → what-the-^2a@^3f#! ⇒ wh/at/-t/he/-^/2a/@^/3f/#!

But to keep paths short, 3 common chars are converted to 3 rare chars (at cost of complexity): / → =   : → +   . → ,
    id: ark:/13030/xt12t3 → ark+=13030=xt12t3 ⇒ ar/k+/=1/30/30/=x/t1/2t/3/
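The two conversion steps can be sketched as follows. The poster itself only shows * and ? being hex-encoded; the rest of the HEX_ENCODE set below, and the treatment of bytes outside visible ASCII, are assumptions based on a reading of the Pairtree specification and should be checked against it. The name clean_id is hypothetical.

```python
# Characters to hex-encode as ^hh. Only '*' and '?' appear in the poster's
# example; the remaining members are an assumption taken from the spec.
HEX_ENCODE = set('"*+,<=>?\\^|')

# The three single-character conversions named in the poster.
SINGLE_CHAR = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_id(identifier: str) -> str:
    """Hex-encode problematic bytes, then apply the three short conversions."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Assumption: anything outside visible ASCII (0x21-0x7e) is encoded too.
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    # Hex-encoding runs first, so any '=', '+' or ',' left in the result
    # can only have come from the single-character conversion.
    return "".join(out).translate(SINGLE_CHAR)


print(clean_id("what-the-*@?#!"))      # what-the-^2a@^3f#!
print(clean_id("ark:/13030/xt12t3"))   # ark+=13030=xt12t3
```

Hex-encoding a literal =, + or , in the input is what keeps the mapping reversible, since those three characters then always stand for /, : and . in the cleaned form.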

The deadly embrace

• Digital repositories tend to require a surrender of storage transparency that creates unhealthy system dependency
• Internally objects are often broken up so that they can be difficult to piece together in case of trouble

Fig. 1. Object storage should not need a fearful entanglement with software. Since objects have to be parked in a filesystem before repository software upgrade, what if we left them in there and built our repositories around them?

Pairtree credits and details

Pairtree specification:
    www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt
    www.cdlib.org/inside/diglib/pairtree/pairtreespec.html

Authors from CDL and University of Michigan (UM): Martin Haye, Erik Hetzner, John Kunze, Mark Reyes, and Cory Snavely; many thanks to Stephen Abrams, Sebastien Korner, Brian Tingle, et al.

Summary

Pairtree is the thinnest smear we can add to our very well-understood filesystems and their universal tools (the universal “API”) to create a very well-understood, platform-independent object storage substrate

Pairtree is not a complete repository system, but it is complete for object storage and makes it easier to build systems and to share objects between institutions

Why pairs of characters?

Taking two chars at a time balances path depth and fanout (number of possible entries in any directory)
• Example: ab2def3 ⇒ ab/2d/ef/3/
• Each pair, letters+digits, has 36x36 possibilities

Compared to taking one char at a time
• Only 36 possibilities, but path depth grows rapidly
• Example: ab2def3 ⇒ a/b/2/d/e/f/3/

At another extreme, taking seven characters at a time
• Short paths, but 78 billion (36^7) possible items
• Example: ab2def3 ⇒ ab2def3/
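The trade-off can be checked with a little arithmetic. The sketch below assumes the same 36-character alphabet (letters plus digits) and the seven-character example id; the helper name fanout_and_depth is hypothetical.

```python
import math

ALPHABET_SIZE = 36  # letters + digits, as in the example above


def fanout_and_depth(chunk: int, id_length: int) -> tuple:
    """Return (max entries per directory, number of directory levels)."""
    fanout = ALPHABET_SIZE ** chunk
    depth = math.ceil(id_length / chunk)
    return fanout, depth


for chunk in (1, 2, 7):
    fanout, depth = fanout_and_depth(chunk, id_length=7)
    print(f"{chunk} char(s) at a time: fanout {fanout:,}, depth {depth}")
```

For a seven-character id this gives a fanout of 36 at depth 7, 1,296 at depth 4, and 78,364,164,096 at depth 1, which is where the "78 billion" figure comes from.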

For further information

Please contact jak@ucop.edu or [email protected]
For information on CDL’s Preservation Program, see http://www.cdlib.org/programs/digital_preservation.html

Jim B L

Pairtree origins include
• Prototype: UCSF tobacco control documents and CDL digitized books
• Early production: digitized books for UM and Hathi Trust

cyocum

Objects in a pairtree

A pairtree is especially useful if, for each contained object, all of the object’s parts, and nothing but its parts, are enclosed in the object’s directory

Import such a pairtree and, knowing nothing about the objects’ structure and semantics, you can reliably
• Enumerate all objects and their identifiers
• Produce any object by requested id
• Maintain and back it up with ordinary OS tools
• Rebuild the collection in case of database corruption simply by walking the pairtree

To walk a pairtree requires knowing path termination rules
• A pairpath terminates when you reach a file or reach a directory name with 1 char or more than 2 chars

    ab/
    \--- cd/
         |--- foo/
         |    |   README.txt
         |    |   thumbnail.gif
         |    |--- master_images/
         |    |    |   ...
         |    |
         |    \--- gh/
         \--- e/
              \--- bar/
                   |   metadata
                   |   54321.wav
                   |   index.html

Fig. 2. Example pairtree containing two objects: abcd and abcde. The first object is enclosed in directory foo/, the second in bar/. While foo/ does not subsume e/ at the same level, by enclosure, it does subsume the gh/ underneath it.
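A walk that applies the termination rule might look like the sketch below. It is one reading of the rule, not the reference implementation: a two-character directory continues the pairpath, a one-character directory ends it (and contributes the id's last character), and a file or a longer-named directory means the current directory is where a pairpath ends. The names walk_pairtree and demo are hypothetical, and the ids yielded are still in cleaned form.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def walk_pairtree(root: Path, prefix: str = "") -> Iterator[Tuple[str, Path]]:
    """Yield (cleaned id, directory where its pairpath ends) under root."""
    ends_here = False
    for child in sorted(root.iterdir()):
        name = child.name
        if child.is_dir() and len(name) == 2:
            # Two-character directory: the pairpath continues.
            yield from walk_pairtree(child, prefix + name)
        elif child.is_dir() and len(name) == 1:
            # One-character directory: last character of an odd-length id.
            yield prefix + name, child
        else:
            # A file or a longer-named directory: object content lives here,
            # so we do not descend into it (foo/ subsumes gh/).
            ends_here = True
    if ends_here and prefix:
        yield prefix, root


def demo() -> list:
    """Build the tree from Fig. 2 in a temp directory and list its ids."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "ab" / "cd" / "foo" / "gh").mkdir(parents=True)
        (root / "ab" / "cd" / "foo" / "README.txt").write_text("readme")
        (root / "ab" / "cd" / "e" / "bar").mkdir(parents=True)
        (root / "ab" / "cd" / "e" / "bar" / "metadata").write_text("meta")
        return sorted(obj_id for obj_id, _ in walk_pairtree(root))


print(demo())  # ['abcd', 'abcde']
```

Note that gh/ is never mistaken for part of an id, because the walk stops descending once it meets foo/.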

Sample software implementation

http://search.cpan.org/~jak/Pairtree-0.2/lib/File/Pairtree.pm

A Perl module that implements two mappings: id2ppath() takes an id into a pairpath and ppath2id() performs the inverse mapping.
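For readers who do not use Perl, the inverse mapping can be sketched in Python. This is not a port of the module and makes no claim about its internals; pairpath_to_id is a hypothetical name, and the sketch assumes the path was produced with the ^hh hex-encoding and the three single-character conversions described earlier.

```python
import re

# Undo the three single-character conversions: = → /, + → :, , → .
_REVERSE = str.maketrans({"=": "/", "+": ":", ",": "."})


def pairpath_to_id(pairpath: str) -> str:
    """Recover the original identifier from a pairpath (illustrative only)."""
    # 1. Drop the directory separators to get the cleaned identifier.
    cleaned = pairpath.replace("/", "")
    # 2. Restore the three common characters.
    restored = cleaned.translate(_REVERSE)
    # 3. Decode each ^hh sequence back to its byte, then read as UTF-8.
    raw = re.sub(
        rb"\^([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        restored.encode("utf-8"),
    )
    return raw.decode("utf-8")


print(pairpath_to_id("ar/k+/=1/30/30/=x/t1/2t/3/"))    # ark:/13030/xt12t3
print(pairpath_to_id("wh/at/-t/he/-^/2a/@^/3f/#!"))    # what-the-*@?#!
```

The separators are stripped before the = → / conversion so that slashes belonging to the identifier are not confused with directory boundaries.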