1
Archival Storage for Digital Libraries
Arturo Crespo, Hector Garcia-Molina
Stanford University
Post on 21-Dec-2015
2
Motivation
Digital information already lost:
– Early NASA records
– U.S. Census Information
– Toxic Waste records
Decay Time for common media:
– Magnetic Tapes: 10-20 years
– CD-ROM: 5-50 years
– Hard Drive: 3-5 years
Obsolescence of Digital Media is even faster
3
Preservation of Digital Objects
Data Preservation
Meaning Preservation
Our work only addresses Data Preservation
4
A Case Study: Stanford/MIT CSTR
[Figure: the Stanford and MIT sites]
CSTR Scenario:
– Need for on-line access to documents
– But also for long-term archival of documents
5
Is This a Solved Database Problem?
Database systems can reliably store objects
However:
– Need same or compatible system
– Migration is problematic
Our architecture coordinates database systems; it does not replace them.
6
Contribution
An architecture and algorithms for:
– Long-term Archival Storage of Digital Objects
– Allowing on-line access to Digital Objects
– Preserving data as technology and organizations evolve
7
Key Concepts
Signatures as Object Handles
Deletions are not allowed
Reliability is achieved through Replication
Layered Architecture
Awareness Everywhere
Disposable Auxiliary Structures
8
Signatures as Object Handles
Object Handles identify objects
– Internal to the Digital Library Repository
– Users may need high level naming facilities
– Traditional approaches
Signatures:
– Checksum or CRC of the object
signature = f(object)
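As a minimal sketch of a signature used as a handle, the snippet below derives the handle from the object's bytes alone. The choice of SHA-256 and the name `signature` are assumptions for illustration; the slides only call for a checksum or CRC of the object.

```python
import hashlib

def signature(obj: bytes) -> str:
    """Return a content-derived handle: the hex digest of the object's bytes."""
    # SHA-256 is an assumption; any sufficiently wide checksum plays the same role.
    return hashlib.sha256(obj).hexdigest()

report = b"Stanford CSTR technical report"
copy_at_other_site = bytes(report)  # an identical copy held elsewhere
revised = b"Stanford CSTR technical report, revised"

handle = signature(report)
same = signature(copy_at_other_site) == handle  # copies share a handle
different = signature(revised) != handle        # changed content gets a new handle
```

Because the handle depends only on content, any site can compute it without coordination, and a modified object is simply a different object with a different handle.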
9
Properties of Signatures as Object handles
Each site can generate handles independently
Handles can be reconstructed from the object
Copies automatically have same handle
Objects with different content have high probability of having different handles
Cannot modify objects
[Figure: objects labeled with their signatures: s1, s2, s1, s3, s2, s4 = s1]
10
Signature Collisions
A very rare event if signatures are 128 bits or more.
Assumes uniform distribution of handles and objects bigger than signatures

Collection Size    Probability of Having Collisions    Signature Size
10^7               10^-9                               76 bits
10^7               10^-24                              128 bits
10^10              10^-18                              128 bits
10^10              10^-57                              256 bits
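The orders of magnitude in the table can be sanity-checked with the standard birthday-bound approximation, p ≈ n² / 2^(b+1) for n objects and b-bit signatures, under the same uniform-distribution assumption. This is a rough check, not necessarily the calculation behind the slide, and `collision_probability` is a hypothetical helper name.

```python
def collision_probability(collection_size: int, signature_bits: int) -> float:
    """Birthday-bound approximation p ~ n^2 / 2^(b+1); valid while p is small."""
    return collection_size ** 2 / 2 ** (signature_bits + 1)

# The four (collection size, signature size) pairs from the table.
for n, bits in [(10**7, 76), (10**7, 128), (10**10, 128), (10**10, 256)]:
    print(f"n={n:.0e} bits={bits}: p ~ {collision_probability(n, bits):.1e}")
```

The approximation lands within roughly an order of magnitude of each table entry, which is enough to confirm that collisions are negligible at 128 bits and above.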
11
No Deletions
Objects are never (voluntarily) deleted
This simplifies many things:
– Distinguishes between deleted and corrupted objects (improving reliability)
– Easier recovery from failures
However, it complicates others:
– “Wasted” space
– Version management
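One way to picture the no-deletion rule is a write-once store with no delete operation, so a handle that no longer resolves can only mean a fault. The class below is a hypothetical sketch under that assumption (with SHA-256 standing in for the signature function), not the repository's actual interface.

```python
import hashlib

class ArchiveStore:
    """Hypothetical write-once store: objects can be added but never deleted."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        handle = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(handle, data)  # re-adding a copy is a no-op
        return handle

    def check(self, handle: str) -> str:
        """Classify a previously issued handle as 'ok', 'corrupted', or 'lost'."""
        data = self._objects.get(handle)
        if data is None:
            return "lost"  # never "deleted": that operation does not exist
        if hashlib.sha256(data).hexdigest() != handle:
            return "corrupted"
        return "ok"
```

Since there is no legitimate way for an object to disappear, the recovery logic never has to ask whether a missing object was removed on purpose.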
12
No Deletions
No deletion rule is natural in Digital Libraries
Wasted space is not critical as:
– Storage cost is low
– Only archiving library objects, not all possible data
14
Versions
• How can we find the latest version?
[Figure: a Version Object holds tuples that point to Object O1 (Version 1) and Object O2 (Version 2, latest)]
16
Reliability Service
Long term persistence is achieved by replication
Sites establish Replication Agreements to maintain copies of objects in a given Replication Group
[Figure: the Stanford and MIT sites in a Replication Group]
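A rough sketch of what maintaining copies within a Replication Group could look like: each site is modeled as a `{handle: data}` dictionary, and a pass over the group repairs any site whose copy is missing or fails its signature check. The function `reconcile`, the dictionary model, and SHA-256 as the signature are all assumptions for illustration; the slides do not specify the algorithm.

```python
import hashlib

def signature(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def reconcile(group: dict[str, dict[str, bytes]]) -> int:
    """Bring every site in a group up to the union of valid objects.

    `group` maps a site name to that site's {handle: data} store.
    Returns the number of copies written.
    """
    # Collect one verified copy of each object held anywhere in the group.
    good: dict[str, bytes] = {}
    for store in group.values():
        for handle, data in store.items():
            if signature(data) == handle:
                good.setdefault(handle, data)

    copies = 0
    for store in group.values():
        for handle, data in good.items():
            # With no deletions, a missing or damaged copy is always a fault to repair.
            if store.get(handle) != data:
                store[handle] = data
                copies += 1
    return copies
```

The signature check before copying matters: it keeps a corrupted replica from being propagated to healthy sites.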
22
Layered Architecture
User Access
Security and Accounting
Import
Metadata and Indexing
Reliability
Complex Object
Identity
Object Store
Data Store
23
Awareness Everywhere
Awareness services: standing orders, subscriptions, alerts, etc.
Critical for Digital Libraries
Should be part of the interface of every layer.
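To illustrate awareness as part of a layer's interface, the sketch below gives a layer `subscribe` and `notify` methods so clients can register for events such as a new object arriving. The class name `AwareLayer`, the event names, and the callback signature are hypothetical; the slides do not define this interface.

```python
from typing import Callable

Callback = Callable[[str, str], None]  # receives (layer name, object handle)

class AwareLayer:
    """Hypothetical layer that lets clients subscribe to its events."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: dict[str, list[Callback]] = {}

    def subscribe(self, event: str, callback: Callback) -> None:
        self._subscribers.setdefault(event, []).append(callback)

    def notify(self, event: str, handle: str) -> None:
        # Deliver the event to every subscriber registered for it.
        for callback in self._subscribers.get(event, []):
            callback(self.name, handle)

alerts: list[tuple[str, str]] = []
object_store = AwareLayer("Object Store")
object_store.subscribe("object_added", lambda layer, h: alerts.append((layer, h)))
object_store.notify("object_added", "abc123")
```

If every layer exposes the same two methods, a subscription made at a high layer can be implemented by subscribing to the layer beneath it.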
24
Disposable Auxiliary Structures
Auxiliary Structures can be reconstructed from the Digital Objects
– Avoid potential inconsistencies
– Easier to migrate objects
Example: Index of disk-ids to object handles
[Figure: the Identity Layer keeps an Index that maps each Handle to the disk (D1 or D2) holding the object]
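The index example can be sketched as follows: because handles are recomputable from object content, the handle-to-disk index can be thrown away and rebuilt by scanning the disks. The function `rebuild_index`, the in-memory model of disks, and SHA-256 as the signature are assumptions for illustration.

```python
import hashlib

def rebuild_index(disks: dict[str, list[bytes]]) -> dict[str, str]:
    """Recreate the handle -> disk-id index by scanning the objects on each disk."""
    index: dict[str, str] = {}
    for disk_id, objects in disks.items():
        for data in objects:
            handle = hashlib.sha256(data).hexdigest()
            index.setdefault(handle, disk_id)  # keep the first location found
    return index

disks = {"D1": [b"object one"], "D2": [b"object two"]}
index = rebuild_index(disks)
```

Nothing in the index is authoritative, so it never needs to be migrated or kept consistent across failures; only the objects themselves do.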
25
Related Work
Other Digital Library architectures
Report of the task force on preserving digital information
Petal and Frangipani projects
COLD systems