TRANSCRIPT
Centro Ricerche e Innovazione Tecnologica
TAPE workshop on the curation and preservation of audiovisual collections
University of Glasgow, Scotland, UK
Monday 12th – Friday 16th May 2008
Giorgio Dimino
RAI Research Centre
Storage and repositories
Reference Model for an Open Archival Information System (OAIS)
Consultative Committee for Space Data Systems (CCSDS)
This document is a technical Recommendation for use in developing a broader consensus on what is required for an archive to provide permanent, or indefinite long-term, preservation of digital information.
This Recommendation establishes a common framework of terms and concepts which comprise an Open Archival Information System (OAIS). It allows existing and future archives to be more meaningfully compared and contrasted. It provides a basis for further standardization within an archival context and it should promote greater vendor awareness of, and support of, archival requirements.
OAIS environment model
[Diagram: the OAIS archive and the three roles in its environment]
• Producer: provides content to the archive
• Consumer: uses the archive content
• Management: decides the archive's strategic objectives
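Data vs. Information: OAIS definition
[Diagram: a data object, interpreted using its representation information, yields an information object]
• Data object (e.g. 10010111): what we store
• Representation information: knowledge about data interpretation
• Information object: what we want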
Video data formats
• Uncompressed raster formats (YUV and RGB)
  – Standard definition 4:2:2 video at 270 Mb/s requires about 120 GB per hour
• Lossless compression (e.g. JPEG2000)
  – Variable efficiency, on average ½ of the uncompressed size
• Compressed formats (e.g. MPEG2, MPEG4, VC1, DV)
  – Compression depends on the final quality expected; typical bit rates range from 3 Mb/s to 50 Mb/s, up to a 100 times reduction
• The “Representation Information” needed to interpret compressed formats is generally extremely complex
  – Rendering is done using specific software or hardware
  – The written specification must be seen only as a last-resort disaster recovery option
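The storage figures quoted for the video formats follow directly from the bit rate. The sketch below is an illustration only: the function name is hypothetical and decimal units (1 GB = 10^9 bytes) are assumed. It converts a bit rate in Mb/s into gigabytes per hour of material, using the rates quoted on this slide.

```python
def gigabytes_per_hour(bit_rate_mbps: float) -> float:
    """Storage needed for one hour of video, in decimal gigabytes.

    Hypothetical helper for illustration; assumes 1 Mb = 10**6 bits
    and 1 GB = 10**9 bytes.
    """
    bits_per_hour = bit_rate_mbps * 1_000_000 * 3600
    return bits_per_hour / 8 / 1_000_000_000


if __name__ == "__main__":
    # Bit rates taken from the slide: 270 Mb/s uncompressed SD,
    # 50 Mb/s and 3 Mb/s as the ends of the compressed range.
    for label, rate in [
        ("SD 4:2:2 uncompressed", 270),
        ("compressed, upper end", 50),
        ("compressed, lower end", 3),
    ]:
        print(f"{label}: {gigabytes_per_hour(rate):.2f} GB per hour")
```

At 270 Mb/s this gives 121.5 GB per hour, which the slide rounds to 120 GB.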
Video quality, some considerations
• Digital master
  – Result of digitisation of analogue tapes. It becomes the new master to replace the corresponding analogue tape
  – It should be stored at maximum quality
• Publication master
  – If keeping all the digital masters on line is too expensive, a surrogate master can in some cases be generated at lower quality; all subsequent publication copies are then derived from it by transcoding
• Publication version
  – The version that is delivered to the user of a particular service (an archive can offer several services based on the same content)
• Viewing version
  – A version at reduced quality used for content selection
OAIS Information Package
• Content Information
  – Data object
  – Representation information
• Preservation Description Information
  – Provenance
  – Context
  – Reference
  – Fixity
• Packaging Information
• Descriptive Information
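As a rough illustration of how the parts of an information package fit together, the sketch below models Content Information and Preservation Description Information as a small Python data structure. The class and field names are hypothetical, and using a SHA-256 digest as the fixity value is an assumption; OAIS does not prescribe a particular checksum.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    # Content Information
    data_object: bytes            # the stored bits
    representation_info: str      # how to interpret them, e.g. a format name
    # Preservation Description Information
    provenance: str
    context: str
    reference: str
    fixity: str = field(init=False)  # digest computed when the package is created

    def __post_init__(self) -> None:
        # Assumption: SHA-256 is used as the fixity mechanism.
        self.fixity = hashlib.sha256(self.data_object).hexdigest()

    def is_intact(self) -> bool:
        """Recompute the digest and compare it with the stored fixity value."""
        return hashlib.sha256(self.data_object).hexdigest() == self.fixity


if __name__ == "__main__":
    package = InformationPackage(
        data_object=b"\x97",  # 10010111, the example bits from the earlier slide
        representation_info="hypothetical format description",
        provenance="digitised from analogue tape",
        context="example collection",
        reference="example-0001",
    )
    print("intact:", package.is_intact())
```

A periodic consistency check would call `is_intact()` on every stored package and restore from another copy when it returns `False`.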
Video packaging (wrappers)
• SMPTE MXF
• MPEG2 TS
• Microsoft ASF
• AVI
• Apple QuickTime
• Adobe Flash (FLV, SWF)
For reference see http://www.digitalpreservation.gov/formats/fdd/descriptions.shtml
OAIS collaboration diagram
OAIS functional entities
Storage technologies
• Data tapes
  – IBM LTO Ultrium 4: 800 GB
  – Quantum DLT-S4: 800 GB
  – Sony SAIT: 800 GB
  – StorageTek T10000: 500 GB
• Hard disk
  – Up to 1 TB per 3.5” disk
  – Several RAID configurations possible
• Solid State Disks
  – Still expensive but becoming interesting
  – Capacity still lower than hard disks: 128 GB (announced products), 2.5”
• Optical disks
  – DVD RW: 9 GB
  – Blu-ray: 50 GB
Some remarks
• The choice of storage technologies depends on many factors, including:
  – Total amount of data
  – Expected increase rate
  – Desired throughput
  – Access performance
  – Data security
• No storage medium can last forever
• No technology can be considered 100% reliable
• Never keep single copies!
• Obsolescence occurs very rapidly
• Data migration must be considered part of the management process, not an emergency operation
Digital vs. analogue archive (bookshelf meters required for 1000 hours of audio data)
[Bar chart, vertical axis from 0 to 30 bookshelf meters. Media compared: 1/4" short tape (news), 1/4" standard tape, 33" vinyl, CD, DAT, 9 GB hard disk, 20 GB DLT, 35 GB DLT. Annotations: "800 GB today", "1 TB today".]
Flat storage
[Diagram labels: User, Front end, Selection, Content, Data base, File server, NAS]
Storage hierarchy
[Diagram labels: RAM, Solid State Disk, Fast Hard Disk/RAID, RAID, Tape (robot); tiers marked On line and Near-Line]
Hierarchical Storage Management (HSM)
[Diagram labels: User, Front end, Selection, Content, Data base, File server, HD cache, Tape robotic storage]
Federated storage (GRID)
• Based on GRID concepts of distributed computing and a file system over a WAN
• Multiple self-contained storage nodes, interconnected
  – Each storage node contains its own storage medium, microprocessor, indexing capability, and management layer, generally based on commodity PCs
• Advantages
  – Fault tolerance
  – Scalability
  – Throughput
• Examples: Google File System, Apache Hadoop
Basic functionalities
• Virtualization
  – The user sees a single file system
• Data replication
  – The system automatically manages the desired redundancy
• Direct access to data
  – Data move from storage node to client without intermediation
• Dynamic reconfiguration
  – Nodes can be switched on and off while the system is in operation
• Automatic load balancing
  – Exploiting data replication and direct node access
Data blocking and replication
• A data file is divided into fixed-length blocks
• Each block is replicated n times on different nodes
[Diagram: a file split into Blocks 1 to 4, with the copies of each block spread across Nodes 1 to 5]
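The blocking and replication idea can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular system does it: the function names are hypothetical, and the round-robin placement rule is chosen only because it is the simplest way to guarantee that the n copies of a block land on n different nodes.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide a file into fixed-length blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], n: int) -> dict[int, list[str]]:
    """Assign each block to n distinct nodes.

    Assumption: simple round-robin placement. Real systems use more
    elaborate policies that also consider load and network topology.
    """
    if n > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        block: [nodes[(block + k) % len(nodes)] for k in range(n)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"data " * 8, block_size=10)
    nodes = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]
    for block, holders in place_replicas(len(blocks), nodes, n=3).items():
        print(f"Block {block + 1}: {', '.join(holders)}")
```

Because every block lives on n different nodes, up to n - 1 of the nodes holding a given block can fail without that block being lost.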
Architecture
[Diagram: the user sends a filename to a NameNode and receives a nodes list; data chunks are then exchanged with the DataNodes, which are grouped into Cluster 1 and Cluster 2]
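The interaction suggested by this architecture (filename in, nodes list out, then data chunks read directly from the nodes) can be mimicked with a toy in-memory model. All class and method names below are hypothetical and do not correspond to the Hadoop or Google File System APIs; writing data through the name node is a simplification made to keep the sketch short.

```python
class DataNode:
    """A storage node holding chunks in memory (stand-in for a real server)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.chunks: dict[tuple[str, int], bytes] = {}

    def read(self, filename: str, index: int) -> bytes:
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.chunks[(filename, index)]


class NameNode:
    """Keeps only the index: which nodes hold which chunk of which file."""

    def __init__(self) -> None:
        self.index: dict[str, list[list[DataNode]]] = {}

    def store(self, filename, chunks, nodes, replicas):
        # Simplification: placement and writing are done here in one step.
        locations = []
        for i, chunk in enumerate(chunks):
            holders = [nodes[(i + k) % len(nodes)] for k in range(replicas)]
            for node in holders:
                node.chunks[(filename, i)] = chunk
            locations.append(holders)
        self.index[filename] = locations

    def locate(self, filename: str) -> list[list[DataNode]]:
        """Return, for each chunk, the list of nodes that hold a copy."""
        return self.index[filename]


def read_file(name_node: NameNode, filename: str) -> bytes:
    """Ask the name node for the nodes list, then read chunks directly."""
    data = b""
    for i, holders in enumerate(name_node.locate(filename)):
        for node in holders:
            if node.online:
                data += node.read(filename, i)
                break
        else:
            raise IOError(f"no live replica for chunk {i}")
    return data


if __name__ == "__main__":
    nodes = [DataNode(f"node{i}") for i in range(4)]
    name_node = NameNode()
    name_node.store("clip.mxf", [b"ab", b"cd", b"ef"], nodes, replicas=2)
    nodes[0].online = False  # one node switched off while in operation
    print(read_file(name_node, "clip.mxf"))
```

The name node never carries the content itself in the read path, which is what lets throughput scale with the number of data nodes.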
Digital Asset Management (1)
• A software system that implements all the archive management policies
• Provides the archive administrator with the necessary tools to:
  – Monitor the preservation state of the media
  – Restore backup copies when primary media are damaged
  – Monitor the use of the storage
  – Monitor software/hardware failures
  – Define ingestion and access policies
• Should provide support for technology/system migration
Digital Asset Management (2)
• Provides the necessary functionalities to implement the ingestion workflow
  – Receive the SIP (or a batch of SIPs)
  – Analyse the SIP, verify that all the vital metadata are valid
  – Assign UMIDs
  – Transcode SIP into AIP
  – Generate proxies (low resolution video, key frames)
  – Provide content documentation
• Provides the functionalities to implement the access workflow
  – Verify that the user has access rights
  – Provide content selection functionalities (search, retrieval and browsing)
  – Verify content-associated rights
  – Transcode AIP into DIP (it can depend on user request)
  – Deliver the DIP
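The two workflows can be outlined as a pair of functions. This is a heavily simplified sketch: the metadata field names are hypothetical, a random UUID stands in for a real SMPTE UMID, and transcoding and proxy generation are reduced to placeholders.

```python
import uuid

# Hypothetical set of "vital" metadata; a real archive defines its own.
REQUIRED_METADATA = ("title", "producer", "duration_s")


def ingest(sip: dict) -> dict:
    """Ingestion: validate the SIP metadata, assign an identifier, build an AIP."""
    metadata = sip.get("metadata", {})
    missing = [key for key in REQUIRED_METADATA if metadata.get(key) in (None, "")]
    if missing:
        raise ValueError(f"SIP rejected, missing metadata: {missing}")
    return {
        "id": str(uuid.uuid4()),      # stand-in for a UMID
        "metadata": dict(metadata),
        "essence": sip["essence"],    # a real system would transcode here
        "proxies": [],                # low resolution video, key frames
    }


def deliver(aip: dict, user: str, authorised_users: set) -> dict:
    """Access: check the user's rights, then build a DIP from the AIP."""
    if user not in authorised_users:
        raise PermissionError(f"{user} has no access rights")
    return {
        "id": aip["id"],
        "essence": aip["essence"],    # a real system would transcode per request
        "delivered_to": user,
    }


if __name__ == "__main__":
    sip = {
        "metadata": {"title": "News 1975", "producer": "RAI", "duration_s": 1800},
        "essence": b"...",
    }
    aip = ingest(sip)
    dip = deliver(aip, "editor", authorised_users={"editor"})
    print(dip["id"] == aip["id"])
```

Rejecting an incomplete SIP at the door is cheaper than discovering missing documentation when someone later tries to retrieve the item.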
OAIS Functions of Archival Storage
Business rights management
• A business rights management (BRM) system manages the usage rights associated with content
• Without an automated BRM system, the reuse of content can be slowed down by manual rights clearing operations
• Depending on the type of archive, it can be convenient to have BRM closely coupled with DAM
Digital archive design (1)
• Analyse and state clearly your business requirements
  – What is your archive's primary goal?
  – Who are your users (producers, consumers), and what are their needs?
• Assess your content
  – Number of items
  – Conservation status
  – Increase rate
  – Usage
Digital archive design (2)
• Select archive video formats and quality
  – Target archived quality depends on foreseen usage and preservation issues
• Define the AIP (Archival Information Package)
  – Video coding
  – File formats
  – Associated metadata
• Estimate storage requirements
  – Amount of data
  – Level of security of data
  – Increase rate
  – Input/output performance
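A first estimate of storage requirements combines the amount of data, the increase rate and the level of security (number of copies). The sketch below is a back-of-the-envelope calculation only; the function name and all the example figures are hypothetical, and decimal units are assumed.

```python
def storage_needed_tb(
    hours_now: float,
    hours_added_per_year: float,
    years: int,
    bit_rate_mbps: float,
    copies: int,
) -> float:
    """Rough storage estimate in decimal terabytes.

    Assumptions: constant bit rate for the archive format, linear growth
    of the collection, and every item kept in `copies` full copies.
    """
    gb_per_hour = bit_rate_mbps * 3600 / 8 / 1000
    total_hours = hours_now + hours_added_per_year * years
    return total_hours * gb_per_hour * copies / 1000


if __name__ == "__main__":
    # Hypothetical archive: 10 000 hours today, 1 000 hours added per year,
    # planned over 5 years, archived at 50 Mb/s, two copies of everything.
    estimate = storage_needed_tb(10_000, 1_000, 5, 50, 2)
    print(f"{estimate:.0f} TB")
```

With these example figures the estimate comes to 675 TB; halving the bit rate or keeping a single copy halves it, which is why format choice and security level are decided before storage is sized.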
Digital archive design (3)
• Define ingestion workflow and SIP
  – Ingestion procedures are particularly critical if your content needs digitization and restoration
• Define access workflow and DIP
  – Access is heavily dependent on proper documentation and retrieval tools
  – Properly dimension throughput, which is affected by video bitrate and transcoding from AIP to DIP
• Define archive maintenance procedures
  – Consistency check
  – Media replacement
  – Disaster recovery
Digital archive design (4)
Consider migration:
• Storage technology
  – Media capacity follows Moore’s law… but sometimes there is a technology leap (e.g. from tape library to hard disk arrays)
• Coding formats
  – Compression schemes become more efficient, allowing greater bit saving at a given quality
  – Older formats become obsolete
  – Transcoding generally implies possible loss of quality
• Software/hardware
  – Proprietary formats often pose upgrade constraints
Digital archive design (5)
• Consider the need to interface with other systems
  – Federated libraries
  – Account systems
  – Production
  – Digital rights management
• … and finally design or commission a system