preserving digital media

Preserving Digital Media
Dr. Robert TANSLEY
Digital Media Systems Lab, HP Labs
© 2005 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.


July 1, 2005
Preserving Digital Media: The Problem
• Much of humanity’s intellectual output is now digital
• Much of it is at risk of being lost forever
• Or of being left beyond viable use
• An unsolved problem that HP and HP’s customers care about
− U.S.
− U.K.
− Israel
− Japan
− India
Bristol
• Architect of DSpace digital asset management system
• Research focus: Long-term preservation of digital media
• TIME Magazine Archive
• ARKive
• DSpace
Digital Remastering of TIME Magazine
Problem statement
• To digitize all TIME magazines from 1923 to 2003 (document structure analysis, article reconstruction for ~500K pages).
• Automatically extract articles and related metadata for web presentation (http://www.timearchive.com)
− Accurate enough to deliver an excellent reading experience to a paying user
• Extract images and page layout for future uses
Cross-page link
Challenges
• High accuracy requirements (99.95% text, 100% of articles)
− Well beyond today’s out-of-the-box commercial OCRs
• Article extraction and zone/page tagging techniques
− Must detect advertisements and other non-article content.
− Must deal with:
    • multiple articles per page,
    • multiple pages per article,
    • unknown column/row article layout,
    • insets, etc.
• Computing + storage requirements:
− 500K pages x ~3 min/page = ~1,000 days of CPU
− 500K pages x 30MB/page = ~15 TB storage
• Combination of automated extraction / manual correction
− Requires identifying “error suspects” and implementing extensive “sanity checks” (to identify rescan/reprocessing candidates)
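The compute and storage figures above follow from simple arithmetic. A minimal sketch in Python, assuming "~3m/page" means roughly three minutes of CPU time per page and decimal units (1 TB = 1,000,000 MB); the function names are illustrative and not part of the original system.

```python
# Back-of-envelope check of the slide's compute and storage estimates.
# Assumptions: ~3 minutes of CPU per page, decimal units (1 TB = 1,000,000 MB).

PAGES = 500_000          # ~500K scanned pages (from the slide)
MINUTES_PER_PAGE = 3     # "~3m/page", read here as minutes
MB_PER_PAGE = 30         # 30 MB per scanned page (from the slide)


def cpu_days(pages: int, minutes_per_page: float) -> float:
    """Total single-CPU processing time, in days."""
    return pages * minutes_per_page / (60 * 24)


def storage_tb(pages: int, mb_per_page: float) -> float:
    """Total storage, in decimal terabytes."""
    return pages * mb_per_page / 1_000_000


if __name__ == "__main__":
    print(f"CPU time: ~{cpu_days(PAGES, MINUTES_PER_PAGE):,.0f} days")
    print(f"Storage:  ~{storage_tb(PAGES, MB_PER_PAGE):,.0f} TB")
```

The CPU figure comes out slightly above 1,000 days, which matches the rounded value on the slide; in practice the work would be spread across many machines.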
The End Result
• TIME launched the archive in Dec ’04
− You can read all articles (samples and subscription) at http://www.timearchive.com
• Complete structural accuracy of all articles with high text accuracy.
• Leading-edge end-to-end digitization system
− Resilient process control of recognition components
− Custom recognition algorithms as necessary
− Off-the-shelf components where available
− Adaptive orchestration of components.
− Extension of previous work on
national gallery
National Gallery (NG)
• UK’s premier art gallery and museum
• Collection of (esp.) western European fine art dating to early 1900s
    • E.g. Raphael, Titian, Rembrandt, da Vinci, Monet, van Gogh
• 2,300 paintings in the collection
    • Small by many standards – but an acknowledged quality collection
    • Virtually all paintings owned by NG (on behalf of nation)
    • Virtually all paintings on show (not in storage)
• UK’s most popular tourist site
    • ~5 M visitors per year
• Location: Trafalgar Sq. London
    • London’s most visited location
• Free access to main galleries
    • Partial govt. funding + endowments + donations
    • Critical income from (NFP) commercial activities
        − Special exhibitions, shop, publishing
3D pictures? An example of HP Labs + NG
• For most purposes paintings are treated as 2D (i.e. flat)
• For capture they are lit with 45° lighting for even illumination
• But the 3D structure of paintings can reveal much about them
− incisions (for outlining)
− impasto (thick paint texture)
− panel deformation (climate)
• 3D structure can be revealed by “raking light” imaging
• BUT – this is static
increasing the photorealism of texture maps and adding interactive computer generated lighting. Developed by HPL
• The Dome is designed to light a painting from a number of different positions during image capture including low-angle (raking light) positions
Print-on-Demand (POD)
• Previously NG shops offered only limited prints and postcards
• Full collection digitised at extreme resolution & colour accuracy
• Entire collection now available on HP-developed POD system
"Over the past few decades a vast treasury of wildlife images has been steadily accumulating, yet no one has known its full extent - or its gaps - and no one has had a comprehensive way of getting
access to it. ARKive will put that right. It will become an invaluable tool for all concerned with
the well-being of the natural world." Sir David Attenborough CH FRS
ARKive
The world’s digital library of images & recordings of endangered species, digitally preserved, and freely accessible to all online
Requirements
An end-to-end system for media capture, storage, management and publishing
Rich Media Challenges
• Scale of media
– 40MB/s video, 60-100MB stills, 100TB
• Complexity of metadata
– Descriptive, rights, technical, provenance
• Mix of media types
– Video, audio, stills, & structured text
• Storage mgt & preservation
• Repurposing of media
– Many formats & bandwidths for publication
catalogue, select and edit high quality media
2. A large scale Media Vault – Core media services to store, manage, preserve and transcode media & metadata
3. Media Publishing systems – To repurpose and present the media to different audiences
Media Digitisation
Video Editing
Media Production : Integrated Tools
• An integrated web application tool set for media acquisition, cataloguing & workflow, handling rich media, video, audio, image, text, structured data …
Media Vault : Software Services
[Diagram: Media Vault software services; recoverable labels: Media Publishing, Export/Sync, Workflow]
“the most ambitious and closely watched program of its kind”
- Chronicle of Higher Education
Numerous research projects extending
Vibrant open source community: dozens of developers and researchers
‘born digital’
• Organizations must preserve this investment
• Many types of asset
− Documents
− Datasets
− Rich media (audio, video)
− Teaching material
− Interactive content
− Software
DSpace Approach
• Initial phase
− Joint HP and MIT team to build DSpace digital asset management system 1.0
− HP-funded 2-year project, 2000-2002
− “Breadth-first” attempt
− DSpace platform version 1.0 released November 2002
• Current phase
− Open source community maintained and developed
− Seven ‘committers’ from different organisations (incl. HP)
− HP working with dozens of developers and researchers from around the globe to add depth
Current DSpace features
• End-to-end Digital Asset Management system
• Open source, standards-based
• Multiple creation/import options, including Web UI, XML batch import, Web Services, Java API
• Index/search of metadata and full-text
• Easy integration with other systems via OAI-PMH, SRW/U (Z39.50), Web Services or Java API
• Can store any file format, including multi-file formats like complex Web pages
• File formats recorded for future migration to newer formats
• Flexible and powerful authorization system
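To illustrate the OAI-PMH integration point listed above: a harvester asks a repository for records with a plain HTTP request carrying a `verb` parameter. A minimal sketch, assuming a hypothetical repository base URL (`https://repository.example.org/oai/request`) and a made-up helper name; `ListRecords`, `metadataPrefix`, `set`, `from` and the `oai_dc` prefix come from the OAI-PMH 2.0 protocol itself. The sketch only builds the request URL and does not contact any server.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI endpoint (hypothetical in this sketch).
    set_spec and from_date are the optional selective-harvesting arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    if from_date is not None:
        params["from"] = from_date  # e.g. "2005-07-01"
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the real OAI base URL of a repository.
    print(list_records_url("https://repository.example.org/oai/request",
                           from_date="2005-07-01"))
```

The response to such a request is an XML document of records, which the harvesting system parses and loads into its own index.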
community
• Focal point for research and development in:
− Long-term digital preservation
− Scalable repositories
− Managing complex digital objects
• Draw on wide pool of expertise
• Avoid ‘lock-in’
• Widespread adoption assists longevity
DSpace 2.0
• HP working with the open source community on improving DSpace architecture
backup, replication and restoration
− Richer representation information: more than just formats
• More modular; better support for ‘plug-ins’
HP Labs and China MoE Digital Museum Project
• University museums in China are digitising a wide variety of objects
• Problem is to manage this variety of categories of object and geographic location
• HP Labs, China MoE & universities collaboration to use and enhance DSpace to build a large, distributed digital asset management system
− 100 universities, ~2 TB per university museum
Summary
• HP Labs has much experience in preserving digital media
• Including creation
− TIME magazine archive
− National Gallery
• And archiving
− ARKive
− DSpace