Digital | Curation | Centre
The Digital Curation Centre Experience
(Science data & CCLRC experience)
David Giaretta & David Corney
26 July 2005 DCC/DPC Workshop on Cost Models for preserving digital assets
Outline
• Science data characteristics
• CCLRC experience
• Costs
• Benefits
• Trends
• Conclusions
Science Data Characteristics
• Mostly numbers – objects often complex and interrelated
• Representation not Presentation
  – Not just to be looked at by humans (i.e. emulation of associated software usually not enough)
• Often needs processing
  – Different levels of processing & trends of access
  – On-the-fly processing from raw
• Often freely available (e.g. after 1 year)
• Often large volumes
  – Automated systems
• Unforgiving
  – Need to beware of “junk” science
• Needs to be usable in current tools (i.e. emulation is not enough)
CCLRC Recent New Users & Potential New Users
• National Crystallography Service, Southampton University (2 TB/yr)
• VIRGO Consortium (3 TB/yr?)
• Integrative Biology (15 TB/yr?)
• WASP (Astronomy) (30 TB/yr?)
• BBSRC ? (50 TB/yr?)
• Diamond (1 PB/yr?)
• GRID-PP (1 PB/yr)
Datastore Usage by Family
[Chart: datastore usage in Tbytes (0 to 250) by user family, Jun-97 to Apr-05; one series per family code, e.g. CR-EPSRC, CR-NERC, CR-PPARC, RAL-SCI, RAL-TECH, SSD-EOD, SSD-PPAR]
Data Growth per period
[Chart: data growth per period in Tbytes (-10 to 80), Jun-97 to Apr-05]
Expected future demand
Year                        2005   2006   2007   2008
LHC bandwidth (MB/sec)        50    250    400    600
LHC data volume (PB)         0.3    0.6    1.2    3.4
Diamond data volume (PB)       0      0    1.0    1.0
CCLRC data volume (PB)       0.2    0.5    0.7    1.0
External (PB)               0.05   0.10    0.2    0.2
Total (PB)                  0.55    1.2    3.1    5.6
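The Total (PB) figures appear to be the sum of the LHC, Diamond, CCLRC and External data volumes for each year. A minimal Python sketch of that cross-check, assuming the column order 2005 to 2008 as read from the slide; the names `DEMAND_PB`, `SLIDE_TOTAL_PB` and `yearly_totals` are illustrative and not from the presentation:

```python
# Expected demand in PB per year, transcribed from the slide (2005 to 2008).
YEARS = [2005, 2006, 2007, 2008]
DEMAND_PB = {
    "LHC": [0.3, 0.6, 1.2, 3.4],
    "Diamond": [0.0, 0.0, 1.0, 1.0],
    "CCLRC": [0.2, 0.5, 0.7, 1.0],
    "External": [0.05, 0.10, 0.2, 0.2],
}
SLIDE_TOTAL_PB = [0.55, 1.2, 3.1, 5.6]  # the Total (PB) row as stated


def yearly_totals(demand):
    """Sum the per-source volumes column by column (one column per year)."""
    return [round(sum(column), 2) for column in zip(*demand.values())]


if __name__ == "__main__":
    for year, total, stated in zip(YEARS, yearly_totals(DEMAND_PB), SLIDE_TOTAL_PB):
        status = "matches slide" if abs(total - stated) < 1e-9 else "differs"
        print(year, total, status)
```

The LHC bandwidth row is left out because it is a rate in MB/sec, not a stored volume.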
Actual Growth 1997-2003
[Chart: Cumulative Data Volume (GB) and Actual Growth (GB) per quarter, Jun-97 to Dec-03; y-axis Data Volume (GB), -20000 to 100000; x-axis Time (years)]
Atlas Storage: Predicted Demand (TB)
[Chart: upper bound and lower bound data volume (TB), 2003 to 2006; y-axis 0 to 4000 TB]
Capacity & performance - Hardware
• Hardware
  – Defines both performance and capacity
  – Changing fast but well understood (buy as late as possible)
  – Tied into technology futures of manufacturers and HEP community
  – Currently hardware is effectively “infinitely” scalable
• Future estimated storage capacity & bandwidth for a 6000 slot robot:

Year                  2003/04    2006/7       2008/9
Technology            9940B      Titanium 1   Titanium 2
Tape capacity         200 GB     500 GB       1000 GB
Bandwidth (MB/sec)    30 - 40    80 - 100     ~200
Capacity (PB)         1.2 PB     3 PB         6 PB
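The capacities quoted for the 6000 slot robot are consistent with slot count multiplied by cartridge capacity. A short Python sketch of that arithmetic, assuming every slot holds a data cartridge and decimal units (1 PB = 1,000,000 GB); the names `TAPE_GB` and `robot_capacity_pb` are illustrative only:

```python
SLOTS = 6000  # robot slot count stated on the slide

# Tape technology and per-cartridge capacity in GB, as listed on the slide.
TAPE_GB = {"9940B": 200, "Titanium 1": 500, "Titanium 2": 1000}


def robot_capacity_pb(slots, tape_gb):
    """Full-robot capacity in PB, assuming all slots are filled."""
    return slots * tape_gb / 1_000_000


if __name__ == "__main__":
    for technology, gb in TAPE_GB.items():
        print(f"{technology}: {robot_capacity_pb(SLOTS, gb):.1f} PB")
```

Under these assumptions capacity scales only with cartridge size, which matches the slide's point that the robot itself need not change between tape generations.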
Data Growth
- observatory archives growing as detectors grow
- world area of 3m+ (sq.m.)
- largest detectors (Mpix)
[Chart: 1970 to 2000 on the x-axis, log scale from 0.1 to 1000 on the y-axis; series labelled CCDs and Glass]
- VISTA will have a Gpixel array
[Diagram: ADS system layout, dated Thursday, 04 November 2004]
• Tape devices: STK 9310 with 8 x 9940 tape drives
• Brocade FC switches: ADS_switch_1 and ADS_Switch_2, 4 drives to each switch
• Hosts:
  – ermintrude, florence, zebedee, dougal (AIX, dataserver)
  – brian (AIX, flfsys)
  – mchenry1 (AIX, test flfsys)
  – basil (AIX, test dataserver)
  – dylan (AIX, import/export)
  – buxton (SunOS, ACSLS)
  – ADS0CNTR (Redhat, counter)
  – ADS0PT01 (Redhat, pathtape)
  – ADS0SB01 (Redhat, SRB interface)
• Storage: array1, array2, array3, array4; catalogue and cache for the production system; catalogue and cache for the test system
• User interfaces: SRB Inq; S commands; MySRB; ADStape; ADS sysreq; admin commands (create, query); user pathtape commands; SRB pathtape commands; logging
• Connection types: physical connection (FC/SCSI); sysreq udp command; user SRB command; VTP data transfer; SRB data transfer; STK ACSLS command
• Note: all sysreq, vtp and ACSLS connections to dougal also apply to the other dataserver machines, but are left out for clarity
Tape Drive Performance as a Function of File Size
[Chart: tape drive throughput (MB/sec, 0 to 40) against file size (MB, 0 to 800)]
Types of costs
• Capture costs
• Storage costs
• Maintenance costs
• Access/Dissemination costs
• Total cost of ownership
Trends
• 1986 disk 5MB/£250 = 20KB/£
• 1994 disk/DAT 3GB/£3K = 1MB/£
• 1995 disk 420MB/£40 = 10MB/£
• 1998 disk 5GB/£250 = 20MB/£
• 2004 disk 60GB/£60 = 1000MB/£
Doubles every year
» Data from Byte new products
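The "doubles every year" summary can be checked against the quoted price points. A minimal Python sketch, assuming 1 GB = 1000 MB and using only the first (1986) and last (2004) entries to estimate an average doubling time; the helper names are illustrative:

```python
import math

# (year, megabytes, price in GBP) as quoted on the slide; 1 GB taken as 1000 MB.
DISK_PRICES = [
    (1986, 5, 250),
    (1994, 3000, 3000),
    (1995, 420, 40),
    (1998, 5000, 250),
    (2004, 60000, 60),
]


def mb_per_pound(megabytes, pounds):
    """Storage bought per pound, in MB."""
    return megabytes / pounds


def doubling_time_years(first, last):
    """Average doubling time of MB per pound between two (year, MB, GBP) points."""
    (year0, mb0, gbp0), (year1, mb1, gbp1) = first, last
    growth = mb_per_pound(mb1, gbp1) / mb_per_pound(mb0, gbp0)
    return (year1 - year0) / math.log2(growth)


if __name__ == "__main__":
    for year, mb, gbp in DISK_PRICES:
        print(year, round(mb_per_pound(mb, gbp), 2), "MB per GBP")
    years = doubling_time_years(DISK_PRICES[0], DISK_PRICES[-1])
    print(round(years, 2), "years per doubling")
```

On these five points the average doubling time comes out a little over one year, so the slide's rule of thumb is a fair rounding rather than an exact fit.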
• The expected cost of the Atomic Holographic DVR disc drive will be from $570 to $750 with the replacement discs for $45.
One 10 terabyte to 100 terabyte 3.5 in FEdisk
Issues
• System changes
• Collection migration to new systems
  – Descriptive Information
  – Finding Aids
Consideration of service quality
• bit preservation
• currently aiming to be self funding
• aim to cover costs only
• lower storage costs are dependent on increased usage
• increased usage is hard to predict
• current charge of £1k/TB/yr
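The link between usage and per-TB cost under a cover-costs-only policy can be sketched as a simple division. The Python below is illustrative only: it assumes the annual running cost is fixed regardless of volume, and the figure `HYPOTHETICAL_ANNUAL_COST` is an invented placeholder, not a number from the slides. Only the £1k/TB/yr charge comes from the presentation.

```python
def charge_per_tb(annual_cost_gbp, tb_stored):
    """Cost-recovery charge in GBP per TB per year for a fixed annual cost."""
    if tb_stored <= 0:
        raise ValueError("tb_stored must be positive")
    return annual_cost_gbp / tb_stored


def break_even_tb(annual_cost_gbp, charge_gbp_per_tb=1000):
    """TB that must be stored for the given charge to cover the annual cost."""
    return annual_cost_gbp / charge_gbp_per_tb


if __name__ == "__main__":
    HYPOTHETICAL_ANNUAL_COST = 500_000  # illustrative only, not from the slides
    for tb in (250, 500, 1000):
        print(tb, "TB ->", charge_per_tb(HYPOTHETICAL_ANNUAL_COST, tb), "GBP/TB/yr")
    print("break-even at 1k/TB/yr:", break_even_tb(HYPOTHETICAL_ANNUAL_COST), "TB")
```

Because the charge falls in inverse proportion to the volume stored, an error in the usage forecast feeds directly into the charge needed to break even.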
Costs and charging
• H/W Costs
  – Total ~ £1m every 4-5 years, equiv to ~ £250K/yr
  – H/W upgrades are costly – installation, configuration, test; and associated data migration - many months
  – Example component costs:
    • Robot (6000 slots) ~ £300K
    • Media £420K (@ £70 per unit)
    • Disk ~ 1.5K/TB? ~ £50K for 75TB commodity?
    • Tape drives £20K each. (est. T1s and T2s) Total ~ £200K for 10
    • Data Servers:
      – Linux: £3K each. Total ~ £30K for 10
      – AIX: £14K each. Total ~ £140K for 10
    • Network/switches ~ £50K
  – Numbers are the key to flexible performance – esp. data servers and tape drives.
• S/W Costs – Currently limited to staff development costs
• Staff 2.5 FTE: system manager + system developer + 0.5 operations staff
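The hardware figures can be cross-checked with a few lines of arithmetic. The Python sketch below is an illustration under stated assumptions: it counts both the Linux and the AIX data servers (the slide does not say whether they are alternatives), and it spreads the ~£1m total evenly over the replacement cycle. The names `COMPONENTS_K`, `annualised_k` and `media_units` are hypothetical.

```python
# Example component costs in GBP thousands, as listed on the slide.
COMPONENTS_K = {
    "robot": 300,
    "media": 420,
    "disk": 50,
    "tape_drives": 10 * 20,
    "linux_servers": 10 * 3,   # assumption: bought in addition to the AIX servers
    "aix_servers": 10 * 14,
    "network": 50,
}


def annualised_k(total_k, lifetime_years):
    """Spread a one-off purchase evenly over its replacement cycle."""
    return total_k / lifetime_years


def media_units(media_cost_gbp, unit_cost_gbp):
    """Number of media units bought for a given spend."""
    return media_cost_gbp // unit_cost_gbp


if __name__ == "__main__":
    print("component sum (GBP K):", sum(COMPONENTS_K.values()))
    print("annualised over 4 and 5 years (GBP K):",
          annualised_k(1000, 4), annualised_k(1000, 5))
    print("media units:", media_units(420_000, 70))
```

On these numbers the ~£250K/yr figure corresponds to the 4-year end of the replacement range, and £420K of media at £70 per unit is one unit per slot of the 6000 slot robot.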
ADS Running Costs 04/05. (Option 1).
[Pie chart: share of running costs]
• H/W maintenance: 11%
• S/W maintenance: 3%
• Hardware: 15%
• Network: 0%
• Other: 5%
• Staff costs: 66%
SRB-ADS architecture
[Diagram labels]
• SRB MCAT Database
• SRB MCAT Server
• SRB ADS Server
• SRB Client
• SRB Disk Server (Local Server)
• Atlas Data Store SRB ADS Server
  – SRB-ISIS, SRB-BADC and SRB-CCLRC server instances
  – Port 5600, Port 5601, Port 5602
Functional Diagram of BADC/APS
[Diagram labels]
• Actors: BADC Team; Authorising Authority (BADC or external data manager); BADC Support Team; External User
• Components: Administration; User Database; Metadata; Data
• Functions: Generate metadata; Ingest files; Volume plans; Format descr.; Discovery Search; Data Access via FTP & HTTP; Handle queries; Manage user accounts
• Flows: Corrected files re-ingested; Submitted files; BADC team add metadata; Harvest; New and updated files; Data submission authorisation; Authentication and authorisation; Registration details and updates; Query and response; Access request and authorisation; Report on user details; Query, update database; User details; Search & results; Data requests & data; Authentication
OAIS Functional Model
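[Diagram labels]
• Outside the archive: PRODUCER, CONSUMER, MANAGEMENT
• Functional entities: Ingest, Data Management, Archival Storage, Access, Administration, Preservation Planning
• Flows:
  – SIP (Producer to Ingest)
  – AIP (Ingest to Archival Storage; Archival Storage to Access)
  – Descriptive Info (Ingest to Data Management; Data Management to Access)
  – queries, result sets and orders (between Consumer and Access)
  – DIP (Access to Consumer)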
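The main package flow in the OAIS model (SIP in, AIP stored, DIP out) can be sketched as a toy data structure. The Python below is a simplified illustration, not an implementation of the standard: the class and method names, the `aip-N` identifier scheme and the descriptive fields are all assumptions made for the example, and Administration and Preservation Planning are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission Information Package, as delivered by a Producer."""
    content: bytes
    producer: str


@dataclass
class AIP:
    """Archival Information Package held in Archival Storage."""
    aip_id: str
    content: bytes
    descriptive_info: dict


@dataclass
class DIP:
    """Dissemination Information Package returned to a Consumer."""
    aip_id: str
    content: bytes


@dataclass
class Archive:
    storage: dict = field(default_factory=dict)    # Archival Storage: id -> AIP
    catalogue: dict = field(default_factory=dict)  # Data Management: id -> descriptive info

    def ingest(self, sip: SIP) -> str:
        """Ingest: turn a SIP into an AIP and record its descriptive info."""
        aip_id = f"aip-{len(self.storage) + 1}"  # hypothetical identifier scheme
        info = {"producer": sip.producer, "size": len(sip.content)}
        self.storage[aip_id] = AIP(aip_id, sip.content, info)
        self.catalogue[aip_id] = info
        return aip_id

    def query(self, producer: str) -> list:
        """Access: answer a Consumer query with a result set of AIP ids."""
        return [i for i, info in self.catalogue.items() if info["producer"] == producer]

    def order(self, aip_id: str) -> DIP:
        """Access: fulfil a Consumer order by building a DIP from the stored AIP."""
        aip = self.storage[aip_id]
        return DIP(aip.aip_id, aip.content)
```

A Producer would call `ingest`, and a Consumer would call `query` and then `order`, mirroring the queries, result sets and orders exchanged with Access in the diagram.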
BADC mapped to OAIS
[Diagram: the BADC functional diagram overlaid with OAIS functional entities]
• OAIS entities shown: Preservation Planning; Ingest; Access; Administration; Metadata Management; Archival Storage
• BADC components under those headings: User Database (Administration); Metadata (Metadata Management); Data (Archival Storage)
• Actors: BADC Team; Authorising Authority (BADC or external data manager); BADC Support Team; External User
• Functions: Generate metadata; Ingest files; Volume plans; Format descr.; Discovery Search; Data Access via FTP & HTTP; Handle queries; Manage user accounts
• Flows: Corrected files re-ingested; Submitted files; BADC team add metadata; Harvest; New and updated files; Data submission authorisation; Authentication and authorisation; Registration details and updates; Query and response; Access request and authorisation; Report on user details; Query, update database; User details; Search & results; Data requests & data; Authentication
Space Missions - special features
• Space missions are very expensive (100s of millions of dollars/euros)
  – Specialised hardware and software
• Information is usually the only thing left after the mission
• Data Exploitation costs are usually small
Costs of Preparation
• IUE Final Archive
  – IUE launched in 1978
  – Early example of long-term preservation
• 12 years after launch
  – New processing algorithms
  – New products
• Trends in access
  – New Formats
  – Translation of telemetry
  – Dictionaries for keywords in header
  – Capture of hand-written Observer logs
  – New catalogues
Cost Sharing
• Shared archival storage – economies of scale
• Shared discovery/access
• Shared Preservation Planning
  – Technology watch
  – Representation Information – Registries
• Abstraction and virtualisation
• Automated migration
  – Preservation Description Information - tools
• Bring benefits forward
  – Curation
  – Interoperability
• Distance in discipline is like Distance in time
Metrics for Benefits
• National/organisational pride
• Scientific
  – Number of references
  – Number of publications
  – Number of requests
• Financial
  – Sale of data
  – Investment in information systems
• Legal
  – Avoid penalties
Archive Research
[Chart: Ingest and Retrievals in Gbytes/Day (0 to 30) against Year, 1994.8 to 1999.3]
Already more retrieval than ingest!
- large fraction of astro-papers based on archives
- HST archive use growing faster than archive
Conclusions
• Preservation costs of any item:
  – Storage costs of the bits will fall
  – Migration can be automated (and done on request)
  – Costs to keep information usable (as in OAIS) could grow but can be shared
    • Sharing nationally and internationally
• Ingest costs can be reduced by forward planning by/agreements with producers
• Benefits can be brought forward
  – Link to widening Interoperability
• Benefits must be measured