TRANSCRIPT
The data challenge in astronomy
• archives
• technology
• problems
• solution
DCC conference Bath Andy Lawrence Sep 2005
the virtual observatory
astronomical archives
(1)
IT in astronomy : key areas
• (1) facility operations
• (2) facility output processing
• (3) shared supercomputers for theory
• (4) science archives
• (5) end-user tools
(1-3) : big bucks
(4-5) : smaller bucks but
- produces the final science output
- sets requirements for (1-2)
astronomical archives
• major archives growing at TB/yr
• issue not storage but management (curation)
• improving quality of data access and presentation
• needs specialist data centres
[figure: ESO Archive Volume (GB), log-scale axis from 1 to 10000]
end users
• increasing fraction of archive re-use
• increasing multi-archive use
• most download small files and analyse at home
• some users process whole databases
• reduction standardised; analysis home grown
needles in a haystack (Hambly et al 2001)
- faint moving object is a cool white dwarf
- may be solution to the dark matter problem
- but hard to find : one in a million
- even harder across multiple archives
failed stars
compare optical and infra-red
extra object is very cold
a "brown dwarf" or failed star
multi-wavelength views of a Supernova Remnant
Shocks seen in the X-ray
Heavy elements seen in the optical
Dust seen in the IR
Relativistic electrons seen in the radio
solar-terrestrial links
Coronal mass ejection imaged by space-based
solar observatory
Effect detected hours later by satellites and ground radar
background technology
(2)
dogs and fleas
• there is a very large dog
hardware trends
• ops, storage, bw : all 1000x/decade
 – can get 1TB IDE = $5K
 – backbones and LANs are Gbps
• but device bw 10x/decade
 – real PC disks 10MB/s; fibre channel SCSI poss 100MB/s
• and last mile problem remains
 – end-end b/w typically 10Mbps
[figure: ops per second/$, 1880 to 2000; doubling time shrinking from 7.5 to 2.3 to 1.0 years]
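The doubling times on the chart can be sanity-checked against the "per decade" figures on the slide; a quantity that doubles every year grows roughly 1000x in ten years. A minimal check:

```python
def growth_per_decade(doubling_years):
    """Factor of growth over 10 years, given a doubling time in years."""
    return 2 ** (10 / doubling_years)

# doubling every 1.0 years ~ the quoted 1000x/decade for ops, storage, bw
print(round(growth_per_decade(1.0)))  # 1024
# device bandwidth at only ~10x/decade implies a ~3 year doubling time
print(round(growth_per_decade(3.0), 1))
```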
operations on a TB database
• searching at 10 MB/s takes a day
 – solved by parallelism
 – but development non-trivial ==> people
• transfer at 10 Mbps takes a week
 – leave it where it is
• ==> data centres provide search and analysis services
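The timings above follow from back-of-envelope arithmetic; a quick sketch of the two bottlenecks (single-disk scan at 10 MB/s versus a typical end-to-end link at 10 Mbps):

```python
TB = 1e12  # bytes in a terabyte (decimal)

scan_seconds = TB / 10e6          # sequential scan at 10 MB/s
transfer_seconds = TB * 8 / 10e6  # network transfer at 10 Mbps

print(f"scan: {scan_seconds / 86400:.1f} days")          # ~1.2 days
print(f"transfer: {transfer_seconds / 86400:.1f} days")  # ~9.3 days
```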
network development
• higher level protocols ==> transparency
• TCP/IP message exchange
• HTTP doc sharing (web)
• grid suite CPU sharing
• XML/SOAP data exchange
==> service paradigm
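Under the service paradigm, a query becomes a parameterised request rather than a bulk download. The endpoint below is hypothetical, but the position/radius parameter pattern follows IVOA-style cone-search services:

```python
from urllib.parse import urlencode

# hypothetical data-centre endpoint (illustration only)
BASE = "http://datacentre.example.org/conesearch"

# ask the service to search near a sky position; only results come back
query = BASE + "?" + urlencode({"RA": 180.0, "DEC": -1.2, "SR": 0.05})
print(query)
# a client would fetch this with urllib.request.urlopen(query)
```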
next up on the internet
• workflow definition
• dynamic semantics (ontology)
• software agents
the problems
(3)
data growth
• astronomical data is growing fast
• but so is computing power
• so what's the problem?
(1) Heterogeneity
(2) End user delivery
(3) End user demand
data rich future
• heritage
– Schmidt, IRAS, Hipparcos
• current hits
 – VLT, SDSS, 2MASS, HST, Chandra, XMM, WMAP
• coming up :
 – UKIDSS, VISTA, ALMA, JWST, Planck, Herschel
• cross fingers :
 – LSST, ELT, LISA, Darwin, SKA, XEUS, etc.
• plus lots more
• issue is archive interoperability
 – need standards and transparent infrastructure
archive data rates
• map the sky : 0.1" x 16 bits = 100 TB
• process to find objects : billion row tables
• VISTA 100 TB/yr by 2007
• SKA datacubes 100PB/yr by 2020
• not a technical or financial problem
 – LHC doing 100PB/yr by 2007
• issue is logistic : data management
• need professional data centres
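The 100 TB figure for an all-sky map is easy to verify; a back-of-envelope sketch, taking the full sky as 41,253 square degrees:

```python
SKY_DEG2 = 41_253             # square degrees over the whole sky
ARCSEC2_PER_DEG2 = 3600 ** 2  # arcsec^2 in one deg^2

pixels = SKY_DEG2 * ARCSEC2_PER_DEG2 / 0.1 ** 2  # 0.1" pixels
bytes_total = pixels * 2                         # 16 bits = 2 bytes/pixel

print(f"{bytes_total / 1e12:.0f} TB")  # ~107 TB, i.e. ~100 TB as quoted
```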
data rates : user delivery
• disk I/O and bandwidth
 – end-user bottlenecks will get WORSE
 – but links between data centres can be good
• move from download to service paradigm
 – leave the data where it is
 – operations on data (search, cluster analysis, etc) as services
 – shift the results not the data
 – networks of collaborating data centres (datagrid or VO)
user demands
• bar constantly rising
 – online ease
 – multi-archive transparency
 – easy data intensive science
• new requirements
 – automated resource discovery (intelligent Google)
 – cheap I/O and CPU cycles
 – new standards and software infrastructure
the virtual observatory
(4)
the VO concept
• web : all docs in the world inside your PC
• VO : all databases in the world inside your PC
Generic science drivers
• data growth
• multi-archive science
• large database science
can do all this now, but needs to be fast and easy
• empowerment
Beijing as good as Berkeley
what it's not
• not a monolith
• not a warehouse
VO framework
• framework + standards
• inter-operable data
• inter-operable software modules
• no central VO-command
- it's not a thing
- it's a way of life
VO geometry
• not a warehouse
• not a hierarchy
• not a peer-to-peer system
• small set of service centres and large population of end users
– note : latest hot database lives with creators / curators
yesterday
[diagram: browser frontend sends a CGI request to a web page, which issues SQL to the DB engine; data returns to the web page, html to the browser]
today
[diagram: any application exchanges SOAP/XML requests and data with a web service, which issues SQL to the DB engine over native data; standard formats at the interface]
tomorrow
[diagram: any application submits a job to, and gets results from, a network of grid-connected web services that publish WSDL; infrastructure: Registry, Workflow, GLUE, Certification, VO Space; standard semantics]
publishing metaphor
• facilities are authors
• data centres are publishers
• VO portals are shops
• end-users are readers
• VO infrastructure is distribution system.
International VO Alliance (IVOA)
IVOA standards
• formal process modelled on W3C
• technical working groups and interop workshops
• agreed functionality roadmap
IVOA standards
• key standards so far
 – table formats
 – resource and service metadata definitions
 – semantic dictionary
 – protocols for image and spectrum access
• coming year
 – grid and web service interfaces
 – authentication
 – storage sharing protocols
 – application metadata and interfaces
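The table-format standard referred to here is VOTable, an XML encoding of tabular data. A minimal sketch of its shape (field declarations followed by row data; simplified, not schema-complete):

```python
import xml.etree.ElementTree as ET

# a minimal VOTable-shaped document (simplified illustration)
votable_xml = """<VOTABLE>
  <RESOURCE>
    <TABLE>
      <FIELD name="RA" datatype="double" unit="deg"/>
      <FIELD name="DEC" datatype="double" unit="deg"/>
      <DATA><TABLEDATA>
        <TR><TD>180.0</TD><TD>-1.2</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

root = ET.fromstring(votable_xml)
fields = [f.get("name") for f in root.iter("FIELD")]
print(fields)  # ['RA', 'DEC']
```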
state of implementations
• key projects : AstroGrid, US-NVO, Euro-VO
• many compliant data services
• VO aware tools
• mutually harvesting registries
• workflow system
• simple shared storage
• AstroGrid has ~100 registered users
• first science results coming out
coming year
• single sign on
• internationally shared storage
• NGS link up
• many more tools
next steps
• intelligent glue
 – ontology, agents
• analysis services
 – cluster analysis, multi-D visualisation, etc
• theory services
 – simulated data, models on demand
• embedding facilities
 – VO ready facilities
 – links to data creation
lessons
• drivers :
 – end user bottleneck
 – end user demand
 – empowerment
• need network of healthy data centres
• need last mile investment
• need facilities to be VO ready
• need continuing technology development
• need continuing standards programme
FIN