CERN Open Data and Data Analysis Knowledge Preservation
Tibor Šimko
Digital Library 2015 · 21–23 April 2015 · Jasná, Slovakia
@tiborsimko · @inveniosoftware 1 / 26
1
Invenio
What is Invenio?
digital library and document repository software
– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, software, data
some Invenio-based services at CERN:
co-developed by an international collaboration
participating in EU FP7/H2020 projects
“Data behind plots”
“Code you can cite”
automated GitHub ↔ Zenodo bridge
push new release to GitHub → automatic archival on Zenodo
software preserved, minted with a DOI, made citable
https://guides.github.com/activities/citable-code
Code ↔ Data ↔ Paper
link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE)
example: hep-ex/0011057, arXiv:1401.0080
2
Data Analysis
CERN LHC Experiments
Large Scale Solutions
Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NIC
Grid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre
Preserve an Analysis?
Big Data?
data           scale            knowledge
raw            ∼GB / sec        calibration, conditioning
reconstructed  ∼PB / year       filtering, selection
reduced        ∼TB / analysis   user code, physics objects
publication    ∼GB / analysis   correlation, data behind plots
[Diagram: Analysis Train. Labels: input ..., filtering, code, output ...]
Knowledge Capture
System Architecture
[Diagram labels]
Analysis
analysis-preservation.cern.ch
file storage abstraction layer: CASTOR, Ceph, Box, AFS, Drive, EOS, S3
GitHub, SVN, TWiki, SharePoint, INSPIRE, CDS, ..., CADI
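The idea of a file storage abstraction layer can be sketched as below. This is only an illustration under stated assumptions, not the code behind analysis-preservation.cern.ch: the class names `StorageBackend`, `InMemoryBackend` and `StorageLayer`, the `scheme://path` location format and the `eos` scheme label are all hypothetical, and an in-memory dictionary stands in for a real backend such as EOS or S3.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical common interface that every storage backend implements."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryBackend(StorageBackend):
    """Stand-in backend for illustration; a real one would talk to EOS, S3, etc."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class StorageLayer:
    """Routes 'scheme://path' locations to the backend registered for the scheme."""

    def __init__(self) -> None:
        self._backends: dict[str, StorageBackend] = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def _resolve(self, location: str) -> tuple[StorageBackend, str]:
        # Split "eos://a/b" into scheme "eos" and path "a/b".
        scheme, sep, path = location.partition("://")
        if not sep or scheme not in self._backends:
            raise ValueError(f"no backend registered for {location!r}")
        return self._backends[scheme], path

    def read(self, location: str) -> bytes:
        backend, path = self._resolve(location)
        return backend.read(path)

    def write(self, location: str, data: bytes) -> None:
        backend, path = self._resolve(location)
        backend.write(path, data)


# Usage: callers only see locations, never the backend behind them.
layer = StorageLayer()
layer.register("eos", InMemoryBackend())  # "eos" is just a label here
layer.write("eos://analysis/ntuple.root", b"payload")
print(layer.read("eos://analysis/ntuple.root"))
```

With this shape, adding another storage system means registering one more backend; the calling code does not change.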
Knowledge Representation
record format: extended MARC21
– “technical” metadata: beyond bytes
e.g. 256 «computer file characteristics»
$a characteristics $e events $t text
$b bytes $f files ...
– “knowledge” metadata: semantics
e.g. 505 «formatted contents note» CSV column information
$t title $g miscellaneous
internal format: JSON
[Diagram labels: JSON, MARC21, EAD, schema model]
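How the MARC21 fields above might map onto the internal JSON format can be sketched as follows. The tags and subfield meanings (256 with $a, $b, $e, $f, $t; 505 with $t, $g) come from the slide; the JSON key names, the `marc_to_json` helper and all sample values are assumptions made up for illustration, not the actual schema.

```python
import json

# Hypothetical JSON key for each MARC tag (names are assumptions).
FIELD_KEYS = {
    "256": "file_characteristics",
    "505": "contents_note",
}

# Hypothetical JSON key for each (tag, subfield code) pair named on the slide.
SUBFIELD_KEYS = {
    ("256", "a"): "characteristics",
    ("256", "b"): "bytes",
    ("256", "e"): "events",
    ("256", "f"): "files",
    ("256", "t"): "text",
    ("505", "t"): "title",
    ("505", "g"): "miscellaneous",
}


def marc_to_json(fields):
    """Convert a list of (tag, {code: value}) pairs into a nested dict.

    Each MARC field becomes one entry in a list, so repeated fields
    (for example one 505 per CSV column) are all kept.
    """
    record = {}
    for tag, subfields in fields:
        entry = {SUBFIELD_KEYS[(tag, code)]: value
                 for code, value in subfields.items()}
        record.setdefault(FIELD_KEYS[tag], []).append(entry)
    return record


# Invented sample record, for illustration only.
sample = [
    ("256", {"b": "1024", "e": "500", "f": "2"}),
    ("505", {"t": "pt", "g": "transverse momentum column"}),
    ("505", {"t": "eta", "g": "pseudorapidity column"}),
]

print(json.dumps(marc_to_json(sample), indent=2))
```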
3
Open Data
Opening Up
Data policies:
restricted → embargo period → open
“[...] Data with high abstraction, such as AOD, will be conditionally made publicly available after an embargo period of 5 years after publication for 10% of the data and 10 years for 100% of the data [...]” (ALICE Data Policy)
Challenges:
audience:
– data miners
– citizen scientists
– high-school students
– general public
computing:
– exploring in the browser
– specialised VMs
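The embargo rule quoted from the ALICE Data Policy above can be read as a simple step function. The sketch below is one possible reading, assuming the release happens exactly at the 5-year and 10-year marks and ignoring the word “conditionally”; the function name `open_fraction` is hypothetical.

```python
def open_fraction(years_since_publication: float) -> float:
    """Fraction of the data that is open under the quoted embargo rule.

    Assumed reading: nothing before 5 years, 10% from 5 years,
    100% from 10 years after publication.
    """
    if years_since_publication >= 10:
        return 1.0
    if years_since_publication >= 5:
        return 0.1
    return 0.0


for years in (2, 5, 7, 10):
    print(years, open_fraction(years))
```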
CERN Open Data Portal
Education
Visualise detector events
Basic histogramming
Research
CMS Primary Datasets 2010
CernVM Virtual Machine
Open Data? Who cares?
82,000 distinct users visited the site
21,000 distinct users viewed data records
16,000 distinct users used event display
3,000 distinct users used histogramming
7
Conclusions
CERN (Open) Data
Capturing and disseminating knowledge
of data, code, platform, processes
to enable future data reuse
(Open) Data Analysis Preservation Framework
http://opendata.cern.ch/
CERN IT J. Cowton, P. Fokianos, J. Kuncar, T. Smith, T. Šimko
CERN Library S. Dallmeier-Tiessen, P. Herterich, L. Rueda
ALICE M. Gheata, C. Grigoras
ATLAS K. Cranmer, L. Heinrich, D. Rousseau, F. Socher
CMS A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero
LHCb S. Amerio, B. Couturier, A. Trisovic
CERN CernVM J. Blomer
CERN EOS L. Mascetti
DASPOS M. Hildreth
DPHEP F. Berghaus