TRANSCRIPT
Helping Scientists do Science
Confessions of an Applied Computer Scientist
Professor Carole Goble CBE FREng FBCS
The University of Manchester, [email protected]
and the myGrid Team
http://www.mygrid.org.uk
ACM womENcourage Europe 01 March 2014, Manchester, UK
Examples from
e-Science, Computational Science, Scientific Computing
• Support global scientific collaboration, enable large scale resource, tools and results sharing, assist scientific processing, avoid unnecessary repeated work.
• Accelerate scientific discovery, improve scientific productivity, stimulate technological innovation.
• Cope with scales and speed of scientific innovation and data.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Models of Human Physiology
VPH-Share
Next Generation Genome Sequencing based Patient Diagnostics
Eagle Genomics
Astronomy & HelioPhysics analytical pipelines
HELIO, Wf4ever
Document Preservation, Digitisation
SCAPE
Systems Biology of Micro-Organisms data & model management
SysMO
Drug discovery, small molecules, targets, compounds
OpenPHACTS
Ecological Niche and Population Modelling
BioVeL
Computational, data intensive problems
Managed worlds / in the wild
Metagenomics
Ocean Sampling Day
Distributed Computing
Linking up different codes, resources, platforms & e-infrastructure.
Social Computing
Sharing different science stuff. Collaborations between different scientists.
Knowledge Computing
Describing, finding and linking up different data, models, methods, science stuff…
Computer Science
Software Engineering
Scientific Informatics, Computational Science
THEORY, PRACTICE, APPLICATION (fundamental, applied)
PRODUCT (Open Source)
PRINCIPLE
Science
“USE CASE”
Biodiversity marine monitoring and health assessment
ecological niche modelling
Data Intensive Science, Collaborative Science
Pilumnus hirtellus
Enclosed sea problem (Ready et al., 2010)
Sarah Bourlat
http://www.catalogueoflife.org/
Lots of different resources
Lots of different software
Including other researchers’ software
Zeeya Merali, Nature 467, 775-777 (2010) | doi:10.1038/467775a
Computational science: ...Error…why scientific programming does not compute.
Aleksandra Pawlik
Devasena Inupakutika
Ghaithaa Manla
Analytical cycle
• Data discovery
• Data assembly, cleaning, and refinement
• Ecological Niche Modeling
• Statistical analysis
• Data collection
• Insights
• Scholarly Communication & Reporting
• Volume
• Variety
– Integrative Multi-*
– Multi-step, repetitive process
• Volatile and Velocity
– evolving, reanalysis
• Variant
– Comparable: sweep across data & parameters
– different experiments.
• Valid
– Reporting & Replication
data, parameters, configurations
E.Science laboris
Scientific Workflow Management Systems
• Coordinate execution of services and codes.
• Dataflow at scale
• Reusable variants
• Comparable repetitions
• Import own data / codes + public libraries/datasets
• Honour hosted codes
• Shield operational complexity
• Auto-document provenance
• Package up dependencies
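The workflow ideas above (coordinated steps, reusable variants, comparable repetitions, auto-documented provenance) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how Taverna or any other workflow system is implemented: the step functions `clean`, `model` and `summarise`, the `threshold` parameter, and the provenance record fields are all hypothetical.

```python
import hashlib
import json
from itertools import product


def clean(records):
    """Data assembly step (hypothetical): drop missing values."""
    return [r for r in records if r is not None]


def model(records, threshold):
    """Stand-in analysis step (hypothetical): keep records above a threshold."""
    return [r for r in records if r > threshold]


def summarise(records):
    """Statistical step (hypothetical): mean of the records, None if empty."""
    return sum(records) / len(records) if records else None


def run_workflow(data, threshold):
    """Run the steps in order and return (result, provenance log)."""
    steps = [
        ("clean", clean, {}),
        ("model", model, {"threshold": threshold}),
        ("summarise", summarise, {}),
    ]
    provenance = []
    value = data
    for name, step, params in steps:
        value = step(value, **params)
        # Auto-document provenance: which step ran, with which parameters,
        # and a short digest of what it produced.
        digest = hashlib.sha256(json.dumps(value).encode()).hexdigest()[:12]
        provenance.append({"step": name, "params": params, "output_digest": digest})
    return value, provenance


def sweep(datasets, thresholds):
    """Comparable repetitions: the same workflow over every data/parameter pair."""
    return {
        (label, t): run_workflow(data, t)
        for (label, data), t in product(datasets.items(), thresholds)
    }
```

Calling `sweep({"site_a": [1, None, 5, 9]}, [0, 4])` runs the same three steps once per threshold and keeps a provenance log for each run, so the variants stay comparable.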
Scientific Workflow Management Systems
• Visual Programming
• Computational Lambda Calculus
• Process mining
• Adaptive & parallel computing
• Cloud computing
• SOA, Semantic Web Services
• Automated wrapping of codes
• Data integration, knowledge modelling
• Reporting & tracking
• …
E.Science laboris
Tools
Standards
Services
Design tools and practices for mortals
Shielding vs Obfuscation
Auto assembly. Guided assembly
Fragility: changes in infrastructures & resources, automated adaptation
Reproducible executions… Packaging, preservation & portability
Workflows as commodities
Security
Provenance
[Figure: two workflow execution traces compared. (i) Trace A: d1, S0, d2, S1, w, S2, y, S4, df. (ii) Trace B: d1', S0, d2, S1, z, w, S'2, y', S4, df'.]
• How, What, Where, When, Why, Who
• Trace lineage, Process history, Accountability
• The link between computation and results
• Transparency
[Woodman et al, 2011]
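Tracing lineage and comparing two runs can be sketched as a walk over a "derived from" graph. The data labels below reuse names from the Trace A / Trace B figure, but the wiring between them is an assumption made for illustration; it is not read off the figure and is not the method of the cited work.

```python
# Hypothetical "derived from" edges: each item maps to the items it was
# directly computed from. Labels echo the figure; the wiring is assumed.
TRACE_A = {"w": ["d1", "d2"], "y": ["w"], "df": ["y"]}
TRACE_B = {"w": ["d1'", "d2"], "z": ["d2"], "y'": ["w", "z"], "df'": ["y'"]}


def lineage(edges, item):
    """Return every item that `item` was transitively derived from."""
    seen = set()
    stack = [item]
    while stack:
        for parent in edges.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare(edges_a, result_a, edges_b, result_b):
    """Summarise which upstream items differ between two results."""
    a = lineage(edges_a, result_a)
    b = lineage(edges_b, result_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }
```

`compare(TRACE_A, "df", TRACE_B, "df'")` separates what the two final results have in common from what changed upstream, which is the starting point for explaining why they differ.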
Social
[Cheney, 2012]
Provenance Week, June 9-13, 2014, Cologne http://provenanceweek.dlr.de
Mind the Provenance Gap
Summarisation, Labelling, Distillation
Fine grain, Big, A White box
One System, Special tools, Collection, A Big Graph
What do I cite? What did I do? N Black boxes
Many Systems, My Lab Book, Analytics, Smart in situ Presentation
Sarah Cohen-Boulakia, Pinar Alper
Juliana Freire, Susan Davidson
Primacy of Method (a la Code)
What code was run? – which executable?
Where can I get hold of the code / script / workflow?
How does it work? What are its assumptions?
How do I version it? What’s its licence?
How fragile is it? How do we repair it?
Who authored it? How do I cite it? How do I get credit for it?
Which options did you set? What was the input data?
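Several of these questions (which executable, which options, what input data, who authored it) can be answered mechanically at run time. A minimal sketch, assuming a Python script is the "code" in question; the function name `describe_run` and the record's field names are hypothetical, not taken from any tool named in the talk.

```python
import hashlib
import platform
import sys
from datetime import datetime, timezone


def describe_run(script_path, options, input_bytes, author):
    """Build a small record of what was run, with which options and input."""
    return {
        "executable": sys.executable,                  # which interpreter ran
        "python_version": platform.python_version(),  # which version of it
        "script": script_path,                         # where to find the code
        "options": dict(options),                      # which options were set
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),  # which data
        "author": author,                              # who to credit and cite
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing such a record next to each result does not answer "how fragile is it?" or "what are its assumptions?", but it does make "what did you run, on what?" answerable later.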
Primacy of Methods
Systems Biology
Sharing and interlinking Methods, Models, Data…
Data
Model
Article
External Databases
Metadata
experimentalists, modellers, X-informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders, social scientists, librarians, curators
Social Computation
Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists
e-Laboratories, collaboratories, VREs, repositories
An ego-system
“Startup-Like” Balance Innovation with Usefulness
[Josh Sommer]
Knowledge Turns amongst Scientists
E.Science Sociam
• HCI, Human Factors
• Security
• Data and Knowledge management
• Distributed Computing
• Digital Preservation
• Social Machines
• Information Systems
• Social Science
Platforms
Standards
Services
Policies/Practices
Scientists Share Strategically and Sparingly
Data Hugging
Sharing
Creep
Data Flirting
Data Voyeurism
Collaborating to Compete
Reward, Cost, Risk
Tools
Computer Scientist
Software Engineer
Social Engineer
credit is like ♥ not £$€¥
• Universal identity
• Inter-platform tracking
• Auto-tracking
• Credit recommendation
• Credit recognition
• Standards
• Tools
• Socio-Technical development
• Credit for Developers!!
Liz Lyon
Kaitlin Thaney
Heather Piwowar
Katy Borner
Victoria Stodden
Christine Borgman
Anita De Waard
Rebecca Lawrence
Describing X well enough to share it, find it, understand it, reuse it, combine it with Y & Z
X, Y, Z = data, models, methods, workflows, services, codes, *
Knowledge Computation
• Accurate, intelligible and comparable descriptions
• Data interoperability
• Machine readable metadata
Semantic technologies, Ontologies, Linked Data, Data schema
Semantic Description
Describing and linking data in terms of shared concepts, relationships and identifiers
[Figure: a source data table mapped onto an ontology. Legend: object property, data property, subClassOf]
Ontology classes: Person, Organization, Place, State, City, Event
Ontology properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, endDate, title, isPartOf, postalCode
Data:
Column 1         Column 2   Column 3    Column 4       Column 5
Bill Gates       Oct 1955   Microsoft   Seattle        WA
Mark Zuckerberg  May 1984   Facebook    White Plains   NY
Larry Page       Mar 1973   Google      East Lansing   MI
[Taheriyan et al, adapted]
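Describing the anonymous columns in terms of shared concepts can be sketched as turning each row into subject, property, value statements. The class and property names come from the ontology in the figure; which column maps to which property (for example, Column 4 as `bornIn`) is an assumption made here for illustration, and the `slug` identifiers are hypothetical.

```python
ROWS = [
    ("Bill Gates", "Oct 1955", "Microsoft", "Seattle", "WA"),
    ("Mark Zuckerberg", "May 1984", "Facebook", "White Plains", "NY"),
    ("Larry Page", "Mar 1973", "Google", "East Lansing", "MI"),
]


def slug(text):
    """Make a simple local identifier from a label (hypothetical scheme)."""
    return text.lower().replace(" ", "_")


def row_to_triples(row):
    """Map one table row onto ontology terms (column mapping is assumed)."""
    name, birthdate, org, city, state = row
    person = f"person/{slug(name)}"
    org_id = f"org/{slug(org)}"
    city_id = f"city/{slug(city)}"
    state_id = f"state/{slug(state)}"
    return [
        (person, "rdf:type", "Person"),
        (person, "name", name),            # data property
        (person, "birthdate", birthdate),  # data property
        (person, "worksFor", org_id),      # object property
        (org_id, "rdf:type", "Organization"),
        (org_id, "name", org),
        (person, "bornIn", city_id),       # object property
        (city_id, "rdf:type", "City"),
        (city_id, "name", city),
        (city_id, "state", state_id),      # object property
        (state_id, "rdf:type", "State"),
        (state_id, "name", state),
    ]
```

Once rows are expressed this way, "Column 3" stops being an opaque header: two datasets that both use `worksFor` and `Organization` can be linked on those terms.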
Environment Ontology: a shared, controlled, structured vocabulary for biomes, environmental features, and environmental materials.
Common source of names and synonyms for matching, linking, searching, indexing, structuring data
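Using a controlled vocabulary as a common source of names and synonyms can be sketched as a lookup from free-text values to term identifiers. The two entries below are made-up placeholders with `EX:` identifiers; they are not real Environment Ontology terms or IDs.

```python
# Hypothetical vocabulary entries; identifiers and synonyms are placeholders.
VOCAB = {
    "EX:0001": {"label": "marine biome", "synonyms": ["ocean biome", "sea biome"]},
    "EX:0002": {"label": "sea water", "synonyms": ["seawater", "ocean water"]},
}


def build_index(vocab):
    """Map every label and synonym (case-insensitively) to its term id."""
    index = {}
    for term_id, entry in vocab.items():
        for name in [entry["label"], *entry["synonyms"]]:
            index[name.casefold()] = term_id
    return index


def annotate(values, index):
    """Attach a term id to each free-text value, or None if unmatched."""
    return {value: index.get(value.strip().casefold()) for value in values}
```

With this index, sample sheets that say "Seawater" and "ocean water" resolve to the same identifier and can be searched and linked together; unmatched values come back as `None` and can be queued for curation.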
Web Ontology Language OWL
E.Science Semantii
• Database theory
• Query Answering
• Description Logics
• Reasoners
• Artificial Intelligence
• Automated annotation
• Data integration & Search
• Crowd sourcing knowledge
• Knowledge elicitation
Tools
Standards
Resources
Scalability
Changes in data & metadata
Crowd sourced Annotation
Rich knowledge representation and reasoning
Pay as you Go Integration
Adding Semantics to Data
Capturing metadata
security
Semantic ETL pipelines
Smart Search
Curation Knowledge Ramps
Populous
http://www.rightfield.org.uk
Katy Wolstencroft
http://www.economist.com/printedition/2013-10-19
Lemberger T Mol Syst Biol 2014;10:715
©2014 by European Molecular Biology Organization
Born Reproducible | Exchangeable | Reusable
Rich descriptions
Open & Available
Transparent Method
Re-executable
Research Objects
• Bundle and relate multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms
• Exchange, Releasing paradigm for publishing
http://www.researchobject.org/
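The bundling idea can be sketched as a manifest that aggregates resources hosted in different places and records how each relates to the investigation. This is a simplified illustration, not the Research Object specification: the field names (`aggregates`, `role`), the required roles, and the `example.org` URIs are all hypothetical.

```python
def build_manifest(title, creator, resources):
    """Bundle (uri, role) pairs for one investigation into a manifest dict."""
    return {
        "title": title,
        "createdBy": creator,
        "aggregates": [{"uri": uri, "role": role} for uri, role in resources],
    }


def roles_missing(manifest, required=("data", "method", "result")):
    """List required roles that no aggregated resource fills yet."""
    present = {resource["role"] for resource in manifest["aggregates"]}
    return sorted(set(required) - present)


# Example with placeholder URIs on different hosts.
manifest = build_manifest(
    "Niche model run",
    "A. Scientist",
    [
        ("https://data.example.org/occurrences.csv", "data"),
        ("https://workflows.example.org/niche-model", "method"),
    ],
)
```

Here `roles_missing(manifest)` reports that a result is still missing, a simple completeness check of the kind that a release-style publishing process would run before the bundle is exchanged.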
Jun Zhao
Research is like software. Release research
Jennifer Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
"To be a proper professional you need to think about the context and motivation and justifications of what you're doing... once you see how important computing is for life you can't just leave it as a blank box and assume that somebody reasonably competent and relatively benign will do something right with it."
Karen Spärck Jones
IEEE Spectrum, Computer Science, A Woman's Work May 2007
• myGrid – http://www.mygrid.org.uk
• Taverna – http://www.taverna.org.uk
• myExperiment – http://www.myexperiment.org
• BioCatalogue – http://www.biocatalogue.org
• Biodiversity Catalogue – http://www.biodiversitycatalogue.org
• Seek – http://www.seek4science.org
• Rightfield – http://www.rightfield.org.uk
• Open PHACTS – http://www.openphacts.org
• Wf4ever – http://www.wf4ever-project.org
• Software Sustainability Institute – http://www.software.ac.uk
• BioVeL – http://www.biovel.eu
• Force11 – http://www.force11.org