Curation of Chemistry Data from the Laboratory
to Publication
Jeremy Frey & Simon Coles
School of Chemistry
University of Southampton
AHM2006 Data Curation Workshop
The CombeChem Project
End-to-end linking of data and information
Laboratory to publication and back again
Very long data chains can be involved, e.g. from a chemistry lab to mouse genetic expression
The exponential world of combinatorial synthesis and high-throughput analysis meets the exponentially growing power of computing: “Automation, Semantics & the Grid”
[Diagram labels: Plan & COSHH; Digital Model; Information Integration; Report; Knowledge; Goal; Literature; Synthesis; Analysis; Smart Laboratory; Smart Storage; Smart Dissemination; Smart HCI]
Not just one laboratory but many co-laboratories working together
Problems with ‘Small Laboratory’ Working Practice
“Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers – or if they did might not know what those numbers meant”
“Lost in some research assistant’s computer, the data are often irretrievable or an undecipherable string of digits”
“To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data”
“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”
‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
The concept of Publication@Source
Trace all the way back from publication to the original data: provenance
The data is the key: DataGrid
Start as you mean to go on: ELNs are a necessity
Curation of subsequently produced data
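The idea of tracing back from a publication to the original data can be sketched in a few lines. This is only an illustration under stated assumptions: the `Item` class, the `derived_from` field and the example item names are hypothetical and are not taken from the CombeChem software.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One link in a data chain (hypothetical structure, for illustration)."""
    name: str
    derived_from: list = field(default_factory=list)


def trace_to_source(item):
    """Return the names of the original (underived) items behind `item`."""
    if not item.derived_from:
        return [item.name]
    sources = []
    for parent in item.derived_from:
        for name in trace_to_source(parent):
            if name not in sources:  # report each source only once
                sources.append(name)
    return sources


# Hypothetical chain: raw data and a notebook entry feed a result,
# which feeds a figure, which feeds the paper.
raw = Item("raw instrument data")
notebook = Item("ELN entry")
result = Item("processed result", [raw, notebook])
figure = Item("figure in paper", [result])
paper = Item("publication", [figure, result])

print(trace_to_source(paper))
```

The chain can only be followed if every derived item records what it came from at the moment it is created, which is why the slide insists on starting with an ELN.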
Observations are never collected on note pads, filter paper or other temporary paper for later transfer into a notebook.
If you are caught using the “scrap of paper” technique, your improperly recorded data may be confiscated by your TA.
This is where it all starts: The Lab & The Lab Book
Lab books are a big block to Publication@Source: if it’s not digital, it is more difficult to share
Need a usable digital lab book. Design by analogy to help Chemists and Computer Scientists work together.
Only some equipment is networked
COSHH: leverage off things we already have to do
Combechem, 30 January 2004 (gvh, hrm, gms)
[Diagram: the To Do List, the Plan and the Process Record for one synthesis, drawn as linked Process, Input, Literal and Observation nodes]

Ingredient List
Fluorinated biphenyl 0.9 g
Br11OCB 1.59 g
Potassium Carbonate 2.07 g
Butanone 40 ml

To Do List
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30 ml)
Extract with DCM (3 x 40 ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol

Plan
Add, Add, Reflux, Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography

Process Record
Sample of 4-fluorinated biphenyl: Weigh, 0.9031 g
Butanone: Measure, 40 ml
Sample of K2CO3 powder: Weigh, 2.0719 g
Sample of Br11OCB: Weigh, 1.5918 g
Water: Measure, 30 ml
DCM: Measure, 3 of 40 ml
MgSO4: Measure, excess
Silica
Ether/Petrol: Ratio

Annotations (text)
Butanone dried via silica column and measured into 100 ml RB flask. Used 1 ml extra solvent to wash out container.
Started reflux at 13:30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
Inorganics dissolve, 2 layers. Added brine ~20 ml.
Organics are yellow solution
Washed MgSO4 with DCM ~50 ml

Observation Types
weight - grammes
measure - ml, drops
annotate - text
temperature - K, °C

Key: Process, Input, Literal, Observation

Future Questions
Whether to have many subclasses of processes or fewer with annotations
How to depict destructive processes
How to depict taking lots of samples
What is the observation/process boundary? e.g. MRI scan
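The relationship between the plan and the process record can be sketched as a small data structure. The amounts below are the ones shown on the slide; the class names, the field names and the `deviations` helper are hypothetical and only illustrate how planned quantities can supply metadata that recorded observations are later compared against.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    kind: str          # "weight", "measure", "annotate" or "temperature"
    value: object
    unit: str = ""


@dataclass
class Step:
    process: str       # e.g. "Weigh", "Measure", "Reflux"
    material: str = ""
    observations: list = field(default_factory=list)


# Planned amounts, from the ingredient list.
plan = {
    "Fluorinated biphenyl": (0.9, "g"),
    "Br11OCB": (1.59, "g"),
    "Potassium Carbonate": (2.07, "g"),
    "Butanone": (40, "ml"),
}

# What was actually observed, from the process record.
record = [
    Step("Weigh", "Fluorinated biphenyl", [Observation("weight", 0.9031, "g")]),
    Step("Measure", "Butanone", [Observation("measure", 40, "ml")]),
    Step("Weigh", "Potassium Carbonate", [Observation("weight", 2.0719, "g")]),
    Step("Reflux", "", [Observation("annotate", "Only reflux for 45 min")]),
    Step("Weigh", "Br11OCB", [Observation("weight", 1.5918, "g")]),
]


def deviations(plan, record):
    """Map each planned material to (observed - planned), in the planned unit."""
    result = {}
    for step in record:
        if step.material not in plan:
            continue
        planned, unit = plan[step.material]
        for obs in step.observations:
            if obs.kind in ("weight", "measure") and obs.unit == unit:
                result[step.material] = round(obs.value - planned, 4)
    return result


print(deviations(plan, record))
```

Free-text annotations stay attached to their step but are ignored by the numerical comparison, which mirrors the separate "annotate" observation type on the slide.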
Making Tea (Smarttea.org)
[Diagram: the tea experiment as an RDF graph, with Steps, Plan and Process Record layers]
File: combechem/process/tea.rdf
Ontology: combechem/process/process-record.rdfs
13:41:36 14 July 2004 © 2004 University of Southampton

Key: Process, Input, Literal, Observation

Experiment: MakingTea
experiment-pretty-name: The basic tea experiment
experiment-description: Add tea leaves to hot water, refluxing, filtering, drinking (maybe)
experimenter: http://www.ecs.soton.ac.uk/info/#person-00389

Steps (step-text)
step1: Add tea to hot water
step2: Heat tea for 5 minutes
step3: Filter off tea leaves

Plan nodes
plan-to-weigh_tea_leaves, planned_tea_leaves, planned-weight_of_tea_leaves (5, &cec;massunit-gramme)
plan-to-measure_some_water, planned_some_water, planned-volume_of_some_water (300, &cec;volumeunit-millilitre)
plan-to-add_tea_to_water, plan-to-tea_in_water, plan-to-heat_tea_in_water, plan-to-hot_tea, plan-to-filter_tea, plan-to-finished_tea

Process Record nodes
weighing_tea_leaves, tea_leaves, weight_of_tea_leaves (5.021, &cec;massunit-gramme)
measuring_some_water, some_water, volume_of_some_water (310, &cec;volumeunit-millilitre)
add_tea_to_water, tea_in_water, heat_tea_in_water, hot_tea, filter_tea, finished_tea
watching_tea_boil, heat_tea_notes (value: <tabletscribble>)

Properties
processed-by, processed-by-iv, produces-substance, produces-observation, has-unitvalue, value
material-observed-by, process-observed-by, material-is-ingredient-of
process-record-of, material-record-of
next-step, part-of-step, step-text, starting-step, starting-process
experiment-pretty-name, experiment-description, experiment-goal, experimenter

Namespaces
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
xsd http://www.w3.org/2001/XMLSchema#
akt http://www.aktors.org/ontology/portal#
cml http://www.xml-cml.org/schema/cml2/core
cec http://www.combechem.org/ontology/process/0.1#
st http://smarttea.org/#

getRecord()
There is a potential containment problem in pulling back partial RDF graphs from the triple store.
Solved by using multiple triple stores, but boundaries are a major issue for the future.
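The containment problem can be illustrated with plain tuples standing in for RDF triples. The node and property names are the ones on the slide, but the exact edges between them are a guess, and the `get_record` function is a hypothetical sketch, not the CombeChem `getRecord()` implementation. It fixes the boundary by expanding only along a chosen set of properties.

```python
from collections import deque

# (subject, predicate, object) tuples; edge directions are assumed for illustration.
TRIPLES = [
    ("tea_leaves", "processed-by", "add_tea_to_water"),
    ("add_tea_to_water", "produces-substance", "tea_in_water"),
    ("add_tea_to_water", "process-record-of", "plan-to-add_tea_to_water"),
    ("plan-to-add_tea_to_water", "next-step", "plan-to-heat_tea_in_water"),
    ("tea_in_water", "processed-by", "heat_tea_in_water"),
    ("heat_tea_in_water", "produces-substance", "hot_tea"),
    ("hot_tea", "processed-by", "filter_tea"),
    ("filter_tea", "produces-substance", "finished_tea"),
]

# Properties that are treated as staying inside the process record.
RECORD_LINKS = {"processed-by", "produces-substance"}


def get_record(start, triples, follow):
    """Collect triples reachable from `start`, expanding only along `follow`.

    Triples with other predicates are kept as outgoing references,
    but the nodes they point to are not expanded.
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            if s != node:
                continue
            found.append((s, p, o))
            if p in follow and o not in seen:
                seen.add(o)
                queue.append(o)
    return found


record = get_record("tea_leaves", TRIPLES, RECORD_LINKS)
for triple in record:
    print(triple)
```

The link from the record to its plan is returned as a reference, but the plan graph itself is not pulled back. Where to draw that line is exactly the boundary question the slide raises.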
Architecture
[Diagram labels: Applications (Bench, Planner0, Viewer0, Weights & Measures); “Client” Libraries (PHP, Java); SOAP; SURIG; Jena; Data stores; Semantic Data; Other services; Institutional archives and metadata publication]
The Analytical Laboratory
Capture information from places you would not want to put your eyes
Capture environmental data automatically
Capture people and movements
Provide this information in real time as well as for the laboratory record
[Diagram labels: Data Source (x2); Message Broker; Translator Service; Archive Client; Web Client; Mobile phone; PDA; BLOG]
Pub-Sub systems provide a flexible & extensible approach to distribution
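A minimal in-memory sketch of the publish-subscribe pattern shows why it is easy to extend. The `MessageBroker` class, the topic name and the 25.0 threshold are all hypothetical, and the code does not represent the broker software used in the laboratory.

```python
from collections import defaultdict


class MessageBroker:
    """Toy broker: routes each published message to every subscriber of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers.get(topic, []):
            callback(topic, message)


archive = []   # stands in for the archive client
alerts = []    # stands in for a message sent to a mobile phone


def archive_client(topic, reading):
    archive.append((topic, reading))


def phone_alert(topic, reading):
    if reading > 25.0:  # hypothetical alarm threshold in degrees C
        alerts.append(f"{topic}: {reading} above 25.0")


broker = MessageBroker()
broker.subscribe("lab/temperature", archive_client)
broker.subscribe("lab/temperature", phone_alert)

# A data source only publishes; it does not know who is listening.
broker.publish("lab/temperature", 21.5)
broker.publish("lab/temperature", 29.0)

print(archive)
print(alerts)
```

Adding another consumer, such as a blog or a web client, is one more `subscribe` call and needs no change to the data sources.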
Temperature – room, laser
Door & interlock, Motion Sensors
Air Conditioning failed
Databases - Our experience
What do you do when the actual users keep changing their mind?
Is a traditional relational database suitable?
Danger of reinforcing scientific bias against relational databases for laboratory data.
RDF & triple stores were again the solution
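One reason a triple store copes with changing requirements is that a new kind of fact is just another triple, with no table to alter. The sketch below uses a Python set as a stand-in for a triple store; the property names and values are invented for illustration.

```python
store = set()


def add(subject, predicate, obj):
    """Record one fact as a (subject, predicate, object) triple."""
    store.add((subject, predicate, obj))


def query(subject=None, predicate=None, obj=None):
    """Return matching triples, sorted; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Facts the users asked for at the start (invented examples).
add("sample-1", "has-mass", "0.9031 g")
add("sample-2", "has-mass", "2.0719 g")

# Later the users decide colour matters too: no schema change is needed.
add("sample-1", "has-colour", "yellow")

print(query(subject="sample-1"))
print(query(predicate="has-colour"))
```

In a relational design the same change would usually mean adding a column or a new table, which is the friction the slide alludes to.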
RDF/RDFS high-level schema for chemical properties
Triple Stores - The Heart of the Semantic Web
Scaling - 3Store response
Memory leak in testing program!
Scaling the triplestores
Moved from…
A model of harvesting data from multiple sources into one scalable store
to
A model of distributed RDF sources and caching what is needed for the task at hand into multiple stores fit-for-purpose
The Semantic Web!
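The second model can be sketched as a function that builds a small task-specific store from several sources rather than copying everything into one place. The source names, triples and the `cache_for_task` helper are hypothetical; real sources would be remote RDF endpoints rather than in-memory lists.

```python
# Each source holds its own triples (invented examples).
SOURCES = {
    "synthesis-lab": [
        ("compound-1", "has-mass", "0.9031 g"),
        ("compound-1", "made-by", "experiment-7"),
    ],
    "crystallography-lab": [
        ("compound-1", "has-unit-cell", "cell-42"),
        ("compound-2", "has-unit-cell", "cell-43"),
    ],
}


def cache_for_task(sources, wanted_predicates):
    """Copy only the triples a task needs, remembering where each came from."""
    store = set()
    for source_name, triples in sources.items():
        for s, p, o in triples:
            if p in wanted_predicates:
                store.add((s, p, o, source_name))
    return store


# A task that only cares about unit cells gets a small, fit-for-purpose store.
task_store = cache_for_task(SOURCES, {"has-unit-cell"})
for row in sorted(task_store):
    print(row)
```

Keeping the source name alongside each cached triple preserves a route back to the original data, in line with the provenance theme of the talk.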
Experiments on the Grid: The NCS Service
HTTPS
Binary raw data archived in Atlas Datastore
x300
ADS£’s
A Data-Rich Subject: the Crystallography Problem
[Figure: chemical structure diagrams]
30,000,000
1,500,000
450,000
The eCrystals Digital Repository
http://ecrystals.chem.soton.ac.uk
Access to the underlying data
The eCrystals ‘Global’ Model
[Diagram labels: Laboratory repository; Deposit; Institutional data repositories; Validation; Aggregator services; Search, harvest; Data discovery, linking, citation; Data analysis, transformation, mining, modelling; Presentation services / portals; Publishers: peer-review journals, conference proceedings, etc; Publication; Preservation and curation]
Laboratory Repositories and Information Management
Need for a data archive in the laboratory
Not just the published spectra!
The R4L Repository
Deposit
Search / Browse
Create new compound
Add experiment data and metadata
Several groups making and analysing; the library
Administrative Domains transfer or share the data
[Diagram labels: Researcher; Research Group (x2); Institution; National Archive; International Database]
SVG “active” graphics
Link to data, follow links back to the raw data archive
Link to simulation, full simulation data archived in BioSimGrid
R4L
Paper organized using RDF
Summary:
Making sure other people can find, understand and re-use your data easily and with confidence (even when there is a huge amount of it!)
Make use of Plans to inform the digital context - metadata in advance
Have concern for the “End-to-End life cycle” of chemistry information from the start.
Understanding Usability and Human Computer Interaction is vital for adoption