How chemists use data
Dr William G Town
President, Kilmorie Consulting
[email protected]
Fourth Bloomsbury Conference on E-publishing and E-publications
24th and 25th June 2010
Overview
• Chemistry documentation perspective
• Case study – CCDC
• Study of chemists' behaviour – JISC
• RSC Project Prospect
• RSC ChemSpider
• OreChem project
Chemists have a long tradition of documenting chemistry
• Gmelin Handbook of Chemistry (1817 - )
• Beilstein Handbook of Organic Chemistry (1881 - )
• Chemical Abstracts (1907 - )
– CAS Online (1983 - )
– STN Express (1987 - )
– SciFinder (1995 - )
• ChemWeb (1996 - )
• Reaxys (2009 - )
Chemists have a long tradition of documenting chemistry
• Data centres (e.g. CCDC) started in the 1960s
• Extensive chemical database activities
– Bibliographic databases (1960s – ) (e.g. CAS)
– Factual databases (1980s – ) (e.g. Beilstein)
– Open access databases (2000s – ) (e.g. CrystalEye)
What’s the status of chemistry online?
• Encyclopaedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Virtual Screening databases
• Property databases
• Screening assay results
• Patents with chemical structures (IBM & SureChem)
• ADME/Tox data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
• Commercial databases
Cambridge Crystallographic Data Centre (CCDC)
Founded in 1965 with grant funding in the Department of Chemistry, University of Cambridge
Self-financing, self-administering institution since 1987
– Not-for-profit, charitable research institute
– Recognized institute for postgraduate degrees of the University of Cambridge
Objectives
– “advancement and promotion of the science of chemistry and crystallography for the public benefit”
Cambridge Structural Database
CSD Growth 1970-2010
Worldwide repository of validated small-molecule crystal structures
Dec 09 – 500,000th structure milestone reached
Lamotrigine
Acta Cryst., Sect. C: Cryst. Struct. Commun. (2009), 65, o460
Refcode: EFEMUX01
Knowledge mining using the CSD
“Crystals are windows on the world of atoms” (Chet Raymo, Boston Globe, Science Musings)
CSD System search and analysis software permits structural knowledge in the CSD to be mined from the raw data, to generate:
• Crystallographic knowledge
• Intra-molecular structural knowledge
• Inter-molecular structural knowledge
Knowledge mining using the CSD: Scientific Applications
• Structural chemistry and crystal engineering
• Rational drug discovery and design
• Protein – ligand interactions & ligand docking
• Drug development, formulation and delivery
• Materials research and development
• Crystal structure prediction
• Crystal structure determination
A study of scholarly communication between chemists and of their use of Web 2.0 technologies
• Study commissioned by JISC (UK Joint Information Systems Committee)
• Principal contractor was Publishing Directions (Deborah Kahn – project leader)
• Project team composed of Nicki Dennis, Lara Burns and me
• Started November ’08, reported in April ’09
http://www.jisc.ac.uk/media/documents/aboutus/workinggroups/scadvocacyfinal%20report.pdf
Background to the study
Methods of scholarly communication have changed rapidly in the past decade. Improvements in computing and social networking technologies, digital data capture techniques, powerful data and text mining techniques and other technological changes enable practices that are collaborative, network based and highly intensive.
Background to the study
• We researched the needs of academics in two specific areas, economics and chemistry.
• Recommendations were made on advocacy programmes for each discipline which will be most effective for encouraging optimum take up of useful technologies and other developments which improve scholarly communication.
Use of information resources
[Bar chart: Use of information resources by research chemists - top ten. Vertical axis: percentage of sample (0 to 120). Categories, left to right: published journal papers, books, chemical structures, experimental data sets, PowerPoint presentations, images, working papers, models, CIFs, simulations or macros.]
Use of information resources
[Bar chart: Online resources used at least weekly in chemistry - top ten. Vertical axis: percentage of sample (0 to 100). Categories, left to right: e-Journals, Web of Knowledge, Wikipedia, SciFinder Scholar, structural databases, article alerts, "scholar" (label truncated), reaction databases, Scopus, spectral databases.]
Use of information resources
• High use of Wikipedia and Google Scholar but chemists use alerting services and more specialised subject based services
– This is likely to reflect the fact that chemists are taught information skills as part of their degree course
Data storage and sharing
• Chemists share datasets since they work collaboratively across institutes
• Despite considerable work around repositories and storage, data are still being stored locally rather than in institutional or subject based repositories.
• Concerns around ownership of results and of “competitors” obtaining the results need to be addressed before this will change significantly.
Three years of semantic publishing – RSC Project Prospect
What were they trying to improve?
– Discoverability
– Use
– Understanding
– Linking
And why...
• What chemistry on the web may become...
• Prolonged exposure to Peter Murray-Rust
Quick, what can we mark up?
What standards did we have in 2007?
• InChI – for some compounds
• ChEBI for some compounds and groups of compounds
• Gene/Sequence/Cell Ontologies
• IUPAC Gold Book (dictionary, really, but online)
And RDF/OWL as distribution format
30-40% of RSC publishing
What did RSC learn with Prospect?
• This is probably the way to go – 4000 articles so far
• How do they cover all subjects?
– Standards not well defined in all areas
• Scale up in manual QA
• Scale up during huge growth and scope of RSC publishing activities
• How to use all that real chemistry data?
• Pump prime to change what is asked from authors
• Is the vision the day-glo article? (“Free headache for every user”)
Ontology Add-in for Word 2007
• Phil Bourne
• Lynn Fink
• John Wilbanks
Source code and binary: http://research.microsoft.com/ontology/
Relationships: Ontology browser
Intent: Term recognition & disambiguation based on OBO or OWL formats
Services: Ontology download web service
Authoring: Chem4Word – Chemistry Drawing in Word
Author and edit 1D and 2D chemistry.
Relationships: Navigate and link referenced chemistry
Intent: Recognizes chemical dictionary and ontology terms
Intelligence: Verifies validity of authored chemistry
Available soon: http://research.microsoft.com/chem4word/
Data: Semantics stored in Chemistry Markup Language
<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
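To make "semantics stored in Chemistry Markup Language" concrete, here is a minimal sketch of how a program might read the CML fragment shown on the Chem4Word slide. It is not how Chem4Word itself works; it only uses Python's standard XML parser. The function names `parse_cml` and `hill_formula` are hypothetical, the coordinates are rounded for brevity, and the parser assumes numeric bond orders as they appear in the slide's fragment.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI copied from the slide's <cml> root element.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# The slide's molecule, with 2D coordinates rounded for brevity.
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.915" y2="0.770" />
      <atom id="a2" elementType="C" x2="-1.581" y2="1.540" />
      <atom id="a3" elementType="O" x2="-0.248" y2="0.770" />
      <atom id="a4" elementType="O" x2="-1.581" y2="3.080" />
      <atom id="a5" elementType="H" x2="-4.249" y2="1.540" />
      <atom id="a6" elementType="H" x2="-2.915" y2="-0.770" />
      <atom id="a7" elementType="H" x2="-4.249" y2="0.000" />
      <atom id="a8" elementType="H" x2="1.086" y2="1.540" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def parse_cml(text):
    """Return ({atom id: element}, [(atom id, atom id, bond order), ...])."""
    root = ET.fromstring(text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.iterfind("cml:molecule/cml:atomArray/cml:atom", CML_NS)
    }
    bonds = []
    for bond in root.iterfind("cml:molecule/cml:bondArray/cml:bond", CML_NS):
        first, second = bond.get("atomRefs2").split()
        # Assumes numeric bond orders, as in the slide's fragment.
        bonds.append((first, second, int(bond.get("order"))))
    return atoms, bonds


def hill_formula(atoms):
    """Molecular formula in Hill order: C, then H, then the rest alphabetically."""
    counts = Counter(atoms.values())
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


if __name__ == "__main__":
    atoms, bonds = parse_cml(CML_TEXT)
    print(hill_formula(atoms), len(atoms), "atoms,", len(bonds), "bonds")
```

Because the atoms and bonds are explicit in the markup, the structure (two carbons, two oxygens, four hydrogens, one double bond to oxygen) can be recovered by software rather than read off a picture, which is the point the slide makes about storing chemistry as data.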
Standards = longevity
Help implement and develop standards
– Open ontologies for chemistry
– InChI Trust
– How to publish this - pre-competition
Addressing a real need in standards
Pistoia Alliance
“An initiative to provide an open foundation of data standards, ontologies and web-services to streamline the Pharmaceutical Drug Discovery workflow”
Semantic Enrichment of the Scientific Literature (SESL) Oct09-Oct10
• Pistoia Alliance-funded
• EBI
• Elsevier, NPG, OUP, RSC
How to use this information better to benefit existing researchers – computers and humans
• Real behaviour (for humans)
• Clear requirements (for computer discovery)
media.obsessable.com
As few interfaces as possible
What do humans want?
What do computers want?
Web services
flickr.com/photos/microcosmos
A free to access online database for chemists
Website and web services
Links over 25 million compounds integrated to <300 data sources
A curation platform for the public to improve the quality of data online
A deposition platform for the public to annotate and extend the data
ChemSpider – A Pragmatic Vision
“Build a Structure Centric Community”
– Integrate chemical structure data on the web
– Create a “structure-based hub” to information and data
– Provide access to structure-based “algorithms”
– Let chemists contribute their own data
– Allow the community to curate/correct data
Why did the RSC acquire ChemSpider?
• Data versus documents
• Enhancing discoverability
• Build on cheminformatics expertise
• RSC presence in the open data space
• Critical mass of data for structure searching
• Networking chemical scientists
Crowd-sourcing chemistry curation
Identify/tag errors, edit names, synonyms, identify records to deprecate
Differences between ChemSpider, Reaxys and SciFinder
• Everything on Reaxys and SciFinder is curated
• The data resources can be over 100 years old
• The platforms are commercial and “read-only”

• ChemSpider is free, to everyone
• Data are in a state of ongoing curation & annotation
• Data resources are from the “electronic era”
• Data are expanded daily and enhanced on an ongoing basis
• The platform delivers integrated algorithm access
Future of chemistry online?
• Make the internet searchable by chemical structure and substructure by a free online service
• Aggregate and help improve disparate public sources
• Highlight high quality publications
• Test sharing and discussion of research data in the open
• Provide structural home to preserve researchers’ collections, experimental and property data
OreChem Project
• Participants
– Cambridge University
– Cornell University
– Indiana University
– Penn State University
• Funding
– Microsoft Research
– NSF
OreChem Project
• Data integration
– Representation/reuse through common data models and ontologies
• Data capture and recovery
– At source capture of experimental data and research process (ELNs)
– Compound object authoring
– Retrospective harvesting of chemistry data
• Data storage and manipulation
– Cloud-based triple store
– Chemical structure search
– Linked data integration
– Computation of properties
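As a rough illustration of what a triple store and linked data integration mean in this context, the sketch below keeps subject-predicate-object statements in memory and answers pattern queries. It says nothing about OreChem's actual data model or cloud infrastructure: the `TripleStore` class, the `ex:` identifiers and the predicate names are all hypothetical, chosen only to show how compounds, articles and vendors can be linked through a shared identifier.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy in-memory triple store; a real one would be persistent and indexed."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        for triple in sorted(self._triples):
            if all(p is None or p == value for p, value in zip(pattern, triple)):
                yield triple


# Hypothetical example data: every name below is invented for illustration.
store = TripleStore()
store.add("ex:aceticAcid", "ex:hasFormula", "C2H4O2")
store.add("ex:article1", "ex:mentions", "ex:aceticAcid")
store.add("ex:vendorA", "ex:sells", "ex:aceticAcid")
store.add("ex:article1", "ex:mentions", "ex:ethanol")

if __name__ == "__main__":
    # Everything that points at the compound, whatever its source.
    for triple in store.match(obj="ex:aceticAcid"):
        print(triple)
```

A single query on the compound identifier returns both the article and the vendor, which is the kind of integration across sources the slide describes.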
Chemistry is particularly challenging
• Commercial value of chemical information (e.g. Pharma industry)
• Nature of chemistry research culture
– Predominance of synthesis (creation) overshadows discovery mode typical of physics or biology
– Autonomy, successful research with limited reliance on others
• Dominance of scholarly societies as publishers
– ACS (CAS)
– RSC
Chemistry on the Internet – a future vision
• The “semantic web” for chemistry is in place
• Crowdsourcing is commonplace
• Chemists will search the web by “structure”
• Chemistry articles indexed and searchable
• Reduced number of searches to find data because data are integrated – compounds, vendors, syntheses, data, publications and patents
• A world of Open Access and Open Data
Acknowledgements
• Colin Groom, Gary Battle, CCDC
• Richard Kidd, RSC
• Tony Williams, RSC ChemSpider
• Carl Lagoze, OreChem