How chemists use data

Dr William G Town
President, Kilmorie Consulting

Fourth Bloomsbury Conference on E-publishing and E-publications
24th and 25th June 2010



Overview

• Chemistry documentation perspective
• Case study – CCDC
• Study of chemists' behaviour – JISC
• RSC Project Prospect
• RSC ChemSpider
• OreChem project

Chemists have a long tradition of documenting chemistry

• Gmelin Handbook of Chemistry (1817 - )
• Beilstein Handbook of Organic Chemistry (1881 - )
• Chemical Abstracts (1907 - )
  – CAS Online (1983 - )
  – STN Express (1987 - )
  – SciFinder (1995 - )
• ChemWeb (1996 - )
• Reaxys (2009 - )

Chemists have a long tradition of documenting chemistry

• Data centres (e.g. CCDC) started in the 1960s
• Extensive chemical database activities
  – Bibliographic databases (1960s – ) (e.g. CAS)
  – Factual databases (1980s – ) (e.g. Beilstein)
  – Open access databases (2000s – ) (e.g. Crystal Eye)

What’s the status of chemistry online?

• Encyclopaedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Virtual screening databases
• Property databases
• Screening assay results
• Patents with chemical structures (IBM & SureChem)
• ADME/Tox data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
• Commercial databases

Chemists like structures

[Figure: chemical structure of digitonin]

Cambridge Crystallographic Data Centre (CCDC)

Founded in 1965 with grant funding in the Department of Chemistry, University of Cambridge

Self-financing, self-administering institution since 1987
  – Not-for-profit, charitable research institute
  – Recognized institute for postgraduate degrees of the University of Cambridge

Objectives
  – “advancement and promotion of the science of chemistry and crystallography for the public benefit”

Cambridge Structural Database

[Chart: CSD growth, 1970-2010]

Worldwide repository of validated small-molecule crystal structures

December 2009: 500,000th structure milestone reached

Lamotrigine: Acta Cryst., Sect. C: Cryst. Struct. Commun. (2009), 65, o460. Refcode: EFEMUX01

Knowledge mining using the CSD

“Crystals are windows on the world of atoms”
(Chet Raymo, Boston Globe, Science Musings)

CSD System search and analysis software permits structural knowledge in the CSD to be mined from the raw data, to generate:
• Crystallographic knowledge
• Intra-molecular structural knowledge
• Inter-molecular structural knowledge

Knowledge mining using the CSD: Scientific Applications

• Structural chemistry and crystal engineering
• Rational drug discovery and design
• Protein – ligand interactions & ligand docking
• Drug development, formulation and delivery
• Materials research and development
• Crystal structure prediction
• Crystal structure determination

A study of scholarly communication between chemists and of their use of Web 2.0 technologies

• Study commissioned by JISC (UK Joint Information Systems Committee)
• Principal contractor was Publishing Directions (Deborah Kahn – project leader)
• Project team composed of Nicki Dennis, Lara Burns and me
• Started November 2008, reported in April 2009

http://www.jisc.ac.uk/media/documents/aboutus/workinggroups/scadvocacyfinal%20report.pdf

Background to the study

Methods of scholarly communication have changed rapidly in the past decade. Improvements in computing and social networking technologies, digital data capture techniques, powerful data and text mining techniques and other technological changes enable practices that are collaborative, network based and highly intensive.

Background to the study

• We researched the needs of academics in two specific areas, economics and chemistry.
• For each discipline, recommendations were made on the advocacy programmes likely to be most effective in encouraging optimum take-up of useful technologies and other developments that improve scholarly communication.

Use of information resources

[Chart: Use of information resources by research chemists – top ten. Vertical axis: percentage of sample (0 to 120). Categories, in order: published journal papers, books, chemical structures, experimental data sets, PowerPoint presentations, images, working papers, models, CIFs, simulations or macros.]

Use of information resources

[Chart: Online resources used at least weekly in chemistry – top ten. Vertical axis: percentage of sample (0 to 100). Categories, in order: e-Journals, Web of Knowledge, Wikipedia, SciFinder Scholar, structural databases, article alerts, Google Scholar, reaction databases, Scopus, spectral databases.]

Use of information resources

• High use of Wikipedia and Google Scholar, but chemists also use alerting services and more specialised subject-based services
  – This is likely to reflect the fact that chemists are taught information skills as part of their degree course

Data sharing

Data storage

Data storage and sharing

• Chemists share datasets since they work collaboratively across institutes
• Despite considerable work around repositories and storage, data are still being stored locally rather than in institutional or subject-based repositories.
• Concerns around ownership of results and of “competitors” obtaining the results need to be addressed before this will change significantly.

Three years of semantic publishing – RSC Project Prospect

What were they trying to improve?
  – Discoverability
  – Use
  – Understanding
  – Linking

And why...

• What chemistry on the web may become...
• Prolonged exposure to Peter Murray-Rust

Quick, what can we mark up? What standards did we have in 2007?

• InChI – for some compounds
• ChEBI for some compounds and groups of compounds
• Gene/Sequence/Cell Ontologies
• IUPAC Gold Book (dictionary, really, but online)

And RDF/OWL as distribution format

30-40% of RSC publishing

What did RSC learn with Prospect?

• This is probably the way to go – 4000 articles so far
• How do they cover all subjects?
  – Standards not well defined in all areas
• Scale up in manual QA
• Scale up during huge growth and scope of RSC publishing activities
• How to use all that real chemistry data?
• Pump prime to change what is asked from authors
• Is the vision the day-glo article? (“Free headache for every user”)

Ontology Add-in for Word 2007

• Phil Bourne
• Lynn Fink
• John Wilbanks

Source code and binary: http://research.microsoft.com/ontology/

Relationships: Ontology browser
Intent: Term recognition & disambiguation based on OBO or OWL formats
Services: Ontology download web service

Authoring: Chem4Word – Chemistry Drawing in Word

Author and edit 1D and 2D chemistry.

Intelligence: Verifies validity of authored chemistry
Intent: Recognizes chemical dictionary and ontology terms
Relationships: Navigate and link referenced chemistry
Data: Semantics stored in Chemistry Markup Language

Available soon: http://research.microsoft.com/chem4word/

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
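To illustrate what "semantics stored in Chemistry Markup Language" makes possible, here is a minimal Python sketch that reads the CML fragment from the Chem4Word slide and derives a molecular formula and bond summary from it. The atom and bond data are copied from the slide with the 2D coordinates omitted for brevity. The function names (`hill_formula`, `summarise`) and the choice of Hill ordering for the formula are assumptions made for this illustration; this is not how Chem4Word itself processes CML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace declared on the <cml> root element of the slide's fragment.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Same atoms and bonds as on the slide; x2/y2 coordinates left out for brevity.
CML_DOC = """<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def hill_formula(counts):
    """Format element counts in Hill order: C, then H, then the rest alphabetically."""
    symbols = sorted(counts)
    if "C" in counts:
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


def summarise(cml_text):
    """Return (formula, number of bonds, atom pairs joined by a double bond)."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    bonds = root.findall(".//cml:bond", CML_NS)
    counts = Counter(atom.get("elementType") for atom in atoms)
    double_bonds = [b.get("atomRefs2") for b in bonds if b.get("order") == "2"]
    return hill_formula(counts), len(bonds), double_bonds


formula, n_bonds, double_bonds = summarise(CML_DOC)
print(formula, n_bonds, double_bonds)  # C2H4O2 7 ['a2 a4']
```

The atom and bond lists correspond to acetic acid: a methyl carbon, a carboxyl carbon with one C=O double bond, and a hydroxyl group. Because the structure is stored as data rather than as a picture, a program can recover this without any image recognition.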

Standards = longevity

Help implement and develop standards
  – Open ontologies for chemistry
  – InChI Trust
  – How to publish this - pre-competition

Addressing a real need in standards

Pistoia Alliance

“An initiative to provide an open foundation of data standards, ontologies and web-services to streamline the Pharmaceutical Drug Discovery workflow”

Semantic Enrichment of the Scientific Literature (SESL), Oct 2009 to Oct 2010

• Pistoia Alliance-funded
• EBI
• Elsevier, NPG, OUP, RSC

How to use this information better to benefit existing researchers – computers and humans

• Real behaviour (for humans)
• Clear requirements (for computer discovery)

[Image credit: media.obsessable.com]

As few interfaces as possible

What do humans want?

What do computers want?

Web services

[Image credit: flickr.com/photos/microcosmos]

A free to access online database for chemists

Website and web services

Links over 25 million compounds integrated to <300 data sources

A curation platform for the public to improve the quality of data online

A deposition platform for the public to annotate and extend the data

ChemSpider – A Pragmatic Vision

“Build a Structure Centric Community”
  – Integrate chemical structure data on the web
  – Create a “structure-based hub” to information and data
  – Provide access to structure-based “algorithms”
  – Let chemists contribute their own data
  – Allow the community to curate/correct data
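The “structure-based hub” idea can be sketched as an index keyed by a structure identifier, so that records from unrelated sources meet at the same compound. The sketch below uses InChI strings as keys. The source names, record fields and the `build_hub` function are hypothetical and chosen only for illustration; this is not ChemSpider's actual data model or API.

```python
from collections import defaultdict

# Standard InChI strings used as structure keys.
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical records: (source name, structure identifier, data from that source).
RECORDS = [
    ("vendor_catalogue", ACETIC_ACID, {"name": "acetic acid"}),
    ("property_db", ACETIC_ACID, {"boiling_point_c": 118}),
    ("property_db", WATER, {"boiling_point_c": 100}),
]


def build_hub(records):
    """Group records by structure identifier, then by contributing source."""
    hub = defaultdict(dict)
    for source, inchi, data in records:
        hub[inchi].setdefault(source, {}).update(data)
    return dict(hub)


hub = build_hub(RECORDS)
# Every source that knows about acetic acid is now reachable from one key.
print(sorted(hub[ACETIC_ACID]))  # ['property_db', 'vendor_catalogue']
```

Keying on the structure rather than on a name or a document is what lets vendor listings, property data and publications for the same compound be found in a single lookup.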

Why did the RSC acquire ChemSpider?

• Data versus documents
• Enhancing discoverability
• Build on cheminformatics expertise
• RSC presence in the open data space
• Critical mass of data for structure searching
• Networking chemical scientists

Crowd-sourcing chemistry curation

Identify/tag errors, edit names, synonyms, identify records to deprecate

CAS SciFinder

Reaxys

Differences between ChemSpider, Reaxys and SciFinder

• Everything on Reaxys and SciFinder is curated
• The data resources can be over 100 years old
• The platforms are commercial and “read-only”

• ChemSpider is free, to everyone
• Data are in a state of ongoing curation & annotation
• Data resources are from the “electronic era”
• Data are expanded daily and enhanced on an ongoing basis
• The platform delivers integrated algorithm access

Future of chemistry online?

• Make the internet searchable by chemical structure and substructure by a free online service
• Aggregate and help improve disparate public sources
• Highlight high quality publications
• Test sharing and discussion of research data in the open
• Provide structural home to preserve researchers’ collections, experimental and property data

OreChem Project

• Participants
  – Cambridge University
  – Cornell University
  – Indiana University
  – Penn State University
• Funding
  – Microsoft Research
  – NSF

OreChem Project

• Data integration
  – Representation/reuse through common data models and ontologies
• Data capture and recovery
  – At source capture of experimental data and research process (ELNs)
  – Compound object authoring
  – Retrospective harvesting of chemistry data
• Data storage and manipulation
  – Cloud-based triple store
  – Chemical structure search
  – Linked data integration
  – Computation of properties

Chemistry is particularly challenging

• Commercial value of chemical information (e.g. Pharma industry)
• Nature of chemistry research culture
  – Predominance of synthesis (creation) overshadows discovery mode typical of physics or biology
  – Autonomy, successful research with limited reliance on others
• Dominance of scholarly societies as publishers
  – ACS (CAS)
  – RSC

Chemistry on the Internet – a future vision

• The “semantic web” for chemistry is in place
• Crowdsourcing is commonplace
• Chemists will search the web by “structure”
• Chemistry articles indexed and searchable
• Reduced number of searches to find data because data are integrated – compounds, vendors, syntheses, data, publications and patents
• A world of Open Access and Open Data

Linked Data on the Web

Acknowledgements

• Colin Groom, Gary Battle, CCDC
• Richard Kidd, RSC
• Tony Williams, RSC ChemSpider
• Carl Lagoze, OreChem

Any questions?