Big data challenges associated with building a national data repository for chemistry
DESCRIPTION
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. The challenge is significant: combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., dealing with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository. All this comes at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council (EPSRC) of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
TRANSCRIPT
![Page 1: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/1.jpg)
The Big Data Challenges Associated with Building a National Data Repository for Chemistry
Antony Williams
ICIC Meeting, Vienna
October 14th 2013
![Page 2: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/2.jpg)
So what is all this Big Data?
![Page 3: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/3.jpg)
![Page 4: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/4.jpg)
And the World of Chemistry?
![Page 5: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/5.jpg)
And the World of Chemistry?
“The InChIKey indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records.”
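The quote above rests on the InChIKey being a fixed-length, search-friendly token. As a minimal sketch (not part of the original slides), the layout of a standard InChIKey can be checked and its first block, which hashes the molecular connectivity, pulled out for use as a search term. The function names are illustrative assumptions; the check covers layout only and says nothing about whether a key corresponds to a real compound.

```python
import re

# Standard InChIKey layout: 14 letters (connectivity hash), then 10 letters
# (hash of the remaining layers plus flag and version characters), then a
# single protonation character, all separated by hyphens.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Check the layout only; this does not prove the key maps to a compound."""
    return INCHIKEY_RE.fullmatch(key) is not None


def connectivity_block(key: str) -> str:
    """Return the first 14-character block, which hashes the molecular skeleton."""
    if not is_wellformed_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return key.split("-")[0]


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey for aspirin
print(is_wellformed_inchikey(aspirin))
print(connectivity_block(aspirin))
```

Searching on the full key finds exact matches, while the first block alone also picks up records that share the same skeleton but differ in stereochemistry or isotopic labelling.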
![Page 6: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/6.jpg)
And the World of Chemistry?
![Page 7: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/7.jpg)
RSC’s ChemSpider
>29 million chemicals from >500 sources
![Page 8: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/8.jpg)
…and the world of Openness
![Page 9: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/9.jpg)
Times have changed…
Open Access funder mandates…
![Page 10: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/10.jpg)
Times have changed…
Growth, growth, growth…
![Page 11: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/11.jpg)
Publishers are responding
![Page 12: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/12.jpg)
The world of Open Data…
![Page 13: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/13.jpg)
Open Data are everywhere
• Are Openness and Social Sharing changing the world?
• The cultural experiments in Open Data and exchange are almost daily
• Mobile platforms enhance participation
• And then what of Chemistry Data???
![Page 14: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/14.jpg)
Publications: summary of work
• Scientific publications are a summary of work
  • Is all work reported?
  • How much science is lost to pruning?
  • What of value sits in notebooks and is lost?
  • Publications offering access to “real data”?
• How much data is lost?
  • How many compounds never reported?
  • How many syntheses fail or succeed?
  • How many characterization measurements?
![Page 15: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/15.jpg)
About Me…as a Chemist
• I’ve performed a few dozen chemical syntheses
• I’ve run thousands of analytical spectra
• I’ve generated thousands of NMR assignments
• I’ve probably published <5% of all work
• Most of it has been lost
• But things can be different today…
• But it still needs to be associated with me…
![Page 16: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/16.jpg)
What of non-abstracted data?
• How much data generated in a lab, that COULD go public, is lost forever?
![Page 17: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/17.jpg)
What of non-abstracted data?
• How much data generated in a lab, that COULD go public, is lost forever?
• Public Domain reference databases of value?
  • Syntheses
  • Properties
  • Spectra and CIFs
  • Images
  • Raw data vs. representations of data
![Page 18: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/18.jpg)
ChemSpider
• ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data
• Successful experiment in terms of building a central hub for integrated web search
• More people are “users” than “contributors”
• Yet basic feedback and game-play helps
![Page 19: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/19.jpg)
Crowdsourced “Annotations”
• Users can add
  • Descriptions, Syntheses and Commentaries
  • Links to PubMed articles
  • Links to articles via DOIs
  • Spectral data
  • Crystallographic Information Files
  • Photos
  • MP3 files
  • Videos
![Page 20: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/20.jpg)
An EPSRC Call
“…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”
![Page 21: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/21.jpg)
National Chemical Database Service
• Service for UK Academics
• “Prepaid access” integrating commercial databases and services
• Access to curated data sets
• Provision of prediction algorithms
![Page 22: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/22.jpg)
National Chemical Database Service
![Page 23: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/23.jpg)
National Chemical Database Service
• Service for UK Academics
• “Prepaid access” integrating commercial databases and services
• Access to curated data sets
• Provision of prediction algorithms
• Ultimate goal is to federate search
• Development of “data repository”
![Page 24: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/24.jpg)
Development of Data Repository
• Data repository should not just be a data dump – should not be a “big disk”
• Searchable, integrated, segregated repository of data types
• Data access including private, shared, embargoed and public
• Delivery of derived models from data
• Integrated with AltMetrics models
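The four access levels named on this slide can be sketched as a small data model with a read check. This is an illustration under stated assumptions, not the repository's actual design: the `Deposit` class, its field names, and the rule that an embargoed deposit becomes readable once its embargo date passes are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional, Set


class Access(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    EMBARGOED = "embargoed"
    PUBLIC = "public"


@dataclass
class Deposit:
    owner: str
    access: Access
    shared_with: Set[str] = field(default_factory=set)  # used for SHARED only
    embargo_until: Optional[date] = None                # used for EMBARGOED only


def can_read(deposit: Deposit, user: str, today: date) -> bool:
    """Decide whether `user` may read `deposit` on the given day."""
    if user == deposit.owner or deposit.access is Access.PUBLIC:
        return True
    if deposit.access is Access.SHARED:
        return user in deposit.shared_with
    if deposit.access is Access.EMBARGOED:
        # Assumption: the embargo lifts automatically once the date is reached.
        return deposit.embargo_until is not None and today >= deposit.embargo_until
    return False  # PRIVATE: owner only


spectrum = Deposit(owner="alice", access=Access.EMBARGOED,
                   embargo_until=date(2014, 1, 1))
print(can_read(spectrum, "bob", date(2013, 10, 14)))
print(can_read(spectrum, "bob", date(2014, 1, 1)))
```

Keeping the access level on the deposit itself, rather than on a folder or a user, is one way to let a single search index span private, shared, embargoed and public data while filtering results per reader.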
![Page 25: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/25.jpg)
What can drive participation?
• What can drive scientists to participate and contribute?
  • Ensuring provenance of their data for reuse
  • Mandates from funding agencies
  • Improved systems to ease contribution
  • Additional contributions to science
  • Improved publishing processes
  • Recognition for contributions
![Page 26: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/26.jpg)
AltMetrics
![Page 27: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/27.jpg)
AltMetrics
![Page 28: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/28.jpg)
AltMetrics as Scientist Impact
![Page 29: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/29.jpg)
AltMetrics
![Page 30: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/30.jpg)
Plum Analytics
![Page 31: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/31.jpg)
Plum Analytics
![Page 32: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/32.jpg)
Rewards and Recognition
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
![Page 33: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/33.jpg)
AltMetrics Feeds
• For our data repository, ensure that contributed data feed out to the AltMetrics platforms
• Every data point, every data download, use and reuse will be associated with the scientist
• Data will be DOI’ed (presently under review)
• Services provided will allow for AltMetrics use
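The idea that every download, use and reuse is credited to the depositing scientist can be sketched as a simple event log. Everything here is hypothetical: the `UsageLog` class, the event names, and the identifiers stand in for whatever DOIs and contributor IDs the real service would use.

```python
from collections import Counter
from typing import Dict


class UsageLog:
    """Attribute every usage event on a dataset to the scientist who deposited it."""

    def __init__(self) -> None:
        self._contributor_of: Dict[str, str] = {}
        self._events: Dict[str, Counter] = {}

    def register(self, dataset_id: str, contributor_id: str) -> None:
        """Record who deposited a dataset (a DOI and an ORCID would fit here)."""
        self._contributor_of[dataset_id] = contributor_id
        self._events.setdefault(contributor_id, Counter())

    def record(self, dataset_id: str, event: str) -> None:
        """Count one event, e.g. 'download' or 'reuse', against the depositor."""
        contributor = self._contributor_of[dataset_id]  # KeyError if unregistered
        self._events[contributor][event] += 1

    def summary(self, contributor_id: str) -> Dict[str, int]:
        """Per-scientist totals, in a shape an AltMetrics feed could consume."""
        return dict(self._events.get(contributor_id, Counter()))


log = UsageLog()
log.register("dataset-001", "scientist-A")  # placeholder identifiers
log.record("dataset-001", "download")
log.record("dataset-001", "download")
log.record("dataset-001", "reuse")
print(log.summary("scientist-A"))  # {'download': 2, 'reuse': 1}
```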
![Page 34: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/34.jpg)
Domain Specific Challenges
• Creating a platform of value not just dumping
  • Searchability, segregation, tagging, use and reuse, collaboration, low barrier to participation
• Quality of chemistry data at source
  • ensuring chemicals are correct
  • reactions map and balance as appropriate
  • file format handling for analytical data types – binary file formats are proprietary
  • valid interpretation of data
![Page 35: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/35.jpg)
Domain Specific Challenges
• Quality of data at source
  • ensuring chemicals are correct – VALIDATION
  • reactions map and balance as appropriate – VALIDATION and STANDARDIZATION
  • file format handling for analytical data types – binary file formats are proprietary – STANDARDIZATION
  • valid interpretation of data – VALIDATION and ANNOTATION
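One small piece of "reactions map and balance" can be illustrated by checking that atom counts agree on both sides of a reaction. This is a deliberately simplified sketch, not the validation routine described in the talk: it works on flat molecular formulas only (no parentheses, charges or isotopes) and does no atom mapping, which a structure-based check would need. The function names are illustrative.

```python
import re
from collections import Counter
from typing import Iterable

_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H4O2'."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Counter = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(reactants: Iterable[str], products: Iterable[str]) -> bool:
    """True when both sides of the reaction contain the same atoms."""
    left = sum((atom_counts(f) for f in reactants), Counter())
    right = sum((atom_counts(f) for f in products), Counter())
    return left == right


# Acetic acid + ethanol -> ethyl acetate + water
print(is_balanced(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"]))
# The same reaction with the water by-product omitted
print(is_balanced(["C2H4O2", "C2H6O"], ["C4H8O2"]))
```

A failed check does not always mean the record is wrong, since published reactions often leave out by-products, which is why the slide says "as appropriate".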
![Page 36: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/36.jpg)
Validating Chemicals
• Community service for validation and standardization of chemicals (CVSP)
• Open rule sets, but the standard set is based on the FDA Substance Registration System
![Page 37: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/37.jpg)
Validating chemicals
DB08128
J. Brecher, IUPAC, Graphical Representation of Stereochemical Configuration, Section: ST-1.1.10
DB06287
![Page 38: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/38.jpg)
Standardizing Chemicals
![Page 39: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/39.jpg)
Validated Name-Structure dictionaries for data checking
• Chemical name dictionaries used for:
  • Text-mining (publications, patents)
  • Linking to other databases – think Biology
    • Drug names are incredibly valuable links
  • Searching the web
    • Names link to structures
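How a validated name-structure dictionary supports text-mining can be sketched as a lookup from chemical names to structure identifiers. The three-entry dictionary and the function names below are illustrative assumptions; a production dictionary would hold curated names at far larger scale and would need to handle spelling variants and ambiguous names.

```python
import re
from typing import Dict, List, Tuple

# Tiny illustrative dictionary: lower-cased name -> InChIKey
NAME_TO_INCHIKEY: Dict[str, str] = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "caffeine": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
}


def build_matcher(names: Dict[str, str]) -> "re.Pattern[str]":
    """Compile one regex for all names, longest first so multi-word names win."""
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_chemicals(text: str, names: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (matched text, InChIKey) pairs in order of appearance."""
    matcher = build_matcher(names)
    return [(m.group(0), names[m.group(0).lower()]) for m in matcher.finditer(text)]


sentence = "Aspirin (acetylsalicylic acid) and caffeine were tested."
for name, key in tag_chemicals(sentence, NAME_TO_INCHIKEY):
    print(name, "->", key)
```

Because synonyms resolve to the same identifier, the two names for aspirin in the sentence link to one structure, which is what makes drug names useful as cross-database links.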
![Page 40: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/40.jpg)
Difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?
![Page 41: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/41.jpg)
Inside our Publication Archive
• How much data is in the archive, in the publications and in the supplementary info?
  • How many compounds for ChemSpider?
  • How many syntheses for ChemSpider reactions?
  • How many characterization measurements?
    • Property Data
    • Spectral Data
    • Graphs and charts to be used for modeling?
![Page 42: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/42.jpg)
What if we could capture it all?
Digitally Enhancing the RSC Archive
![Page 43: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/43.jpg)
Linking Names to Structures
![Page 44: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/44.jpg)
Semantic Mark-up of Articles
![Page 45: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/45.jpg)
Hosting Reactions
• Seed set of over 1 million reactions from patents to develop validation and standardization routines.
• Reactions to be extracted from RSC journal articles, ESI and reaction databases will be examined
• Resulting validation algorithms used at deposition
![Page 46: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/46.jpg)
The challenges of analytical data
• Integration of ChemSpider to analytical instrumentation vendors already in place
  • Agilent, Bruker, Thermo, Waters
• Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AnIML)
  • ChemSpider already hosts thousands of JCAMP spectra
• Support of “assigned spectra” in place
• Data validation approaches understood
• There are a myriad of analytical data types…
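Part of the appeal of an open format such as JCAMP-DX is that its metadata is plain text: each labelled record starts with `##`, followed by a label, `=`, and a value, and `$$` begins a comment. The sketch below reads only those header records from a made-up sample; it is an assumption-laden illustration, not a full JCAMP-DX reader, and it ignores the data table and compressed data forms entirely.

```python
from typing import Dict

# Made-up sample in JCAMP-DX style; values are illustrative only.
SAMPLE = """##TITLE=Example spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM $$ wavenumber
##YUNITS=TRANSMITTANCE
##NPOINTS=3
##XYDATA=(X++(Y..Y))
4000 0.91 0.90 0.88
##END=
"""


def read_jcamp_headers(text: str) -> Dict[str, str]:
    """Collect '##LABEL=value' records; data lines and comments are skipped."""
    headers: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].strip()  # drop trailing comments
        if not line.startswith("##"):
            continue  # a data line or a blank line
        label, sep, value = line[2:].partition("=")
        if sep:
            # Simplification: labels are only upper-cased, not fully normalised.
            headers[label.strip().upper()] = value.strip()
    return headers


headers = read_jcamp_headers(SAMPLE)
print(headers["TITLE"], "|", headers["DATA TYPE"], "|", headers["XUNITS"])
```

Header fields like these are what a repository can index and validate at deposition, which is not possible with a proprietary binary file unless the vendor supplies a converter.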
![Page 47: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/47.jpg)
Turning “Figures” Into Data
![Page 48: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/48.jpg)
Community Data Repository
• Automated depositions of data – service-based deposition, sweep and deposit
• Integrate to Electronic Lab Notebooks as feeds
• High value would be databases of reference data, but validated by model validation and the community
• National services feeding the repository – crystallography, mass spectrometry
![Page 49: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/49.jpg)
E-Lab Notebooks
• Integration between ELNs and:
  • ChemSpider
  • ChemSpider Reactions
  • Chemistry Data Repository
![Page 50: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/50.jpg)
What do we have in place?
• We are testing a data repository on our assets – ChemSpider and our archive of publications
• Working with many collaborators to define needs
• Deposition system for deposition of chemical compounds – hosts >29 million chemicals
• Crowdsourcing curation & annotation platform
• Chemical validation & standardization platform
• Chemical reactions database with >1 million reactions and presently developing RVSP
• Analytical data handling formats (JCAMP preferred)
• And lots in development…
![Page 51: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/51.jpg)
The Challenges Ahead
• Chemistry is NOT just nicely defined structures!
  • Materials, minerals, attached to beads, polymers, ambiguous materials
• Domain-specific measurements
  • File format standards are limited in application
• Encouraging scientists to free up their data
  • AltMetrics, open data mandates, systems
• The data explosion continues
  • 4 years ahead to expand capability
![Page 52: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/52.jpg)
The Future
Internet Data
• Commercial Software
• Pre-competitive Data
• Open Science
• Open Data
• Publishers
• Educators
• Open Databases
• Chemical Vendors
• Small organic molecules
• Undefined materials
• Organometallics
• Nanomaterials
• Polymers
• Minerals
• Particle bound
• Links to Biologicals
![Page 53: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/53.jpg)
RSC Open Access Repository
• Imagine applying text-mining to all articles
• Extract all chemicals, syntheses, chemistry data and link to OA articles
• Provide additional data handling tools
![Page 54: Big data challenges associated with building a national data repository for chemistry](https://reader038.vdocuments.us/reader038/viewer/2022110306/554e7d1db4c905f66a8b5271/html5/thumbnails/54.jpg)
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams