Digitally enabling the RSC archive
![Page 1: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/1.jpg)
Data Enhancing the RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey Pshenichnov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams
ACS New Orleans April 2013
![Page 2: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/2.jpg)
Overview
• The big picture
• Where we’ve been
• Statistics as well as semantics
• New directions in experimental data
• Where we’re going
![Page 3: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/3.jpg)
The big picture
We have journal articles going back to 1841 and the aim is to extract:
• Every small molecule we can (graphics and text)
• Reactions
• Spectra
• Data in tables
and classify every paper in a way that makes sense to the reader.
![Page 4: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/4.jpg)
Background
• RSC Publishing moved to an all-XML workflow at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
![Page 5: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/5.jpg)
RSC Advances
New high-volume journal covering all of chemistry launched in 2011.
Need a sensible way of navigating all this.
http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
![Page 6: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/6.jpg)
Strategy
• Use topic modelling: latent Dirichlet allocation (LDA) and Gibbs sampling to determine a set of “true” topics
Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.
• Publishing expertise gives us 12 broad subjects that will be intuitive to users
• Merge the modelled topics into the 12 broad subjects
• Tweak
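The topic-modelling step of this strategy can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA, in the spirit of the Griffiths and Steyvers paper cited above. This is not the code used for the RSC classification: the function name `lda_gibbs`, the hyperparameter values and the toy documents are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (toy scale)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word and per-topic totals.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * V for _ in range(n_topics)]
    n_k = [0] * n_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Three highest-count words per topic, for eyeballing (cf. the Wordle images).
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -n_kw[k][v])[:3]]
        for k in range(n_topics)
    ]
    return n_dk, top_words

# Invented toy corpus, not RSC data.
docs = [
    "ligand complex metal ligand complex".split(),
    "metal complex ligand coordination".split(),
    "cell protein bacteria cell".split(),
    "bacteria protein cell growth".split(),
]
doc_topic, top_words = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print(k, words)
```

At the scale described in these slides (128 topics over more than 20000 papers) an optimised library would be the practical choice; the sketch only shows what each sampling step does.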
![Page 7: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/7.jpg)
Classify that classification
Generated 128 topics based on 2009 and 2010’s articles (> 20000 papers).
Generated Wordle images (www.wordle.net) of the topics for internal staff.
![Page 8: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/8.jpg)
![Page 9: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/9.jpg)
Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings and given names.
Examples…
![Page 10: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/10.jpg)
Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Bio
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
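One way to picture the merge from topics to subject headings is to sum a paper's topic weights under each heading. The sketch below reuses a few of the mappings listed above; the scoring rule, the `threshold` value and the example paper are assumptions, since the slides do not say how papers were actually assigned.

```python
from collections import defaultdict

# Topic number -> subject headings, copied from some of the examples above.
TOPIC_SUBJECTS = {
    1: ["Physical"],
    2: ["Inorganic"],
    5: ["Biological", "Food and health"],
    10: ["Physical", "Inorganic", "Materials"],
    13: ["Inorganic", "Materials", "Energy"],
}

def subject_scores(topic_weights, topic_subjects=TOPIC_SUBJECTS):
    """Sum each topic's weight under every heading that topic maps to."""
    scores = defaultdict(float)
    for topic, weight in topic_weights.items():
        # Rejected topics (such as 127) have no mapping and contribute nothing.
        for subject in topic_subjects.get(topic, []):
            scores[subject] += weight
    return dict(scores)

def assign_subjects(topic_weights, threshold=0.25):
    """Keep the headings whose summed weight reaches a hypothetical threshold."""
    scores = subject_scores(topic_weights)
    return sorted(s for s, w in scores.items() if w >= threshold)

# Invented topic distribution for a single paper.
paper = {2: 0.5, 13: 0.3, 127: 0.2}
print(subject_scores(paper))
print(assign_subjects(paper))
```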
![Page 11: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/11.jpg)
“Very useful!”
“… will make it easier for readers to identify papers which might be interesting to them.”
“Superb!”
![Page 12: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/12.jpg)
What now?
Shortly rolling out the subject classification to other general journals:
• Chemical Communications
• Chemical Science
• Journal of Materials Chemistry A, B and C
• New Journal of Chemistry
![Page 13: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/13.jpg)
Beyond Prospect: further steps in text-mining
• Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
• Multiple name to structure engines
– OPSIN, ACD/Labs, Lexichem
• ACD/Labs Dictionary
• Better disambiguation
• Parallelization with Hadoop
• Structure validation and standardization (see later)
• Reaction extraction from text (see later)
![Page 14: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/14.jpg)
On an experimental run with names from Organic and Biomolecular Chemistry
Is any structure returned at all by a given name-to-structure (n2s) engine?
• Lexichem = a (2798)
• ACD = b (3049)
• OPSIN = c (3309)
![Page 15: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/15.jpg)
Structure disagreements
Out of 2588 names where at least one of the engines differed or didn’t return a result:
• A = ACD (1538 in total)
• B = Lexichem (1301 in total)
• C = OPSIN (2097 in total)
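The two comparisons above (does an engine return anything, and do the engines agree) can be sketched as follows. The results table is invented placeholder data, and the structures are assumed to have been reduced to a comparable canonical identifier per engine; the engines themselves are not called here.

```python
from collections import Counter

ENGINES = ("OPSIN", "ACD", "Lexichem")

def coverage(results):
    """Count, per engine, the names for which any structure was returned."""
    return Counter(
        engine
        for per_engine in results.values()
        for engine in ENGINES
        if per_engine.get(engine) is not None
    )

def needs_review(per_engine):
    """True if an engine returned nothing or the returned structures differ."""
    structures = [per_engine.get(engine) for engine in ENGINES]
    return None in structures or len(set(structures)) > 1

def disagreements(results):
    """Names where at least one engine differed or did not return a result."""
    return [name for name, per_engine in results.items() if needs_review(per_engine)]

# Placeholder identifiers standing in for canonicalised structures.
results = {
    "name-1": {"OPSIN": "S1", "ACD": "S1", "Lexichem": "S1"},
    "name-2": {"OPSIN": "S2", "ACD": "S2-variant", "Lexichem": None},
    "name-3": {"OPSIN": "S3", "ACD": None, "Lexichem": None},
    "name-4": {"OPSIN": None, "ACD": None, "Lexichem": None},
}
print(coverage(results))
print(disagreements(results))
```

Comparing structures as strings only makes sense after standardization, which is one reason the validation and standardization step matters for this kind of comparison.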
![Page 16: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/16.jpg)
Iterations
With the Hadoop cluster, we can mine thousands of articles a night.
We’re initially iterating over the material back to 2000, for which we have native XML. Then it’s a case of going back and testing out the OCRed material.
![Page 17: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/17.jpg)
http://cv.beta.rsc-us.org/
This is the beta site for:
• Extracting chemical structures from ChemDraw files
• Most importantly: structure validation and standardization
We will be using this for all of the extracted structures.
![Page 18: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/18.jpg)
![Page 19: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/19.jpg)
![Page 20: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/20.jpg)
Reaction extraction from text
We have had some preliminary experience of this through the ChemicalTagger work of Daniel Lowe (NextMove, formerly Cambridge).
To go to ChemSpider Reactions:
http://csr.dev.rsc-us.org/
![Page 21: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/21.jpg)
Experimental data
We’ve already seen the possibilities for extracting data from organic experimental sections, but what about other sorts of data?
Given chemical structures and extracted data we may be able to start building models and making them available.
![Page 22: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/22.jpg)
New directions in experimental data (1)
We are working with William Brouwer (Penn State) to extract data from graphs.
Obviously this is faute de mieux and we’d rather have the original data, but we’re giving a flavour of what might be possible.
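To give a feel for one small piece of graph digitization, the sketch below converts traced pixel coordinates into data values using two known tick marks per axis. It assumes linear axes and that the curve pixels have already been located; it does not describe the method used in the collaboration mentioned above, and the tick positions and axis ranges are invented.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Linear map from a pixel coordinate to a data value, given two known ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

def digitize(curve_pixels, x_map, y_map):
    """Convert (x, y) pixel positions on a traced curve into data coordinates."""
    return [(x_map(px), y_map(py)) for px, py in curve_pixels]

# Hypothetical calibration for an IR-style plot: the wavenumber axis runs
# from 4000 down to 400, and image y coordinates grow downwards.
x_map = make_axis_map(100, 4000.0, 900, 400.0)
y_map = make_axis_map(500, 0.0, 100, 100.0)

points = digitize([(100, 100), (500, 300), (900, 500)], x_map, y_map)
print(points)
```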
![Page 23: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/23.jpg)
Recent Work
![Page 24: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/24.jpg)
Digitized Spectrum
![Page 25: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/25.jpg)
Comparison of Spectra
![Page 26: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/26.jpg)
And now on ChemSpider…
![Page 27: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/27.jpg)
![Page 28: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/28.jpg)
New directions in experimental data (2)
Dye solar cell data is every bit as systematic as organic experimental sections.
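Because these results tend to be reported in a regular form, even simple pattern matching can pull out the usual device parameters. The sketch below is an assumption-laden illustration: the sentence is invented, the patterns cover only one phrasing with Voc in volts, and the consistency check assumes the standard illumination of 100 mW cm-2.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"
PATTERNS = {
    "jsc_mA_cm2": re.compile(r"\bJsc\s*(?:=|of)\s*" + NUMBER),
    "voc_V": re.compile(r"\bVoc\s*(?:=|of)\s*" + NUMBER),
    "ff": re.compile(r"\bFF\s*(?:=|of)\s*" + NUMBER),
    "eta_percent": re.compile(r"(?:η|PCE)\s*(?:=|of)\s*" + NUMBER),
}

def extract_cell_parameters(sentence):
    """Return whichever of Jsc, Voc, FF and efficiency appear in the sentence."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            found[key] = float(match.group(1))
    return found

def efficiency_from_parameters(p, p_in_mW_cm2=100.0):
    """Efficiency in percent from Jsc (mA cm-2), Voc (V) and FF."""
    return 100.0 * p["jsc_mA_cm2"] * p["voc_V"] * p["ff"] / p_in_mW_cm2

# Invented sentence in the style of a device-performance report.
sentence = "The device gave Jsc = 15.2 mA cm-2, Voc = 0.72 V, FF = 0.68 and η = 7.4%."
params = extract_cell_parameters(sentence)
print(params)
print(round(efficiency_from_parameters(params), 2))
```

Recomputing the efficiency from the other three values gives a cheap sanity check on extracted numbers before they reach human curation.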
![Page 29: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/29.jpg)
Human curation of results
Previously: built into a partly manual annotation workflow.
Currently: macro-scale, iterative.
Coming: Challenger
![Page 30: Digitally enabling the RSC archive](https://reader033.vdocuments.us/reader033/viewer/2022060204/559f75b61a28abeb718b4826/html5/thumbnails/30.jpg)
DERA
• DERA (Data Enhancing the RSC Archive) will unveil from our archive:
– Chemicals
– Reactions
– Figures
– Spectra/Analytical Data
– Property Data
– And yes… it will need curation and filtering!