Approaches for extraction and digital chromatography of chemical data
TRANSCRIPT
![Page 1: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/1.jpg)
Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC
![Page 2: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/2.jpg)
Overview
• Introduction
– What data can we consider?
– What are the challenges?
– What data and sources does the RSC have?
– Experimental Data Checker
• Case Studies:
– Project Prospect
– Chair forms of Sugars/cyclohexanes
![Page 3: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/3.jpg)
Traditional Chromatography
Images taken from: http://www.sciencemadness.org/talk/viewthread.php?tid=3960&page=3
http://en.wikipedia.org/wiki/Column_chromatography
![Page 4: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/4.jpg)
Why Digital Chromatography?
• Useable information is mixed in with description and analysis
– Makes it difficult to find
• Despite our best efforts – still lots of ambiguous or plain wrong/unusable chemical information
• Why?
– Human error
– Processing errors
– Incorrect usage of data generation/extraction
– Style over meaning
– Data not generated with reuse in mind
– Data generated for humans
![Page 5: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/5.jpg)
Style/Layout Vs Meaning
• Structures drawn to illustrate more than just the identity
• Data not generated with reuse in mind
• Author practices
• Mixed 2D and perspective representations
• Unintentional definition of stereochemistry
![Page 6: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/6.jpg)
Data generated for humans
• Separated/orphaned information, including Markush structures and information passed by reference
![Page 7: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/7.jpg)
![Page 8: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/8.jpg)
What chemical data can we consider?
• Chemistry is especially challenging: it involves a wide range of data types
– Numeric data
– Names
– Structures
– Terminology
• Over a hugely different set of topics: organic, inorganic, physical
– Meanings/interpretations are not perfectly aligned
• Application of standards can be challenging
• Drawing conventions are documented but not used
![Page 9: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/9.jpg)
What chemical data and sources does the RSC have?
![Page 10: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/10.jpg)
A beginning: helping chemists review their own work
Amphidinoketide I
To a solution of…….…. Amphidinoketide I was isolated as a …….. [α]D25 −17.6 (c 0.085, CH2Cl2); Rf = 0.61 (1:1 hexane:ethyl acetate); νmax(CHCl3)/cm−1 1707.2 (CO), 1686.9 (CO), 1632.4 (CO), 1618.9 (CC), 1458.1; 1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC), 5.82 (1H, ddt, J = 16.9, 10.2, 6.7 Hz, 19-CHCH2), 4.99 (1H, m (17.1 Hz), 20-CHA), 4.92 (1H, m (10.2 Hz), 20-CHB), 3.05 (1H, dd, J = 17.9, 9.3 Hz, 8-CHA), 3.00–2.90 (3H, m, 9-CHCH3, 11-CHA, 12-CHCH3), 2.72–2.64 (2H, m, 5-CHA, 6-CHA), 2.62–2.55 (2H, m, 5-CHB, 6-CHB), 2.51–2.45 (3H, m, 8-CHB, 11-CHB, 14-CHA), 2.33 (1H, dd, J = 16.9, 7.4 Hz, 14-CHB), 2.09 (3H, s, 21-CH3), 2.05–1.99 (2H, m, 18-CH2), 1.99–1.96 (1H, m, 15-CHCH3), 1.88 (3H, s, 1-CH3), 1.39–1.25 (3H, 17-CH2, 16-CHA), 1.14–1.10 (1H, m, 16-CHB), 1.07 (3H, d, J = 7.0 Hz, 22-CH3), 1.05 (3H, d, J = 7.2 Hz, 23-CH3), 0.87 (3H, d, J = 6.7 Hz, 24-CH3); 13C NMR (CD2Cl2, 125 MHz) δC 213.15 (13-CO), 212.08 (10-CO), 208.40 (7-CO), 198.76 (4-CO), 155.40 (2-CCH), 138.41 (19-CHCH2), 123.54 (3-CHC), 114.19 (20-CH2C), 48.81 (14-CH2), 45.93 (11-CH2), 44.50 (8-CH2), 41.43 (9-CHCH3), 41.01 (12-CHCH3), 37.74 (5-CH2), 36.55 (16-CH2), 36.27 (6-CH2), 34.18 (18-CH2), 28.74 (15-CHCH3), 27.57 (1-CH3), 26.60 (17-CH2), 20.63 (21-CH3), 19.77 (24-CH3), 16.65 (22 or 23-CH3), 16.62 (22 or 23-CH3); HRMS (ESI) Calculated for C24H38O4 413.2668, found 413.26600 (MNa+). (9R, 12R, 15S)-1 had [α]D25 +11 (c 0.245, CH2Cl2).
• http://www.rsc.org/is/journals/checker/run.htm
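As an illustration of the kind of self-consistency check that experimental text like the Amphidinoketide I paragraph allows, here is a minimal Python sketch. It is not the RSC Experimental Data Checker and makes no claim about how that tool works. The function names are hypothetical, the regular expressions only cover the notation used in this one example, and the electron mass is ignored when computing the sodium adduct.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "Na": 22.98976928}

# 1H NMR portion of the experimental text, copied from the slide.
H_NMR = (
    "1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC), "
    "5.82 (1H, ddt, J = 16.9, 10.2, 6.7 Hz, 19-CHCH2), "
    "4.99 (1H, m (17.1 Hz), 20-CHA), 4.92 (1H, m (10.2 Hz), 20-CHB), "
    "3.05 (1H, dd, J = 17.9, 9.3 Hz, 8-CHA), "
    "3.00–2.90 (3H, m, 9-CHCH3, 11-CHA, 12-CHCH3), "
    "2.72–2.64 (2H, m, 5-CHA, 6-CHA), 2.62–2.55 (2H, m, 5-CHB, 6-CHB), "
    "2.51–2.45 (3H, m, 8-CHB, 11-CHB, 14-CHA), "
    "2.33 (1H, dd, J = 16.9, 7.4 Hz, 14-CHB), 2.09 (3H, s, 21-CH3), "
    "2.05–1.99 (2H, m, 18-CH2), 1.99–1.96 (1H, m, 15-CHCH3), "
    "1.88 (3H, s, 1-CH3), 1.39–1.25 (3H, 17-CH2, 16-CHA), "
    "1.14–1.10 (1H, m, 16-CHB), 1.07 (3H, d, J = 7.0 Hz, 22-CH3), "
    "1.05 (3H, d, J = 7.2 Hz, 23-CH3), 0.87 (3H, d, J = 6.7 Hz, 24-CH3)"
)


def parse_formula(formula):
    """Turn a simple formula such as 'C24H38O4' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(counts):
    """Sum the monoisotopic masses for the given element counts."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


def ppm_error(found, calculated):
    """Relative difference between found and calculated m/z, in ppm."""
    return (found - calculated) / calculated * 1e6


def proton_count(nmr_text):
    """Add up the integrals written as '(nH,' in a 1H NMR peak list."""
    return sum(int(n) for n in re.findall(r"\((\d+)H,", nmr_text))


formula = parse_formula("C24H38O4")
# [M+Na]+ without the electron-mass correction (an assumption of this sketch).
sodium_adduct = monoisotopic_mass(formula) + MONOISOTOPIC["Na"]

print("1H integrals:", proton_count(H_NMR), "formula H:", formula["H"])
print("calculated [M+Na]+: %.4f" % sodium_adduct)
print("HRMS error: %.1f ppm" % ppm_error(413.26600, 413.2668))
```

Checks like these only flag internal inconsistencies (integrals that do not add up to the formula, a calculated mass that does not match the stated formula and adduct); they cannot confirm that the assignment itself is right.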
![Page 11: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/11.jpg)
Case study 1: Project Prospect
![Page 12: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/12.jpg)
What is Prospect?
[Diagram: Prospect architecture, drawn as four layers (input layer, tool layer, information layer, output layer). Components shown: RSC XML, author CDX files, OSCAR, ontologies, InChI–name pairs (from ChemSpider), enhanced RSC XML, Prospect database, enhanced HTML, RSS, InChI–name pairs (in ChemSpider), better ontologies, visible output.]
![Page 13: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/13.jpg)
People and machines
People
Can understand narratives.
Can interpret pictures.
Can reason about three-dimensional objects.
Can do a high-quality job.
Machines
Can’t understand narratives.
Can’t interpret pictures.
Not able to infer 3D structure from 2D without cues.
Can do a lower-quality, but still useful job.
![Page 14: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/14.jpg)
Case study 2: The chair representation issue
InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2
WQZGKKKJIJFFOK-UHFFFAOYSA-N
• 5 stereocentres = 2^5 = 32 possible stereoisomers (structures)
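The arithmetic on this slide can be made concrete with a short Python sketch. It splits the InChI into its layers, checks that none of the stereo layers (those starting with b, t, m or s) is present, and enumerates every R/S assignment for five centres. The function name is hypothetical, the number of stereocentres is taken from the slide rather than derived from the InChI, and the count assumes every combination is a distinct structure.

```python
from itertools import product

INCHI = "InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"


def has_stereo_layers(inchi):
    """Return True if the InChI carries any /b, /t, /m or /s layer."""
    layers = inchi.split("/")
    # layers[0] is the version prefix and layers[1] the formula; skip both.
    return any(layer[:1] in ("b", "t", "m", "s") for layer in layers[2:])


# Every way of assigning R or S to the five stereocentres named on the slide.
assignments = list(product("RS", repeat=5))

print("stereo layers present:", has_stereo_layers(INCHI))
print("possible assignments:", len(assignments))
```

Because the identifier carries no stereo layer, it cannot distinguish between these assignments: a structure extracted without reliable stereochemistry matches all of them at once.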
![Page 15: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/15.jpg)
Case study 2: Chair forms of hexacycles: what could go wrong?
![Page 16: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/16.jpg)
How do we “fix” chair representations?
How we normalize them:
1. Identify 6-membered rings (Indigo)
2. Identify what sort of ring it is
3. Map atoms onto a standard structure (e.g. beta-D-glucopyranose)
4. Tidy
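Steps 1 and 2 of the normalization could be sketched as follows. This is a plain-Python stand-in written for illustration; the slides say Indigo is used for ring identification, and this sketch does not use or reproduce Indigo's API. The function names, the ring categories and the atom numbering of the example graph are all assumptions.

```python
def build_adjacency(bonds):
    """Build an undirected adjacency map from (atom, atom) index pairs."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def six_membered_rings(adjacency):
    """Step 1: return the atom sets of all simple cycles of length six."""
    rings = set()

    def walk(start, current, path):
        if len(path) == 6:
            if start in adjacency[current]:
                # frozenset merges the two directions of the same ring.
                rings.add(frozenset(path))
            return
        for neighbour in adjacency[current]:
            # Only visit atoms with a higher index than the start atom, so
            # each ring is found from its lowest-numbered atom only.
            if neighbour > start and neighbour not in path:
                walk(start, neighbour, path + [neighbour])

    for atom in adjacency:
        walk(atom, atom, [atom])
    return rings


def classify_ring(ring, elements):
    """Step 2: a crude guess at the ring type from its element makeup."""
    symbols = sorted(elements[atom] for atom in ring)
    if symbols == ["C"] * 6:
        return "cyclohexane-like"
    if symbols == ["C"] * 5 + ["O"]:
        return "pyranose-like"
    return "other"


# Heavy-atom graph of a hexopyranose (hypothetical numbering):
# 0-4 are C1-C5, 5 is the ring oxygen, 6 is C6, 7-11 are hydroxyl oxygens.
ELEMENTS = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "O",
            6: "C", 7: "O", 8: "O", 9: "O", 10: "O", 11: "O"}
BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (0, 7), (1, 8), (2, 9), (3, 10), (6, 11)]

rings = six_membered_rings(build_adjacency(BONDS))
for ring in rings:
    print(sorted(ring), classify_ring(ring, ELEMENTS))
```

Steps 3 and 4 (mapping the ring atoms onto a standard structure such as beta-D-glucopyranose and tidying the layout) depend on drawing geometry and are not attempted here.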
![Page 17: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/17.jpg)
The future: “The digester”
• Ability to:
– Reconnect R-groups
– Expand abbreviations
– Expand brackets
– Link structures with reference IDs
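To give a feel for the first two abilities, here is a toy Python sketch that expands common group abbreviations into SMILES fragments and reconnects R-groups into a scaffold. It is not the digester: the slides give no implementation details, so the lookup table, the `{R1}` placeholder syntax and the function names are assumptions made for this illustration.

```python
# Common abbreviations mapped to SMILES fragments, written so that the
# first atom of the fragment is the attachment point.
ABBREVIATIONS = {
    "Me": "C",
    "Et": "CC",
    "Ph": "c1ccccc1",
    "Ac": "C(C)=O",
    "Bn": "Cc1ccccc1",
}


def expand_abbreviation(label):
    """Return the SMILES fragment for an abbreviation, or raise KeyError."""
    return ABBREVIATIONS[label]


def reconnect_r_groups(scaffold, r_groups):
    """Fill {R1}-style placeholders in a scaffold with expanded fragments.

    Plain string substitution: it does not renumber ring closures, so two
    ring-containing fragments in one scaffold could clash.
    """
    fragments = {name: expand_abbreviation(label)
                 for name, label in r_groups.items()}
    return scaffold.format(**fragments)


# A ketone scaffold with two R-groups defined elsewhere in the article.
scaffold = "O=C({R1}){R2}"
smiles = reconnect_r_groups(scaffold, {"R1": "Me", "R2": "Ph"})
print(smiles)
```

A real implementation would work on connection tables rather than strings, so that attachment points, ring closures and stereochemistry survive the substitution.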
![Page 18: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/18.jpg)
Other examples that we didn’t mention in case studies
• CIF data importer
• Structure Validation and Standardisation
– (Thurs Aug 23, 9:15 am, Marriott Downtown, Franklin Hall 6)
• Work on creation of ontologies, RXNO, CMO
– Also collaborating on: ChEBI ontology, GO, SO
• Collaboration with Utopia to enable Prospect mark-up of PDFs
![Page 19: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/19.jpg)
Summary
• Many data sharing practices are based on:
– Traditional print articles
– Consumption of data by humans only
• This poses issues for publishers and users alike
• The RSC is developing innovative solutions to address some of these problems
– Chemical structures are challenging
– Limitations to what machine methods can achieve
– Need to educate authors to think differently
![Page 20: Approaches for extraction and digital chromatography of chemical data](https://reader034.vdocuments.us/reader034/viewer/2022051414/55a3ecea1a28abd7378b481e/html5/thumbnails/20.jpg)
Acknowledgements
• Colin Batchelor - Development and Technical work
• Jeff White & Aileen Day
• Richard Kidd, Graham McCann and Will Russell
• RSC ICT staff