SERMACS 2012
Chemical Validation and Standardization Platform (CVSP) from Royal Society of Chemistry
SERMACS 2012
Chemical Validation and Standardization Platform (CVSP)
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
A quality chemical record
Correct file format (SDF, MOL, CDX, etc.)
“Valid” chemical structure
  Valid atoms (not query atoms)
  Valid bonds
  Valid valences
  Valid charges
  Stereo (defined/unknown/undefined)
Synonyms
  Names (name to structure)
  SMILES, InChIs (SMILES/InChI to structure)
XRefs
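The "valid atoms" and "valid bonds" checks above can be illustrated with a minimal sketch that scans a V2000 MOL block for query atom symbols and query bond types. This is not CVSP code: the function name, the symbol set, the decision to treat only bond types 5 to 8 as query bonds, and the sample record are all assumptions made for illustration.

```python
# Assumed sets: V2000 query atom symbols and query-only bond types.
QUERY_ATOM_SYMBOLS = {"A", "Q", "*", "L", "R#"}
QUERY_BOND_TYPES = {5, 6, 7, 8}  # aromatic (4) is left alone in this sketch


def find_query_features(molblock: str) -> list[str]:
    """Return messages for query atoms and query bonds in a V2000 MOL block."""
    lines = molblock.splitlines()
    counts = lines[3]  # fourth line holds the atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    issues = []
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()  # atom symbol sits in columns 32-34
        if symbol in QUERY_ATOM_SYMBOLS:
            issues.append(f"atom {i}: query symbol '{symbol}'")
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, btype = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if btype in QUERY_BOND_TYPES:
            issues.append(f"bond {a}-{b}: query bond type {btype}")
    return issues


def atom_line(symbol: str, x: float = 0.0) -> str:
    """Build a fixed-width V2000 atom line (helper for the sample only)."""
    return f"{x:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"


def bond_line(a: int, b: int, btype: int) -> str:
    """Build a fixed-width V2000 bond line (helper for the sample only)."""
    return f"{a:3d}{b:3d}{btype:3d}  0"


# Hypothetical record: a two-carbon fragment ending in a '*' atom.
SAMPLE = "\n".join([
    "hypothetical polymer fragment",
    "  sketch",
    "",
    f"{3:3d}{2:3d}  0  0  0  0  0  0  0  0999 V2000",
    atom_line("C"),
    atom_line("C", x=1.5),
    atom_line("*", x=3.0),
    bond_line(1, 2, 1),
    bond_line(2, 3, 8),
    "M  END",
])

for message in find_query_features(SAMPLE):
    print(message)
```

A production validator would rely on a cheminformatics toolkit rather than column slicing; the point here is only that such checks are cheap and can run before any deeper analysis.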
“Quality-conscious” databases
Not just for storage but shares responsibility for the data
Understands that bad data will eventually propagate to others
Uses deposition system with strong validation rules
Warns users when conflicts are detected and when records are ambiguous
Stops user from depositing when critical chemical errors are suspected
Helps depositors to “self-curate” their own data
A Vision of a Validation System
Validation rules are transparent to community
Validation rules can be revised by users for their particular purpose
Free to the community
Accessible in batch mode by Web GUI and Web service
Can be integrated into any chemical deposition system
Rules are based on SMARTS/SMILES for finding offending patterns in molecules
Non-standard InChI is used for determining unknown/undefined stereo descriptors
Standardization
Who is concerned with standardization?
  Database maintainers
  Chemists that prepare data for property calculations
  Scientists investigating structure-activity relationships
They all can have different business rules in mind
CVSP is an open and flexible platform for different flavors of standardization
Rules are based on SMIRKS transformations
CVSP: select rule set
CVSP: Acid-Base competitive ionization
CVSP: define rule variables
CVSP: result of processing
DrugBank dataset (6516 records)
Marked as Errors (arbitrary)
  2 records with query bonds
  3 records with invalid atoms (asterisk in polymers)
  Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)
Warnings
  InChI not matching structure (100+)
  SMILES not matching structure (100+)
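The unusual-valence errors listed above (oxygen 3, sulfur 3 and 5, Mg 4, B 5) are the kind of problem a simple lookup can catch. The sketch below sums bond orders per atom and compares the total with a table of accepted valences. The table, the `Atom` class and the decision to skip charged atoms are assumptions for illustration, not the rules CVSP applies.

```python
from dataclasses import dataclass

# Assumed table of accepted total valences for neutral atoms.
ALLOWED_VALENCES = {
    "H": {1}, "B": {3}, "C": {4}, "N": {3}, "O": {2},
    "Mg": {2}, "S": {2, 4, 6},
}


@dataclass
class Atom:
    element: str
    charge: int = 0
    hydrogens: int = 0  # attached hydrogens that are not drawn as atoms


def unusual_valences(atoms, bonds):
    """Return (index, element, valence) for neutral atoms outside the table.

    bonds is a list of (i, j, order) tuples using 0-based atom indices.
    """
    totals = [atom.hydrogens for atom in atoms]
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    issues = []
    for idx, (atom, total) in enumerate(zip(atoms, totals)):
        allowed = ALLOWED_VALENCES.get(atom.element)
        if atom.charge != 0 or allowed is None:
            continue  # charged or unlisted elements are out of scope here
        if total not in allowed:
            issues.append((idx, atom.element, total))
    return issues


# Hypothetical record: trimethyloxonium drawn without its positive charge.
atoms = [Atom("O")] + [Atom("C", hydrogens=3) for _ in range(3)]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1)]
print(unusual_valences(atoms, bonds))
```

Adding the missing charge to the oxygen (`Atom("O", charge=1)`) takes the atom out of scope for this sketch; a fuller rule set would carry a separate table for charged atoms.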
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
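One inexpensive way to spot a deposited InChI that does not match its structure is to compare the InChI formula layer with the element counts of the structure. The sketch below parses the formula layer of the InChI shown for DB00755 and reports any differing elements. The function names and the structure tally are hypothetical, the parser handles single-component formulas only, and it says nothing about the stereo layers, where a toolkit-generated InChI would be needed for comparison.

```python
import re

INCHI = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-"
    "17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)


def inchi_formula_counts(inchi: str) -> dict[str, int]:
    """Parse the formula layer of a single-component InChI into element counts."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    counts: dict[str, int] = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", parts[1]):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def formula_mismatch(inchi: str, structure_counts: dict[str, int]) -> dict:
    """Map each differing element to (count in InChI, count in structure)."""
    expected = inchi_formula_counts(inchi)
    elements = sorted(set(expected) | set(structure_counts))
    return {
        el: (expected.get(el, 0), structure_counts.get(el, 0))
        for el in elements
        if expected.get(el, 0) != structure_counts.get(el, 0)
    }


# Hypothetical tally taken from a drawn structure with two extra hydrogens.
print(formula_mismatch(INCHI, {"C": 20, "H": 30, "O": 2}))
```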
Lessons
Don’t bury a depositor under tons of validation messages. Rank them by chemical severity
Bring the depositor’s attention to records with the highest warning count
Records that have inconsistent InChI, SMILES and names certainly have issues
Only bring the most ambiguous records to a depositor’s attention – let them “self-curate”
Raise records that are suspected to have critical chemical errors to the top
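The ranking idea in these lessons can be sketched as a sort over flagged records: highest severity first, then highest message count. The `Severity` levels, the `Record` class, the record IDs and the message texts below are all hypothetical, chosen only to show the ordering.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    ERROR = 2


@dataclass
class Record:
    record_id: str
    messages: list = field(default_factory=list)  # (Severity, text) pairs


def review_queue(records, limit=None):
    """Return flagged records, worst severity first, then most messages."""
    flagged = [r for r in records if r.messages]
    flagged.sort(
        key=lambda r: (max(sev for sev, _ in r.messages), len(r.messages)),
        reverse=True,
    )
    return flagged[:limit]  # limit=None keeps the whole queue


records = [
    Record("rec-1", [(Severity.WARNING, "InChI does not match structure")]),
    Record("rec-2", [(Severity.ERROR, "query bond"),
                     (Severity.WARNING, "SMILES does not match structure")]),
    Record("rec-3", [(Severity.WARNING, "InChI does not match structure"),
                     (Severity.WARNING, "SMILES does not match structure"),
                     (Severity.WARNING, "name does not match structure")]),
    Record("rec-4"),  # clean record, never shown to the depositor
]

for record in review_queue(records):
    print(record.record_id, len(record.messages))
```

Passing a small `limit` shows the depositor only the top of the queue, which keeps the message volume manageable.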
Where will we use CVSP?
For external projects we are involved with – for example, Open PHACTS, where we are responsible for hosting chemistry data mapped to biology data
Made available to RSC authors to check their chemical compounds prior to submitting articles
To be integrated into the ChemSpider deposition system to validate and standardize new data
For validating and standardizing existing data in ChemSpider and other RSC compound databases
Thanks
We would appreciate some feedback on CVSP. Please email us to request access.
Also, if you would like us to run some datasets through CVSP and send you the results, please email us.
Email: [email protected]