the rsc chemical validation and standardization platform, a potential path to quality-conscious...
Post on 11-May-2015
Chemistry Validation and Standardization Platform
Modularization and “Hadoop”ization
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
ACS New Orleans April 2013
Overview
• Motivation
• What we support
• Modularization
• Parallelization
• Examples
Motivation: validation
Open and free chemical validation system for:
•Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
•SDF field mapping for validating depositor-provided names, InChI, SMILES
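The field-mapping idea can be illustrated with a minimal sketch that reads the data items of one SD record and renames them to the roles a validator would check. This is not CVSP's own code (CVSP is written in C#); the function names, the sample record, and the field names `GENERIC_NAME`, `SMILES`, and `INCHI` are all hypothetical.

```python
def parse_sd_fields(record: str) -> dict:
    """Collect the data items of one SD record as {field name: value}."""
    fields = {}
    current = None  # name of the data item being read, if any
    for line in record.splitlines():
        if line.strip() == "$$$$":      # record terminator
            break
        if line.startswith(">"):
            # Data header such as ">  <SMILES>": the name sits in angle brackets.
            start, end = line.find("<"), line.rfind(">")
            current = line[start + 1:end] if 0 < start < end else None
            if current is not None:
                fields[current] = ""
        elif current is not None:
            if not line.strip():
                current = None           # a blank line closes the data item
            else:
                fields[current] += ("\n" if fields[current] else "") + line
    return fields


def map_fields(fields: dict, mapping: dict) -> dict:
    """Rename depositor field names to the roles a validator would check."""
    return {role: fields[name] for name, role in mapping.items() if name in fields}


# Hypothetical depositor record (structure block shortened to its last line).
RECORD = """M  END
>  <GENERIC_NAME>
ethanol

>  <SMILES>
CCO

$$$$
"""

# Hypothetical mapping chosen by the depositor: SD field name -> role.
MAPPING = {"GENERIC_NAME": "name", "SMILES": "smiles", "INCHI": "inchi"}

print(map_fields(parse_sd_fields(RECORD), MAPPING))
```

A real validator would then compare each mapped value against the structure, for example by regenerating the InChI with a cheminformatics toolkit; that step is outside this sketch.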
Motivation: standardization
Allows users to use CVSP default standardization workflow (or FDA, Open PHACTS and so on)
Allows users to put together their own workflow using modules provided:
•Apply default CVSP or user-defined SMIRKS rules
•Layout
•Neutralize
•Get canonical tautomer using ChemAxon’s algorithms
•Get biggest organic fragment
What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
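One possible way to accept plain, zipped, and gzipped uploads through a single entry point is to inspect the leading bytes of the file. The sketch below assumes this approach; the slides do not say how CVSP detects the format, and the names `open_submission` and `count_sd_records` are hypothetical.

```python
import gzip
import io
import zipfile


def open_submission(data: bytes) -> dict:
    """Return {member name: raw bytes} for a plain, gzipped, or zipped upload."""
    if data[:2] == b"\x1f\x8b":              # gzip magic number
        return {"upload": gzip.decompress(data)}
    if data[:4] == b"PK\x03\x04":            # zip local-file-header signature
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {name: archive.read(name)
                    for name in archive.namelist()
                    if not name.endswith("/")}   # skip directory entries
    return {"upload": data}                  # assume an uncompressed file


def count_sd_records(text: str) -> int:
    """Count SD records by their '$$$$' terminator lines."""
    return sum(1 for line in text.splitlines() if line.strip() == "$$$$")


# Usage with an in-memory gzipped upload holding two (empty) records.
payload = gzip.compress(b"M  END\n$$$$\nM  END\n$$$$\n")
for name, raw in open_submission(payload).items():
    print(name, count_sd_records(raw.decode("utf-8")))
```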
CVSP: modularization
Reusable workflows
SMIRKS-based rules
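A reusable workflow can be thought of as an ordered list of modules, each taking a record and returning a transformed record. The sketch below shows only that composition pattern. The two steps are toy stand-ins operating on SMILES strings, not CVSP modules: `largest_component` simply keeps the longest dot-separated part, which is a crude proxy for a real fragment chooser, and all names are hypothetical.

```python
from typing import Callable

Step = Callable[[str], str]  # a module: record in, record out


def strip_whitespace(smiles: str) -> str:
    """Toy module: trim surrounding whitespace."""
    return smiles.strip()


def largest_component(smiles: str) -> str:
    """Toy stand-in for 'biggest fragment': keep the longest dot-separated part."""
    return max(smiles.split("."), key=len)


def build_workflow(*steps: Step) -> Step:
    """Chain modules into one callable that applies them in order."""
    def run(record: str) -> str:
        for step in steps:
            record = step(record)
        return record
    return run


# A user-assembled workflow built from the modules above.
workflow = build_workflow(strip_whitespace, largest_component)
print(workflow(" CC(=O)[O-].[Na+] "))  # CC(=O)[O-]
```

Because every module shares one signature, the same steps can be reordered or reused across different workflows without changing the runner.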
“Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.
CVSP is written in C#. To run it on Linux machines we use Mono (cross-platform .NET runtime environment)
Farm:
•28 CPU cores
•42 GB memory
•2 TB disk space
Processor-intensive tasks
•Tautomerization
Processing workflow (labels from the slide diagram):
•Input file
•Deposit ID in database
•Upload to farm for processing on Hadoop
•Hadoop processing
•Download results
•Upload results to database for user preview
•Convert to SD format
Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size:
•“Small” submission queue for submissions under 500 records
•Large submissions queue
•Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
All records have to be processed on Hadoop before the user can see the results (no partial preview)
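The routing rule described above can be sketched as a small function. Only the 500-record threshold comes from the slide; the queue names returned here and the `internal` flag are hypothetical, and the real job submission to Hadoop is not shown.

```python
SMALL_QUEUE_LIMIT = 500  # from the slide: "small" means under 500 records


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue name for a submission (queue names are illustrative)."""
    if internal:
        return "internal"   # e.g. ChemSpider tautomer analysis
    return "small" if record_count < SMALL_QUEUE_LIMIT else "large"


print(choose_queue(120))                    # small
print(choose_queue(6516))                   # large
print(choose_queue(6516, internal=True))    # internal
```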
Examples
DrugBank
•~6500 records, approximately 2 records per second
PubMed
•~100 000 records, about 9 h
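To put the two figures above on a common footing, the following back-of-the-envelope sketch converts them into a run time and a rate. It uses only the approximate numbers quoted on the slide, so the results are rough estimates; the helper names are hypothetical.

```python
def minutes_for(records: int, records_per_second: float) -> float:
    """Run time in minutes for a given throughput."""
    return records / records_per_second / 60


def rate_for(records: int, hours: float) -> float:
    """Average throughput in records per second."""
    return records / (hours * 3600)


drugbank_minutes = minutes_for(6500, 2.0)   # ~6500 records at ~2 records/s
pubmed_rate = rate_for(100_000, 9)          # ~100 000 records in ~9 h

print(f"{drugbank_minutes:.0f} min, {pubmed_rate:.1f} records/s")
# 54 min, 3.1 records/s
```

On these figures the DrugBank run takes a little under an hour, and the larger set averages roughly 3 records per second.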
Rate-limiting step?
Canonical tautomerization
This molecule took 45 min to canonicalize.
DrugBank dataset (6516 records)
Errors
•2 records with query (any) bond
•2 records with R groups
•3 polymers
•18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
•Unusual valence: ~20
Warnings
•InChI not matching structure (100+)
•SMILES not matching structure (100+)
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
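An InChI string is built from slash-separated layers, so the DB00755 identifier above can be taken apart with plain string handling. The sketch below is a simplified illustration, not a full InChI parser: it assumes each layer prefix occurs only once, which holds for this example but not for every InChI, and the function name is hypothetical.

```python
def split_inchi(inchi: str) -> dict:
    """Split a simple InChI into its version, formula, and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0],
              "formula": parts[1] if len(parts) > 1 else ""}
    for part in parts[2:]:
        if part:
            layers[part[0]] = part[1:]   # e.g. 'c' connections, 'b' double-bond stereo
    return layers


DB00755 = ("InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)"
           "10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
           "/b9-6+,12-11+,15-8+,16-14+")

layers = split_inchi(DB00755)
print(layers["formula"])              # C20H28O2
print(len(layers["b"].split(",")))    # 4 double bonds with defined stereo
```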
DrugBank ID: DB00614
Stereo issues
DB08128 DB06287
J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277
Thank you
E-mail: karapetyank@rsc.org, batchelorc@rsc.org
Please try CVSP at
http://cv.beta.rsc-us.org