building a virtual ligand screening pipeline using free

Building a virtual ligand screening pipeline

using free software a surveyEnrico GlaabCorresponding author Enrico Glaab Luxembourg Centre for Systems Biomedicine (LCSB) University of Luxembourg 7 avenue des Hauts FourneauxL-4362 Esch-sur-Alzette Luxembourg Tel +352 4666 446186 Fax +352 4666 446949 E-mail enricoglaabunilu

Abstract

Virtual screening the search for bioactive compounds via computational methods provides a wide range of opportunitiesto speed up drug development and reduce the associated risks and costs While virtual screening is already a standard prac-tice in pharmaceutical companies its applications in preclinical academic research still remain under-exploited in spite ofan increasing availability of dedicated free databases and software tools In this survey an overview of recent developmentsin this field is presented focusing on free software and data repositories for screening as alternatives to their commercialcounterparts and outlining how available resources can be interlinked into a comprehensive virtual screening pipelineusing typical academic computing facilities Finally to facilitate the set-up of corresponding pipelines a downloadable soft-ware system is provided using platform virtualization to integrate pre-installed screening tools and scripts for reproducibleapplication across different operating systems

Key words virtual screening docking proteinndashligand binding ADMETox off-target effects workflow management

Introduction

In the pharmaceutical industry computational techniques toscreen for bioactive molecules have become an established com-plement to classical experimental high-throughput screeningmethods Previous success stories have shown that using virtualscreening approaches can help to reduce the required time andcosts for drug development projects and mitigate the risk for late-stage failures (eg in silico techniques were instrumental in thedevelopment of the HIV integrase inhibitor Raltegravir [1] theanticoagulant Tirofiban [2] and the influenza drug compoundZanamivir [3]) In recent years the combination of increasingcomputing power improved algorithms and a wider availabilityof relevant software tools and data repositories has made preclin-ical drug research using virtual screening more feasible for aca-demic laboratories However setting up an efficient and effectivescreening pipeline is still a major challenge and a greater aware-ness about freely available screening quality control and work-flow management software published in recent years would helpto more fully exploit the potential of in silico screening

This review discusses the recent progress in screening based onreceptors and ligands with a focus on free software tools and data-bases as alternatives to commercial resources New developmentsin the field (eg covalent docking novel machine learningapproaches for binding affinity prediction and automated work-flow management software) are covered in combination with prac-tical advice on how to build a typical screening pipeline andcontrol quality and reproducibility As a generic guideline forscreening projects with an already chosen protein drug target ofinterest (see [4] for an overview of target identification approachesnot covered here) a comprehensive framework and pipeline forvirtual small-molecule screening is described providing examplesof free software tools for each step in the process To facilitate theset-up of a corresponding screening pipeline and integrate pre-in-stalled public tools within a unified software framework a down-loadable cross-platform software for reproducible virtual screeningusing the Docker system is provided (see section on lsquoGenericscreening framework and workflow managementrsquo below and thewebsite httpsregistryhubdockercomuvscreeningscreening)

Enrico Glaab is a research associate in bioinformatics at the Luxembourg Centre for Systems Biomedicine He has set up and manages the Institutersquos vir-tual screening pipeline and works on drug target prioritization using omics data screening based on receptors and ligands and biological applications ofmachine learningSubmitted 5 March 2015 Received (in revised form) 20 May 2015

VC The Author 2015 Published by Oxford University Press For Permissions please email journalspermissionsoupcomThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40)which permits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

1

Briefings in Bioinformatics 2015 1ndash15

doi 101093bibbbv037Paper

Briefings in Bioinformatics Advance Access published June 20 2015 at U

niversity of Luxem

bourg on June 22 2015httpbiboxfordjournalsorg

Dow

nloaded from

Data collectionmolecular structure andinteraction databasesProtein structure databases

The availability of 3D structure data for a target protein of inter-est is a major benefit for virtual screening studies althoughpurely ligand-based screening methods may provide an alterna-tive if no suitable target structure can be obtained (see sectionon ligand-based screening below) An overview of the main pub-lic repositories for experimentally derived and in silico modelledprotein structures is given in Table 1 Among these the ProteinData Bank (PDB) [5] is the standard international archive forexperimental structural data of biological macromolecules cov-ering 107 000 structures as of March 2015 It provides access tothe most comprehensive collection of public X-ray crystal struc-tures and is the default resource to obtain protein structures forreceptor-based screening In spite of the rapid growth of thePDB almost doubling in size over the past six years many pro-tein families are still not covered by a representative structureand even in an ideal model scenario the coverage is notexpected to reach 80 before 2020 and 90 before 2027 [6] Asthe structures in the PDB are biased towards proteins that canbe purified and studied using X-ray crystallography nuclearmagnetic resonance (NMR) spectroscopy and electron micros-copy certain types of proteins including pharmacologically im-portant membrane proteins are underrepresented in thedatabase Importantly the quality of PDB structures is alsorestricted by limitations of the experimental methodologies eghydrogen atoms and flexible components cannot be resolvedvia X-ray diffraction and NMR techniques usually provide lowerresolutions than X-ray crystallography Often the experimentalmethods fail to determine the entire protein structure andmany PDB files have missing residues or atoms (see section onprotein structure pre-processing and quality control for guide-lines on how to deal with these and other potential shortcom-ings of PDB files)

If no suitable experimental structure for molecular dockingsimulations can be identified for a chosen target protein a bind-ing site structural model may alternatively be derived fromcomparative modelling if a template protein with close hom-ology to the target is available While the performance of dock-ing simulations using homology models will depend on thesequence similarity of the template(s) to the target protein thequality of the template structure(s) and the modelling approachthe analyses from a previous large-scale validation study byOshiro et al can provide a guideline on the results to be

expected in different scenarios [7] The authors assessed theperformance of docking into homology models using CDK2 andfactor VIIa screening data sets and found that when thesequence identity between the model and template near thebinding site is greater than 50 roughly 5 times more activecompounds are identified than by random chance (a perform-ance that was comparable with docking into crystal structuresaccording to their observations) Their publication providesa plot of the enrichment of true-positive discoveries versus thepercentage sequence identity between the template and targetwhich can serve as an orientation for future studies Large-scalecollections of existing protein structure models includingModBase [8] SWISS-MODEL [9] and PMP [10] are listed in Table1 as resources for proteins not covered by known experimentalstructures Alternatively new comparative models for spe-cific target proteins can be generated using dedicated homologymodelling tools reviewed in detail elsewhere [11] To preventspurious results due to low-quality models users can estimatethe accuracy of docking simulations based on homologymodels a priori via established indices for model quality assess-ment [12]

Small-molecule databases

Screening projects to identify new selective and potent inhibi-tors of a chosen target protein typically use large-scale com-pound libraries containing several thousands or millions ofsmall molecules to start the filtering process Depending on thegoal and type of the study (eg drug development toxin identifi-cation pesticide development) the compound library may con-tain already known drug substances for repositioning syntheticsubstances similar to lead or drug compounds for subsequentstructural optimization or other natural or xenobiotic com-pounds To design suitable compound libraries in terms of thetype number and commercial availability of the included mol-ecules access to large structured and well-annotated reposito-ries of small molecules is needed Some of the mostcomprehensive free databases include ZINC [14] (35 millioncompounds) PubChem [15] (64 million compounds) andChemSpider [16] While many of the largest databases(eg ChemNavigator [17] with 60 million compounds) are com-mercial and only provide restricted data access for academicresearch in recent years public initiatives and vendors ofsmall-molecule compounds have made several structured libra-ries publicly accessible When downloading structure files fromthese repositories users should note that they are usually not

Table 1 The main public repositories for experimentally derived and in silico modelled protein structures including details on content typeapproximate number of current entries and accessibility

Database Content type Approx no of entries Webpage

PDB [5] X-ray Solution NMR ElectronMicroscopy among others

107k (95k X-ray 11k SolutionNMR 700 Electron Microscopy300 others)

httpwwwrcsborg

ModBase [8] 3D protein models from compara-tive modelling

38 million models httpsalilaborgmodweb

SWISS-MODELRepository [9]

3D protein models from homologymodelling

32 million models httpswissmodelexpasyorgrepository

Protein Model Portal(PMP) [10]

Integration of modelled structuresfrom multiple servers

218 million models httpwwwproteinmodelportalorg

Structural BiologyKnowledgebase(SBKB) [13]

PDB structures and associatedhomology models

See PDB and PMP database httpsbkborg

2 | Glaab

at University of L

uxembourg on June 22 2015

httpbiboxfordjournalsorgD

ownloaded from

designed for virtual screening purposes and multiple pre-processing and format conversions are required An exceptionis the ZINC database a dedicated data repository for virtualscreening [14] providing unrestricted access to already pre-processed and filtered structures However even when using acollection of already pre-processed ligands it is often recom-mendable to test alternative pre-processing methods dependingon the following analysis pipeline (see section on ligand pre-processing below)

Proteinndashligand interaction and binding affinitydatabases

For most proteins only few or no small-molecule binders withhigh affinity (in the nanomolar or low micromolar range) andselectivity are already known from previous studies Moreoverthe reported affinities often vary significantly depending on theused measurement technique [18] Proteins with multipleknown and well-characterized binders for the same bindingpocket however cover several targets of biomedical interestand the existing data can provide opportunities for identifyingnew structurally similar molecules with improved selectivityand affinity via ligand-based screening (see dedicated sectionbelow) Moreover existing interaction and binding affinity dataare a useful resource for identifying or predicting off-targeteffects [19] To collect information on the known proteinndashligandinteractions for a receptor or small molecule of interest Table 2lists the main relevant databases most of which are publiclyaccessible Drug2Gene [29] the currently most comprehensivemeta-database may provide a first point of reference for mosttypes of queries Other repositories have a more specific scopeeg PDBbind [30] focuses exclusively on binding affinity datafrom proteinndashligand complexes in the PDB As the databases inTable 2 are updated at different intervals and contain manynon-overlapping entries a study requiring a comprehensivecoverage of known interactions for a target molecule should col-lect current data from all accessible repositories Importantlyissues in data heterogeneity redundancies and biases in thedatabase curation process can result in biased in silico models ofdrug effects and strategies proposed to address or alleviatethese problems include the use of model-based integrationapproaches (eg KIBA [31]) and sophisticated data curation andfiltering processes (eg the procedure proposed by Kramer et al[32] which includes the calculation of several objective qualitymeasures from differences between reported measurements)

Data pre-processingfiltering and qualitycontrol

Quality checking and pre-processing of molecular structure filesis a critical step in virtual screening projects typically involving

a combination of manual data inspection and automated pro-cessing via programming scripts In the following sections anoverview is provided of the main steps and software tools forquality control and pre-processing of protein receptor andsmall-molecule structures and filtering of the compoundlibrary

Protein structure pre-processing and quality control

A typical procedure for the preparation of protein structures forvirtual screening consists of the following steps (1) select theprotein and chain for docking simulations and determine therelevant binding pocket (2) quality control (check for formaterrors missing atoms or residues and steric clashes) (3) deter-mine missing connectivity information bond orders and partialchargesprotonation states (preferably multiple possible statesshould be considered during docking simulations) (4) addhydrogen atoms (5) optimize hydrogen bonds (6) create disul-phide bonds and bonds to metals (adjust partial chargesif needed) (7) select water molecules to be removed (preferablymultiple selections should be considered during docking simu-lations) (8) fix misoriented groups (eg amide groups of aspara-gine and glutamine the imidazole ring in histidines adjustpartial charges if needed) (9) apply a restrained protein energyminimization (run a minimization while restraining heavyatoms not to deviate significantly from the input structurereceptor flexibility should still be taken into consideration dur-ing the docking stage) and (10) final quality check (repeat thequality control for the pre-processed structure) Sastry et al per-formed a comparative evaluation of different pre-processingsteps and parameters suggesting that each of the commonoptimization steps is relevant in practice and that in particularthe H-bond optimization and protein minimization procedureswhich are sometimes left out in automated pre-processingtools can improve the final enrichment statistics [33]Interestingly their results also indicate that retaining watermolecules for protein preparation and then eliminating thembefore docking was inconsequential as compared with remov-ing water molecules prior to any preparation steps (howeverthey did not consider alternative selections of water moleculesduring the docking stage see discussion below) While Sastryet al focus on commercial pre-processing software for the dock-ing tool GLIDE [34] in the following paragraph alternativemethods and tools for the different pre-processing steps arediscussed

At first the user chooses the protein structure and chain fordocking (or ideally multiple available structures for the targetprotein are used to run docking simulations in parallel) anddetermines the relevant binding site Should the binding sitenot be known from previous crystallized proteinndashligand com-plexes several binding pocket prediction methods are available

Table 2 Overview of proteinndashligand interaction and binding affinity databases with details on the approximate current number of entries andpublic accessibility

Database Approx no of entries Free for academia Webpage

Drug2Gene [29] 44 million yes httpwwwdrug2genecomBindingDB [18] 11 million yes httpwwwbindingdborgSuperTarget [24] 330k yes httpinsilicocharitedesupertargetPDSP Ki Database [25] 55k yes httppdspmeduncedukidbphpBinding MOAD [26] 23k yes httpwwwbindingmoadorgPDBbind [30] 11k yes httpwwwpdbbindorgcnThomson Reuters MetaDrug 700k no httpthomsonreuterscommetadrug

Building a virtual ligand screening pipeline using free software | 3

at University of L



ownloaded from

eg MetaPocket [35] DoGSiteScorer [21] CASTp [36] andSplitPocket [20] (see [28] for a review of related approaches)Next a quality control is necessary as protein crystal structuresin public repositories like the PDB often contain errors or miss-ing residues (see the section on protein structure databases)Only some of the issues can be addressed by automated pre-processing tools and protein structure files should thereforefirst be checked manually PDB files can be opened in a simpletext editor and often contain important remarks on shortcom-ings of the corresponding structure eg a list of missing resi-dues Missing or mislabelled atoms (not conforming to theIUPAC naming conventions [22]) in residues unusual bondlengths and steric clashes can be identified via dedicated qual-ity checking tools eg PROCHECK [23] WHAT_IF [27] Verify3D[37] and PDB-REDO [38] Moreover by visualizing the combin-ations of backbone dihedral angles w and u of residues in a 2Dgraph known as the Ramachandran plot users can identifyunrealistic conformations in comparison with typicallyobserved ranges of wndashu combinations [39] Additional manualinspection of a protein structure in a molecular file viewer egUCSF Chimera [40] PyMOL [41] VMD [42] Yasara [43] Rasmol[44] Swiss PDB Viewer [45] and BALLView [46] should be con-ducted as well because in particular older PDB files often donot conform to the standard format resulting in unpredictableerrors in downstream analyses Molecular visualization toolslike BALLView also allow the user to add missing hydrogensand optimize their positions remove ligands from complexstructures and apply an energy minimization (however insteadof using a static minimized structure the user should preferablyapply docking approaches that account for receptor flexibilitysee section on screening using receptor structures below)Selecting the water molecules to be removed is more difficultas some of them could contribute significantly to proteinndashligandinteractions and this may depend on the specific ligandAlthough this task still remains a challenge dedicatedapproaches are available eg as part of the Relibasethorn softwarethe WaterMap (httpwwwschrodingercomWaterMapphp)and AcquaAlta [47] method Preferably different combinatorialpossibilities to include or exclude water molecules should beexplored during the docking procedure in spite of increasedruntimes Similar considerations apply to the protonationstates of residues in the active site which may vary dependingon the ligand and should ideally be chosen separately for eachdocking pose (eg using the Protonate 3D software [48] or thescoring function in the eHITS docking software [49])Moreover flipped side-chain conformations for His Gln andAsn residues may need to be adjusted to improve the inter-actions with neighbouring groups (eg using the Hthornthorn software[50]) After a final energy minimization the resulting struc-ture should be checked again using quality control tools (seeabove)

If multiple crystal structures are available for the target pro-tein users are advised to select the input for docking simula-tions not only by comparing structures in terms of resolutionbut also domain and side chain completeness presence ofmutations and errors annotated in the structure file (ideallydocking runs will be performed with multiple available struc-tures to compare the results) If on the contrary no experimen-tal or previously modelled structure of sufficient quality isavailable for the target protein potential alternatives may be touse ligand-based screening (see dedicated section below) or tocreate a new homology model (see [51] for a review of corres-ponding software) Even when using in silico modelled struc-tures the pre-processing and quality control tools mentioned

above should still be applied to check the suitability of the inputfor the following analyses

Ligand pre-processing and pre-filtering of thecompound library

Pre-processing of structure files is not only essential for macro-molecular target proteins but also for small-molecule com-pounds Large-scale compound collections are often stored incompact 1D- (eg SMILES) or 2D-formats (eg SDF) so that 3Dco-ordinates first have to be generated and hydrogen atomsadded to the structure Apart from format conversion toolssuch as OpenBabel [52] dedicated ligand pre-processing meth-ods are available to generate customized compound librariesincluding tautomeric ionization and stereochemical variantsand optionally to perform energy minimization (eg the soft-ware packages LigPrep [53] Epik [54] and SPORES [55]) Specificprotonation states and partial charges are typically assignedduring the docking stage because they should be consistent forthe protein and ligand (a wide range of methods for protonationand partial charge assignment are available and have previ-ously been compared in terms of their benefits for binding affin-ity estimation [56])

To avoid prohibitive runtimes for a docking screen againstall compounds in a public database the initial compound col-lection is typically pre-filtered in accordance with the goals andconstraints of the study For example compounds that are toolarge to fit into the targeted binding pocket should be filteredout immediately Moreover compounds can be pre-filtered interms of their lsquodrug-likenessrsquo properties eg using lsquoLipinskirsquosrule of fiversquo or related rule sets [57 58] or in terms of theirstructural and chemical similarity to already known bindingmolecules for the target (see section on ligand-based screening)Ligand similarity calculations may also help to remove highlysimilar structures from a library making it more compact whileretaining a wide coverage of diverse molecules Relevant toolsfor compound library design include Tripos Diverse SolutionAccelrys Discovery Studio Medchem Studio ilib diverse and theopen-source software ChemT [59] Finally fast methods to pre-dict bioavailability and toxicity properties of small molecules(see corresponding section on ADMETox filtering below) mayalso be applied at this stage to filter out compounds withunwanted properties early in the screening process

Compound screening and analysisReceptor-based screening

If an experimentally derived structure or a high-quality hom-ology model is available for a target protein of interestreceptor-based screening approaches can be applied to predictand rank small molecules from a compound library as putativebinders in the proteinrsquos active site For this purpose fastmolecular docking simulations are used to model and evaluatepossible binding poses for each compound After the bindingpocket has been defined and the structure has been pre-processed (see section on protein structure pre-processing) typ-ical docking programs exploit three types of techniques toevaluate large numbers of compounds efficiently

i compact structure representations (to reduce the size of thesearch space)

ii efficient search space exploration methods (to identify pos-sible docking poses) and

4 | Glaab

at University of L



ownloaded from

iii fast scoring functions (to rank compounds in terms of esti-mated relative differences in binding affinity)

Dedicated structure representations for molecular dockingusually restrict the search space to the receptor binding pocket(as opposed to lsquoblind dockingrsquo used when the location of thebinding site is unknown) and replace full-atom models by moresimplified representations These include geometric surfacerepresentations like spheres [60 61] Voronoi tessellation or tri-angulation-based representations (eg in BetaDock [62]) gridrepresentations in which interaction potentials of probe atomsare mapped to points on a grid with adjustable coarseness(eg in the AutoDock software [63 64]) or a reduction to pointsand vectors reflecting critical properties for the interaction withthe ligand (eg the LUDI representation [65] used in FlexX [66])Apart from the structure representation the size of the searchspace also depends on the extent to which structural flexibilityof the ligand and receptor is taken into account While the con-sideration of ligand flexibility has become a standard in molecu-lar docking since the introduction of the FlexX software [66]accounting for receptor flexibility and conformational adjust-ments in the binding pocket upon ligand binding is still a majorchallenge due to the significant increase in degrees of freedomto be explored However depending on the targeted proteinfamily protein flexibility can often have a decisive influence onbinding events and is a major limiting factor for successfulscreening Two main generic models have been proposed todescribe protein conformational changes upon binding eventsthe lsquoinduced-fitrsquo model in which the interaction between a pro-tein and its binding partner induces a conformational change inthe protein and the lsquoconformational selectionrsquo model(also referred to as population selection fluctuation fit orselected fit model) in which among the different conform-ations assumed by the dynamically fluctuating protein the lig-and selects the most compatible one for binding [67 68]Current computational techniques to address receptor flexibil-ity include the use of multiple static receptor representationsthat reflect different conformations (a strategy known aslsquoensemble dockingrsquo) [69] the search for alternative amino acidside-chain conformations at the binding site using rotamerlibraries [70 71] and the representation of flexibility via relevantnormal modes [72]

Even without the consideration of receptor flexibility thevast search space resulting from the combination of possibleconformations and docking poses typically makes an exhaust-ive search infeasible without extensive prior filtering Genericmeta-heuristics are therefore often applied to explore possibledocking solutions more efficiently eg Monte Carlo approaches(used in RosettaLigand [73] GlamDock [74] GLIDE [34] andLigandFit [75] among others) or Evolutionary Algorithms (usedin GOLD [76] FITTED [77] BetaDock [62] and FLIPDock [71]) Analternative search method derived from de novo ligand design isthe Incremental Construction approach [78] which first places abase fragment or anchor fragment of the ligand in the bindingpocket and then adds the remaining fragments incrementallyto fill cavities considering different possible solutions resultingfrom conformational flexibility (eg used in FlexX [66 78] Dock[6061] and Surflex [79]) More recently docking approachesusing an exhaustive search within multi-step filteringapproaches for docking poses have been proposed eg usingreduced-resolution shape representations and a smooth shape-based scoring function (FRED [80]) or applying a new graphmatching algorithm to enumerate all compatible pose combin-ations of rigid sub-fragments from a decomposed ligand

(eHITS [81 82]) Table 3 provides an overview of currently avail-able free and commercial proteinndashligand docking programs andthe main algorithmic principle used highlighting that a wideselection of current approaches is already freely available foracademic research

The broad range of structure representation and searchmethodologies covered by these software tools is comple-mented by an equally wide variety of scoring functions used toevaluate docking poses These can roughly be grouped intothree types of approaches (1) classical molecular mechanics orforce field-based methods (eg adjusted force fields like AMBER[93] and CHARMM [86] and variants applied in DOCK [60 61]GoldScore [76 98] and AutoDock [63 64]) (2) empirical scoringfunctions obtained via regression analysis of experimentalstructural and binding affinity data (eg ChemScore [99] FlexXF-Score [102] X-Score [103] GlideScore [34] LUDI [65] PLP [104]Cyscore [105] ID-Score [106] and Surflex [79 107] and (3)knowledge-based scoring functions derived using informationfrom resolved crystal structures (eg DrugScore [108] DSX [109]PMF [110] ITScore [94] SMoG [111] STScore [112] and ASP [113])Moreover many in silico screening pipelines have extendedthese fast-ranking approaches by applying a refined but moretime-consuming scoring as a post-processing to only the top-ranked poses eg using methods for absolute binding affinityestimation [114ndash116]

To give the user an overview of the typical predictive per-formance and runtime efficiency to be expected from com-monly used receptor-based screening approaches a variety ofcomparative reviews have been conducted Docking perform-ance is typically measured via the enrichment factor ie for agiven fraction x of the screened compound library this factorcorresponds to the ratio of experimentally found active struc-tures among the top x ranked compounds to the expectednumber of actives among a random selection of x compoundsWhen comparing different docking methods on benchmarkdata with known actives the enrichment factors for the top 15 and 10 of ranked compounds vary significantly across dif-ferent targets (eg depending on the protein family the qualityof the crystal structure and the drugability of its binding pocket)and different docking methods ranging from between 16 to148 with a median enrichment factor of 4 in a large-scale valid-ation study (always using the best-performing scoring functionavailable for each docking method) [117] However no methodwas consistently superior to other approaches across differentdata sets A separate comparative study plotted the rate of true-positive identifications against the rate of false-positives fordifferent docking approaches and benchmark data sets to deter-mine the area under the curve (AUC) as a performance measure[92] Mean AUC values between 055 and 072 were obtainedand the GLIDE HTVS approach [34] significantly outperformedother methods Instead of relying on published evaluation stud-ies users can also evaluate their own docking pipeline on oneof the widely used benchmark collections eg the Directory ofUseful Decoys [118] and Maximum Unbiased Validation [119]Apart from the predictive performance the runtime require-ments for docking simulations also vary largely depending onthe size and conformational flexibility of ligand(s) and the bind-ing pocket (or the protein surface for blind docking) and thestructure representation scoring and search space explorationapproach used To alleviate the computational burden resultingfrom a runtime behaviour that tends to scale exponentiallywith the number of degrees of freedom to be explored dockingalgorithms use efficient sampling techniques [102 120] and


at University of L



ownloaded from

search space exploration methods (eg divide-and-conquer orbranch-and-bound [97 120]) and prior knowledge to prune thesearch space eg from rotamer libraries [70 71] Moreoversome docking algorithms have been parallelized [121 122] orextended to exploit GPU acceleration [123 124] and FPGA-basedsystems [124 125] On a common mono-processor Linux work-station typical software tools dock up to 10 compounds per sec-ond [126 127] but to obtain reliable runtime estimates the usershould perform test runs on a few representative compoundsfor the library to be screened In any case the user will need totake into consideration that the achievable quality and effi-ciency of docking algorithms will always be subject to generallimitations resulting from the restricted quality of the inputreceptor structure(s) the total number of degrees of freedom forfully flexible docking and the inaccuracies of in silico scoringfunctions

Apart from classical docking approaches in recent yearsseveral software packages have also complemented conven-tional screening for non-covalent interactions by dedicatedcovalent docking methods eg DOCKTITE [83] for the MOE pack-age [84] CovalentDock [85] for AutoDock [63 64] CovDock [87]for GLIDE [34] and DOCKovalent [88] for DOCK [60 61] Theseapproaches typically first identify nucleophilic groups in thetarget protein and electrophilic groups in the ligand and then

apply similar search space exploration methods as in classicaldocking using dedicated scoring terms to account for theenergy contribution of covalent bonds (however often the userfirst has to specify an attachment site eg a cysteine or serineresidue in the binding pocket)

A further more recent development is the use of consensusranking and machine learning techniques to combine either thefinal outcomes for different docking methods or integrate differ-ent components of their scoring functions to obtain a more reli-able assessment of docking solutions [89ndash91 95] Theseintegration techniques outperform individual algorithms in thegreat majority of applications suggesting that users shouldideally not rely on only a single docking approach or scoringfunction Using parallel processing on high-performance com-puting systems such integrative compound rankings across dif-ferent methods can be obtained without significantly extendingthe overall runtime

Finally new drug design techniques have been developed toaccount for protein mutations that may confer drug resistanceeg in cancer cells and viral or bacterial proteins Generally twotypes of strategies can be distinguished (1) approaches directlytargeting the mutant proteins with drug resistance and (2)approaches using single drugs or drug combinations targetingmultiple proteins Combinatorial therapies using multiple drugs

Table 3 Software tools for protein-ligand docking with information on the main algorithmic principle used and the public accessibility

Software Principle Free for academia Webpage

AutoDock [63 64] Monte Carlo amp Lamarckian geneticalgorithm

Yes httpautodockscrippsedu

AutoDock Vina [130] Iterated local search Yes httpvinascrippseduDOCK [60 61] Incremental construction Yes httpdockcompbioucsfeduSLIDE [131 132] Mean field theory optimization Yes httpwwwbchmsuedukuhnsoft-

wareslideindexhtmlRosettaLigand [73] Monte Carlo Yes httprosettadockgraylabjhueduFRED [80] Exhaustive search multi-step

filteringYes httpwwweyesopencomoedocking

FITTED [77] Genetic algorithm Yes (no cluster use) httpwwwfittedcaGlamDock [74] Monte Carlo Yes httpwwwchil2deGlamdockhtmlSwissDock EADock DSS [133 134] Exhaustive ranking amp clustering of

tentative binding modesYes httpwwwswissdockch

iGEMDOCK GEMDOCK [135] Evolutionary algorithm Yes httpgemdocklifenctuedutwdockigemdockphp

rDOCK [136] Genetic algorithm thornMonte Carlo thornsimplex

Yes httprdocksourceforgenet

BetaDock [62] Genetic algorithm Yes httpvoronoihanyangackrsoftwarehtm

FLIPDock [71] Genetic algorithm Yes httpflipdockscrippseduGalaxyDock2 [137] Conformational space annealing Yes httpgalaxyseoklaborgsoftwares

galaxydockhtmlLeadIT (FlexXHYDE) [66 114] Incremental construction No httpwwwbiosolveitdeleaditGLIDE [34] Side point search thornMonte Carlo No httpwwwschrodingercomGOLD [76] Genetic algorithm No httpwwwccdccamacukSolutions

GoldSuitePagesGOLDaspxSurflex [79 107] Incremental construction No httpwwwtriposcomindexphpICM [138 139] Iterated local search No httpwwwmolsoftcomdockinghtmlMOE [84] Parallelized FlexX (see above) No httpwwwchemcompcomMOE-Molec

ular_Operating_EnvironmenthtmLigandFit [75] Monte Carlo No httpaccelryscomproductsdiscovery-

studioeHiTS [81 82] Exhaustive search multi-step

filteringNo httpwwwsimbiosyscaehits

indexhtmlDrug Discovery Workbench [140 141] Multiple metaheuristics No httpwwwclcbiocomproductsclc-

drug-discovery-workbench

6 | Glaab

at University of L



ownloaded from

with multiple targets have become a standard for the treatmentof HIV infections and statistical learning software to predictoptimal drug combinations from the HIV genome sequence inparticular the Geno2Pheno approach [96] is already applied inclinical practice Directly targeting mutant proteins is a morechallenging task as crystal structures of the mutants are usu-ally not available and difficult to model in silico Hao et al pro-pose an interesting strategy to use conformational flexibilitywithin inhibitor structures to address drug resistance focusingon the HIV-1 reverse transcriptase target [100] However asincreased structural flexibility may also result in reduced select-ivity most published approaches to counteract drug resistanceuse conventional small-molecule design techniques but exploitdetailed knowledge on the structural basis of resistance for spe-cific targets to prioritize ligands in terms of the likelihood ofbinding robustly to different mutated variants of the target Forexample Esser et al analyse where and why inhibitors for therespiratory component cytochrome bc1 complex subunit failand propose alternatives by considering different active sites inthe protein [101] For kinase targets in cancer diseases Bikkeret al discuss patterns formed by the location of resistance mu-tations across multiple targets and their implications for drugdesign [128] Apart from mutations other types of drug resist-ance mechanisms eg over-expression of efflux transporters incancers have previously been reviewed in detail [129]

Ligand-based screening

A receptor structure of sufficient quality for docking simula-tions is often not available for a chosen target proteinAlternatively if binders for the target binding pocket are alreadyknown further compounds may be predicted as binders withsimilar type of activity from their structural and chemical simi-larity to the known ligands In analogy to the previously dis-cussed docking methods corresponding ligand-based screeningtechniques differ in terms of structure representation consider-ation of structural flexibility and the used search methodologyand scoring function

To represent structures compactly for fast similaritysearches a wide variety of molecular descriptors has been pro-posed including 0D-descriptors (simple count and constitutionaldescriptors like atom count bond count and molecular weight)1D-descriptors (binary fingerprints for the presenceabsenceof structural features fragment counts and rule-based sub-structure representations known as SMILESSMARTS [142])2D-descriptors (topological descriptors graph invariants likeconnectivity indices as well as feature trees [143] see discus-sion below) 3D-descriptors (geometry surface and volumedescriptors like 3D-WHIM [144] and 3D-MORSE [145]) and4D-descriptors (stereoelectronic and stereodynamic descriptorsobtained from grid-based quantitative structure activity rela-tionship [QSAR] methods like CoMFACOMSIA [146 147] imple-mented in Open3DQSAR [148] or dynamic QSAR techniquescovering time-dependent 3D-properties like conformationalflexibility and transport properties [149]) A detailed compen-dium of molecular descriptors has recently been compiled byTodeschini and Consonni [150]

The scoring method to quantify the structural similaritymainly depends on the used descriptor types and individualchoices on how to weigh the relevance of different molecularfeatures For binary fingerprint descriptors compound similar-ity is often quantified using the Tanimoto coefficient ie theproportion of the features shared among two molecules dividedby the size of their union (similar scores include the Dice Index

and Tversky Index with adjustable weights see [151] for a com-parison of different approaches) More recently similarity scor-ing using data compression and the information-theoreticconcept of the Normalized Compression Distance [152] hasbeen proposed for string-based molecule representations(implemented in the software Zippity [153]) To account for bothtopological and physicochemical properties Rarey and Dixonintroduced a fast screening approach using feature trees agraph-based representation of molecular sub-fragments andtheir interconnections [143] While these techniques relying on1D- and 2D-descriptors are suitable for screening millions ofcompounds more complex scoring functions using 3D- and4D-descriptors statistical learning and available binding affin-ities for already known binders can provide more accurate esti-mations but involve significantly higher runtimes (ie they aremainly suitable for post-screening of pre-selected compounds)In particular using more computationally expensive algorithmsfor flexible ligand superposition compounds can be overlaidonto known binding molecules by matching their shape andfunctional groups (eg implemented in CatalystHipHop [154]SLATE [155] DISCO [156] GASP [157] GALAHAD [158] GAPE[159] and PharmaGIST [160]) or by superimposing their frag-ments incrementally onto a template ligand kept rigid as inFlexS [161] The superposition of known binders can also enablethe inference of a pharmacophore ie the 3D arrangement offunctional groups and structural features relevant for the bind-ing interactions with the receptor providing useful constraintsto restrict the screening search space Moreover if a sufficientlylarge and diverse training set of known binders is availablesharing the same binding pocket and binding mode the super-imposition of new compounds may enable the prediction oftheir most likely binding conformations and affinities viamachine learning and 3D-QSAR methods (eg COMFA andCOMSIA [146 147]) Overall the choice of molecular descriptorsdepends on the envisaged application the available data andruntime for the analysis Previous comparative reviews mayhelp users to select adequate descriptors and associatedanalysis techniques (see [162] for a review on descriptors forfast ligand-based screening and section Protein structurepre-processing and quality control in [163] for a comparison ofdescriptor-based methods for binding affinity prediction) As anadditional filter for a pre-selection of candidate descriptorsstatistical feature selection methods can be applied [164] Thereader should also note the generic limitations of different de-scriptor types in particular 1D- and 2D-descriptors can onlycapture limited and indirect information on the spatial struc-ture of ligands whereas the descriptors used in 3D-QSAR meth-ods like COMFA and COMSIA overcome this restriction at theexpense of necessitating a computationally complex ligandsuperpositioning [163] Descriptors for dynamic 3D propertieslike conformational flexibility cover an additional layer of infor-mation not sufficiently addressed by simpler descriptor types[149] however the amount and type of data required to calcu-late these descriptors limits their applicability Apart from thetype of information captured by descriptors their interpretabil-ity may also be considered as a selection criterion (eg topo-logical indices [165] have been criticized for a lack of a clearphysicochemical meaning) As the number of proposed descrip-tors continues to grow and no simple rules are available tochoose optimal descriptors for each application users may alsowish to consult dedicated reference works explaining and com-paring descriptor properties in detail [150]

Moreover performance evaluations have been conducted onbenchmark data to compare ligand-based screening methods


at University of L



ownloaded from

using different descriptors against receptor-based screeningtechniques Interestingly in many of these studies ligand-based methods have been reported to provide either similar orbetter enrichment of actives among the top-ranked compounds[117 126 166] For example Venkatraman et al found that 2Dfingerprint-based approaches provide higher enrichment scoresthan docking methods for many targets in benchmark data sets[166] However as most ligand-based screening approachesscore new compounds in terms of their similarity to alreadyexisting binders the novelty of top-ranked molecules may oftenbe limited as compared with new binders identified via dockingapproaches From their results Venkatraman et al also derivethe recommendation to use descriptors that can representmultiple possible conformations of a ligand Another compara-tive study by Kruger et al obtained comparable enrichmentswith approaches based on receptors or ligands but diverse per-formance results were observed across different groups of tar-gets [117] Therefore the authors suggest to consider both typesof approaches as complementary and if possible apply themjointly to increase the number and structural variety of identi-fied actives Indeed a comparison of data fusion techniques tocombine screening based on receptors or ligands by Sastry et al[92] showed that the average enrichment in the top 1 ofranked compounds could be improved by between 9 and 25in comparison to the top individual approach for differentbenchmark data sets (with a mean enrichment factor between20 to 40)

One of the main advantages of ligand-based screening meth-ods using 0D- 1D- and 2D-descriptors are their extremely shortruntimes eg fingerprint similarity searches can screen around10 000 ligands per second on a 24-GHz AMD Opteron processor[92] For comparison on the same processor 3D-ligand basedmethods like shape screening can screen roughly 10 ligands persecond on a database of pre-computed conformations anddocking with Glide HTVS takes approx 1ndash2 s per ligand [92]However the applicability of ligand-based screening methods isstrictly limited by the availability number and diversity ofknown binding ligands for the target and specific binding pocketof interest and the most widely used similarity-based scoringfunctions will by design only find compounds with high similar-ity to already known binders

In summary although most ligand-based approaches arenot designed to identify entirely new binders with diverse struc-tures and binding modes structurally similar compounds toknown binders may still display improved properties in termsof affinity selectivity or ADMETox properties as exemplifiedby previous success stories [167ndash169] Finally if both the recep-tor structure and an initial set of known binders are availablefor the target protein the combination of screening techniquesbased on receptors or ligands may help to increase the enrich-ment of active molecules among the top-ranked compounds[170]

ADMETox and off-target effects prediction

In preclinical drug development projects screening using dock-ing or ligand similarity scoring is often applied in combinationwith in silico methods to estimate bioavailability selectivity tox-icity and general pharmacokinetics properties to filter com-pounds more rigorously before final experimental testingWhile simple rules to evaluate lsquodrug-likenessrsquo and oral bioavail-ability like lsquoLipinskirsquos rule of fiversquo and similar rule sets [57 58]already enable a fast pre-selection of compounds machinelearning techniques provide opportunities for more accurate

and detailed assessments of a wider range of outcome meas-ures The computational prediction of ADMETox properties(ie Absorption Distribution Metabolism Elimination andToxicity properties) is therefore gaining increasing attention

For this purpose quantitative structure-property relation-ship (QSPR) models ie regression or classification modelsrelating molecular descriptors to a target property of interesthave been developed to predict various pharmacokinetic andbiopharmaceutical properties While classical QSPRs are mostlydesigned as simple linear models depending on only a fewdescriptors more recently advanced statistical learning meth-ods combining feature selection with support vector machinespartial least squares discriminant analysis and artificial neuralnetworks have been used to build more reliable ADMETox pre-diction models [171 172] To evaluate and compare differentmodels performance statistics like the mean cross-validatedaccuracy or squared error the standard deviation and FisherrsquosF-value can be used (see [173] for a review of QSPR validationmethods)

Apart from QSPR models rule-based expert systems likeMETEOR [174] MetabolExpert [175] and META [176] use largeknowledge bases of biotransformation reactions to providerough indications of the possible metabolic routes for a com-pound Expert systems have also been proposed to combinelarge collections of rules for toxicity prediction as QSPR modelsare mostly limited to specific toxicity endpoints Changes in asingle reactive group can turn a non-toxic into a toxic com-pound and long-term toxicities are generally difficult to identifyand study hence the available prediction software mainlyfocuses on established fragment-based rules for acute toxicity(relevant software includes COMPACT [177] OncoLogic [178]CASE [179 180] MultiCASE [181] Derek Nexus [182] TOPKAT[183] HazardExpert Pro [184] ProTox [185] and the open-sourceToxtree [186])

A further option to identify adverse effects resulting fromoff-target binding are inverse screening approaches screening aligand against many possible receptor proteins using docking orsimilarity scoring to their known binders Fast heuristicapproaches for this purpose include idTarget [187] TarFisDock[188] INVDOCK [189] ReverseScreen3D [190] PharmMapper[191] SEA [192] SwissTargetPrediction [193] and SuperPred[194]

Overall current in silico ADMETox modelling and predictionmethods are still limited in their accuracy and coverage for esti-mating biomedically relevant compound properties but mayprovide useful preliminary filters to exclude subsets of com-pounds with high likelihood of being toxic or having insufficientbioavailability

Generic screening framework and workflowmanagement

Implementing an in silico screening project for preclinical drugdevelopment requires the set-up of a complex analysis pipelineinterlinking multiple task-specific software tools in an efficientmanner In Figure 1 a generic screening framework is showncovering the typical steps in computational small-moleculescreening projects and providing examples of free softwaretools for each task

The four main phases in the framework (1) data collection(2) pre-processing (3) screening and (4) selectivity andADMETox filtering are common across different projectsand sub-divided into more specific sub-tasks for data collection

8 | Glaab

at University of L



ownloaded from

and pre-processing whose implementation will depend on theavailable resources the chosen strategy and study type (eg dif-fering for screening studies based on receptors or ligands)

To implement a corresponding software pipeline and facili-tate the interlinked reproducible and automated application ofscreening software various workflow management tools havebeen developed over the past years The most widely used sys-tems in structural bioinformatics are the open-source softwareKNIME [195 196] and the commercial Pipeline pilot (Accelrys)

but several other tools exist including Taverna [197] KDEBioscience [198] Galaxy [199] Kepler [200] VisTrails [201]Vision [202] Triana [203] and SOMA2 [204] These approachesmainly differ in terms of the supported level of parallelism egin KNIME a new task can only start after completing the preced-ing one whereas in pipelining tools like Pipeline pilot taskoperations continue on the next records in the data streamwhile already processed data records are passed on to the nexttask Pipelining approaches often have advantages in terms of

Figure 1 Generic framework for in silico small-molecule screening (examples of free software tools for each step are listed in brackets)


at University of L



ownloaded from

efficiency however the workflow methodology used in KNIMEmay make it easier for the user to inspect intermediate outputsidentify task-specific issues and resume the execution of inter-rupted workflows (eg after a power-cut)

KNIME also supports the integration of different databases(eg MySQL SQLite Oracle IBM DB2 Postgres) to load manipu-late and store data efficiently and similarly Pipeline pilotcan integrate standard databases via the Open DatabaseConnectivity (ODBC) protocol (specifically for the integration ofmolecular and biological databases templates are already avail-able) Due to the small sizes of ligand files and the limited spacerequired to store compressed numerical screening data thetotal disk space required for a screening study is typically not amajor limiting factor with current hard disk capacities in par-ticular because workflow management tools like KNIME areable to store only the differences between consecutive nodesHowever frequent disk-access operations can slow down theexecution of screening workflows The available options toaddress this issue include data caching in-memory storage andthe use of efficient database queries Thus workflow manage-ment systems like KNIME and Pipeline pilot are not meant toreplace database systems for effective storage and retrieval ofscreening results but rather integrate these databases and pro-vide additional features to simplify the set-up monitoringadjustment and sharing of screening workflows

Other workflow management systems are mostly used fordifferent applications but partly also provide dedicated featuresfor virtual screening For example the free Taverna system canbe interlinked with the open-source cheminformatics Javalibrary CDK [205] and the Bioclipse workbench [206] for QSARanalyses and molecular visualizations Some of the systems aredesigned specifically for visual data exploration and users withlimited programming experience allowing the set-up of com-plex workflows and subsequent data analysis in an almostpurely visual manner eg Vision [202] and VisTrails [201]Together with Taverna VisTrails also stands out for its strongfocus on data reproducibility and provenance management

The set-up of reproducible screening pipelines can also befacilitated via open virtualization platforms to run distributedapplications eg the Docker platform (httpswwwdockercom) As a complementary software to this review article adownloadable cross-platform system for reproducible virtualscreening using Docker has been implemented and made pub-licly available for the reader (httpsregistryhubdockercomuvscreeningscreening) It integrates several free tools coveringthe different phases of the proposed generic framework forscreening based on receptors or ligands eg OpenBabel [52]for file format conversions and filtering AutoDock Vina [63 64]for molecular docking CyScore [105] for binding affinity predic-tion and ToxTree [186] to estimate toxicity hazards among vari-ous others (see httpsregistryhubdockercomuvscreeningscreening for details) A script to run an example screening forinhibitors of HIV-1 protease using compounds from the NCIDiversity Set 2 [207] is also provided and the user can simplychange the input files to study alternative targets and com-pound libraries

In summary workflow management and virtualizationtools provide new means to obtain reproducible and portablescreening pipelines which can be adjusted and extendedwith minimal effort The framework and software proposedhere may serve as a starting point to test and comparecombinations of different public tools or to expand and alterthe framework to meet the goals of a specific new screeningproject

Conclusions

Virtual small-molecule screening is still a highly challengingtask with many possible pitfalls eg due to errors in the inputstructures and limitations in the scoring and search space ex-ploration methods However as highlighted in the genericframework for in silico screening presented here free softwareand relevant public databases have now become available foreach common task in a screening project This is partly due tothe recent expiration of patent protection for some fundamen-tal cheminformatics techniques (eg CoMFA [146]) but mainlydue to a growing open-source community developing fre-quently updated and freely modifiable screening tools More re-cently such non-proprietary software alternatives are alsobecoming more widespread for the workflow management ofcomplex screening pipelines on diverse computing platformsAs a result efficient and reproducible screening workflows cannow be implemented at lower cost and effort making preclin-ical drug research projects more feasible within an academicsetting

Key Points

bull A wide range of free tools and resources for each com-mon task in virtual small-molecule screening have be-come available in recent years These tools can becombined into professional screening pipelines usingtypical hardware facilities in an academicenvironment

bull Molecular structure files from public databases areusually not pre-processed for virtual screening pur-poses In particular PDB files for protein crystal struc-tures are often affected by several errors and missingresidues Therefore care must be taken to apply ad-equate pre-processing and quality control methodsduring the initial stages of a screening project

bull Workflow management systems can greatly facilitatethe set-up monitoring and adjustment of virtualscreening pipelines They allow users to build reprodu-cible workflows that can be scaled from desktop sys-tems to high-performance grid and cloud computingplatforms

Funding

This work was supported by the Fonds Nationale de laRecherche Luxembourg (grant no C13BM5782168)

References1 Schames JR Henchman RH Siegel JS et al Discovery of a

Novel Binding Trench in HIV Integrase J Med Chem 2004471879ndash81

2 Clark DE What has virtual screening ever done for drug dis-covery Expert Opin Drug Discov 20083841ndash51

3 MacConnachie AM Zanamivir (Relenza)ndasha new treatmentfor influenza Intensive Crit Care Nurs 199915369ndash70

4 Hajduk PJ Huth JR Tse C Predicting protein druggabilityDrug Discov Today 2005101675ndash82

5 Berman HM The protein data bank Nucleic Acids Res 200028235ndash42

6 Oberai A Ihm Y Kim S et al A limited universe of mem-brane protein families and folds Protein Sci 2006151723ndash34

10 | Glaab

at University of L



ownloaded from

7 Oshiro C Bradley EK Eksterowicz J et al Performance of3D-Database Molecular Docking Studies into HomologyModels J Med Chem 200447764ndash7

8 Pieper U Webb BM Dong GQ et al ModBase a database ofannotated comparative protein structure models and asso-ciated resources Nucleic Acids Res 201432D21ndash22

9 Kiefer F Arnold K Kunzli M et al The SWISS-MODELRepository and associated resources Nucleic Acids Res 200937D387ndash9

10 Haas J Roth S Arnold K et al The Protein Model Portalndashacomprehensive resource for protein structure and modelinformation Database (Oxford) 20132013bat031

11 Vyas V Ukawala R Chintha C et al Homology modeling afast tool for drug discovery current perspectives IndianJ Pharm Sci 2012741

12 Bordogna A Pandini A Bonati L Predicting the accuracy ofprotein-ligand docking on homology models J Comput Chem20113281ndash98

13 Gabanyi MJ Adams PD Arnold K et al The StructuralBiology Knowledgebase a portal to protein structuressequences functions and methods J Struct Funct Genomics20111245ndash54

14 Irwin JJ Shoichet BK ZINC - A free database of commerciallyavailable compounds for virtual screening J Chem Inf Model200545177ndash82

15 Bolton EE Wang Y Thiessen PA et al PubChem integratedplatform of small molecules and biological activities AnnuRep Comput Chem 20084217ndash41

16 Pence HE Williams A Chemspider an online chemicalinformation resource J Chem Educ 2010871123ndash24

17 Hurst T Chemnavigatorcom an IResearch (TM) system forthe acquisition of compounds for pharmaceutical lead fol-low-up Abstr Pap Am Chem Soc 2000219U462

18 Liu T Lin Y Wen X et al BindingDB a web-accessible data-base of experimentally determined protein-ligand bindingaffinities Nucleic Acids Res 200735D198ndash201

19 Schneider G Tanrikulu Y Schneider P Self-organizingmolecular fingerprints a ligand-based view on drug-likechemical space and off-target prediction Future Med Chem20091213ndash18

20 Tseng YY Dupree C Chen ZJ et al SplitPocket identificationof protein functional surfaces and characterization of theirspatial patterns Nucleic Acids Res 200937W984ndash9

21 Volkamer A Kuhn D Rippmann F et al DoGSiteScorer aweb-server for automatic binding site prediction ana-lysis and druggability assessment Bioinformatics 2012282074ndash5

22 IUPAC-IUB Commission on Biochemical NomenclatureIUPAC-IUB commission on biochemical nomenclature J MolBiol 1970521ndash17

23 Laskowski RA MacArthur MW Moss DS et al PROCHECK aprogram to check the stereochemical quality of proteinstructures J Appl Crystallogr 199326283ndash91

24 Gunther S Kuhn M Dunkel M et al SuperTarget and mata-dor resources for exploring drug-target relationshipsNucleic Acids Res 200836D919ndash22

25 Roth BL Lopez E Patel S et al The multiplicity of serotoninreceptors uselessly diverse molecules or an embarrassmentof riches Neuroscience 20006252ndash62

26 Benson ML Smith RD Khazanov NA et al Binding MOAD ahigh-quality protein-ligand database Nucleic Acids Res 200836D674ndash78

27 Vriend G WHAT IF a molecular modeling and drug designprogram J Mol Graph 1990852ndash6

28 Laurie ATR Jackson RM Methods for the prediction ofprotein-ligand binding sites for structure-based drugdesign and virtual ligand screening Curr Protein Pept Sci 20067395ndash406

29 Roider HG Pavlova N Kirov I et al Drug2Gene an exhaust-ive resource to explore effectively the drug-target relationnetwork BMC Bioinformatics 20141568

30 Wang R Fang X Lu Y et al The PDBbind database collectionof binding affinities for protein-ligand complexes withknown three-dimensional structures J Med Chem 2004472977ndash80

31 Tang J Szwajda A Shakyawar S et al Making sense of large-scale kinase inhibitor bioactivity data sets a comparativeand integrative analysis J Chem Inf Model 201454735ndash43

32 Kramer C Kalliokoski T Gedeck P et al The experimentaluncertainty of heterogeneous public K i data J Med Chem2012555165ndash73

33 Madhavi Sastry G Adzhigirey M Day T et al Protein and lig-and preparation parameters protocols and influence onvirtual screening enrichments J Comput Aided Mol Des 201327221ndash34

34 Friesner RA Banks JL Murphy RB et al Glide a newapproach for rapid accurate docking and scoring 1 Methodand assessment of docking accuracy J Med Chem 2004471739ndash49

35 Huang B MetaPocket a meta approach to improve proteinligand binding site prediction OMICS 200913325ndash30

36 Dundas J Ouyang Z Tseng J et al CASTp computed atlas ofsurface topography of proteins with structural and topo-graphical mapping of functionally annotated residuesNucleic Acids Res 200634W116ndash8

37 Eisenberg D Luthy R Bowie JU VERIFY3D assessment ofprotein models with three-dimensional profiles MethodsEnzymol 1997277396ndash406

38 Joosten RP Salzemann J Bloch V et al PDB-REDO auto-mated re-refinement of X-ray structure models in the PDBJ Appl Crystallogr 200942376ndash84

39 Gopalakrishnan K Sowmiya G Sheik SS et alRamachandran plot on the web (20) Protein Pept Lett 200714669ndash71

40 Pettersen EF Goddard TD Huang CC et al UCSF Chimerandashavisualization system for exploratory research and analysisJ Comput Chem 2004251605ndash12

41 Rother K Introduction to PyMOL Methods Mol Biol Clift Nj20056351ndash32

42 Humphrey W Dalke A Schulten K VMD visual moleculardynamics J Mol Graph 19961433ndash8

43 Krieger E Vriend G YASARA View - molecular graphicsfor all devices - from smartphones to workstationsBioinformatics 2014302981ndash2

44 Goodsell DS Representing structural information withRasMol Curr Protoc Bioinformatics 2005Chapter 5Unit 54

45 Guex N Peitsch MC SWISS-MODEL and the Swiss-Pdbviewer an environment for comparative protein modelingElectrophoresis 1997182714ndash23

46 Moll A Hildebrandt A Lenhof HP et al BALLView an object-oriented molecular visualization and modeling frameworkJ Comput Aided Mol Des 200519791ndash800

47 Rossato G Ernst B Vedani A et al AcquaAlta a directionalapproach to the solvation of ligand-protein complexesJ Chem Inf Model 2011511867ndash81

48 Labute P Protonate3D assignment of ionization states andhydrogen coordinates to macromolecular structuresProteins Struct Funct Bioinform 200975187ndash205


at University of L



ownloaded from

49 Zsoldos Z Reid D Simon A et al eHiTS a new fast exhaust-ive flexible ligand docking system J Mol Graph Model 200726198ndash212

50 Anandakrishnan R Aguilar B Onufriev AV Hthornthorn 30 auto-mating pK prediction and the preparation of biomolecularstructures for atomistic molecular modeling and simula-tions Nucleic Acids Res 201240W537ndash41

51 Dalton JAR Jackson RM An evaluation of automated hom-ology modelling methods at low target template sequencesimilarity Bioinformatics 2007231901ndash8

52 OrsquoBoyle NM Banck M James CA et al Open Babel an openchemical toolbox J Cheminform 2011333

53 Chen IJ Foloppe N Drug-like bioactive structures and con-formational coverage with the ligprepconfgen suite com-parison to programs MOE and catalyst J Chem Inf Model 201050822ndash39

54 Shelley JC Cholleti A Frye LL et al Epik a software programfor pKa prediction and protonation state generation fordrug-like molecules J Comput Aided Mol Des 200721681ndash91

55 Brink T Ten Exner TE Influence of protonation tautomericand stereoisomeric states on protein-ligand docking resultsJ Chem Inf Model 2009491535ndash46

56 Oehme DP Brownlee RTC Wilson DJD Effect of atomiccharge solvation entropy and ligand protonation state onMM-PB(GB)SA binding energies of HIV protease J ComputChem 2012332566ndash80

57 Lipinski CA Lead- and drug-like compounds the rule-of-fiverevolution Drug Discov Today Technol 20041337ndash41

58 Walters WP Going further than Lipinskirsquos rule in drugdesign Expert Opin Drug Discov 2012799ndash107

59 Abreu RM V Froufe HJC Daniel POM et al ChemT an open-source software for building template-based chemical libra-ries SAR QSAR Environ Res 201122603ndash10

60 Kuntz ID Structure-based strategies for drug design and dis-covery Science 19922571078ndash82

61 Lang PT Brozell SR Mukherjee S et al DOCK 6 combiningtechniques to model RNA-small molecule complexes RNA2009151219ndash30

62 Kim D-S Kim C-M Won C-I et al BetaDock shape-prioritydocking method based on beta-complex J Biomol Struct Dyn201129219ndash42

63 Goodsell DS Olson AJ Automated docking of substrates toproteins by simulated annealing Proteins 19908195ndash202

64 Morris GM Ruth H Lindstrom W et al AutoDock4 andAutoDockTools4 automated docking with selective receptorflexibility J Comput Chem 2009302785ndash91

65 Bohm HJ LUDI rule-based automatic design of new sub-stituents for enzyme inhibitor leads J Comput Aided Mol Des19926593ndash606

66 Kramer B Metz G Rarey M et al Ligand docking and screen-ing with FLEXX Med Chem Res 19999463ndash78

67 Csermely P Palotai R Nussinov R Induced fit conform-ational selection and independent dynamic segments anextended view of binding events Trends Biochem Sci 201035539ndash46

68 Vogt AD Di Cera E Conformational selection or induced fitA critical appraisal of the kinetic mechanism Biochemistry2012515894ndash902

69 Totrov M Abagyan R Flexible ligand docking to multiplereceptor conformations a practical alternative Curr OpinStruct Biol 200818178ndash84

70 Hartmann C Antes I Lengauer T Docking and scoring withalternative side-chain conformations Proteins Struct FunctBioinform 200974712ndash26

71 Zhao Y Sanner MF FLIPDock docking flexible ligands intoflexible receptors Proteins Struct Funct Genet 200768726ndash37

72 Cavasotto CN Kovacs JA Abagyan RA Representing recep-tor flexibility in ligand docking through relevant normalmodes J Am Chem Soc 20051279632ndash40

73 Meiler J Baker D ROSETTALIGAND protein-small moleculedocking with full side-chain flexibility Proteins Struct FunctGenet 200665538ndash48

74 Tietze S Apostolakis J GlamDock development and valid-ation of a new docking tool on several thousand protein-ligand complexes J Chem Inf Model 2007471657ndash72

75 Venkatachalam CM Jiang X Oldfield T et al LigandFit anovel method for the shape-directed rapid docking ofligands to protein active sites J Mol Graph Model 200321289ndash307

76 Jones G Willett P Glen RC et al Development and validationof a genetic algorithm for flexible docking J Mol Biol 1997267727ndash48

77 Corbeil CR Englebienne P Moitessier N Docking ligandsinto flexible and solvated macromolecules 1 Developmentand validation of FITTED 10 J Chem Inf Model 200747435ndash49

78 Kramer B Rarey M Lengauer T Evaluation of the FlexX in-cremental construction algorithm for protein- ligand dock-ing Proteins Struct Funct Genet 199937228ndash41

79 Jain AN Surflex fully automatic flexible molecular dockingusing a molecular similarity-based search engine J MedChem 200346499ndash511

80 McGann M FRED pose prediction and virtual screeningaccuracy J Chem Inf Model 201151578ndash96

81 Zsoldos Z Reid D Simon A et al eHiTS an innovativeapproach to the docking and scoring function problemsCurr Protein Pept Sci 20067421ndash35

82 Ravitz O Zsoldos Z Simon A Improving molecular dockingthrough eHiTSrsquo tunable scoring function J Comput Aided MolDes 2011251033ndash51

83 Schmidt B DOCKTITE-a highly versatile step-by-step work-flow for covalent docking and virtual screening in MOEJ Chem Inf Model 2015 55398ndash406 doi101021ci500681r

84 Chemical Computing Group Inc Molecular OperatingEnvironment (MOE) Sci Comput Instrum 20042232

85 Ouyang X Zhou S Su CTT et al CovalentDock automatedcovalent docking with parameterized covalent linkageenergy estimation and molecular geometry constraintsJ Comput Chem 201334326ndash36

86 Brooks BR Bruccoleri RE Olafson BD et al CHARMM a pro-gram for macromolecular energy minimization anddynamics calculations J Comput Chem 19834187ndash17

87 Zhu K Borrelli KW Greenwood JR et al Docking covalentinhibitors a parameter free approach to pose prediction andscoring J Chem Inf Model 2014541932ndash40

88 London N Miller RM Irwin JJ et al Covalent Docking ofLarge Libraries for the Discovery of Chemical Probes BiophysJ 2014106264a

89 Klon AE Glick M Davies JW Combination of a naive bayesclassifier with consensus scoring improves enrichment ofhigh-throughput docking results J Med Chem 2004474356ndash9

90 Teramoto R Fukunishi H Supervised consensus scoringfor docking and virtual screening J Chem Inf Model 200747526ndash34

91 Plewczynski D Łazniewski M Grotthuss M Von et alVoteDock consensus docking method for prediction of pro-tein-ligand interactions J Comput Chem 201132568ndash81

12 | Glaab

at University of L



ownloaded from

92 Cross JB Thompson DC Rai BK et al Comparison of severalmolecular docking programs pose prediction and virtualscreening accuracy J Chem Inf Model 2009491455ndash74

93 Case DA Cheatham TE Darden T et al The Amberbiomolecular simulation programs J Comput Chem 2005261668ndash88

94 Huang SY Zou X Inclusion of solvation and entropy in theknowledge-based scoring function for protein-ligand inter-actions J Chem Inf Model 201050262ndash73

95 Charifson PS Corkery JJ Murcko MA et al Consensus scor-ing a method for obtaining improved hit rates from dockingdatabases of three-dimensional structures into proteinsJ Med Chem 1999425100ndash9

96 Beerenwinkel N Lengauer T Selbig J et al Geno2phenointerpreting genotypic HIV drug resistance tests IEEE IntellSyst Their Appl 20011635ndash41

97 Sun Y Ewing TJ Skillman AG et al CombiDOCK structure-based combinatorial docking and library design J ComputAided Mol Des 199812597ndash604

98 Verdonk ML Chessari G Cole JC et al Modeling water mol-ecules in protein-ligand docking using GOLD J Med Chem2005486504ndash15

99 Eldridge MD Murray CW Auton TR et al Empirical scoringfunctions I The development of a fast empirical scoringfunction to estimate the binding affinity of ligands in recep-tor complexes J Comput Aided Mol Des 199711425ndash45

100 Hao G-F Yang S-G Yang G-F Structure-based Design ofConformationally Flexible Reverse Transcriptase Inhibitorsto Combat Resistant HIV Curr Pharm Des 201420725ndash39

101 Esser L Yu C-A Xia D Structural basis of resistance to anti-cytochrome bc1 complex inhibitors implication for drugimprovement Curr Pharm Des 201320704ndash24

102 Rarey M Kramer B Lengauer T et al A fast flexible dockingmethod using an incremental construction algorithm J MolBiol 1996261470ndash89

103 Wang R Lai L Wang S Further development and validationof empirical scoring functions for structure-based bindingaffinity prediction J Comput Aided Mol Des 20021611ndash26

104 Gehlhaar DK Verkhivker GM Rejto PA et al Molecular rec-ognition of the inhibitor AG-1343 by HIV-1 protease confor-mationally flexible docking by evolutionary programmingChem Biol 19952317ndash24

105 Cao Y Li L Improved protein-ligand binding affinity predic-tion by using a curvature-dependent surface-area modelBioinformatics 2014301674ndash80

106 Li GB Yang LL Wang WJ et al ID-score a new empiricalscoring function based on a comprehensive set of descrip-tors related to protein-ligand interactions J Chem Inf Model201353592ndash600

107 Jain AN Surflex-Dock 21 robust performance from ligandenergetic modeling ring flexibility and knowledge-basedsearch J Comput Aided Mol Des 200721281ndash306

108 Gohlke H Hendlich M Klebe G Knowledge-based scoringfunction to predict protein-ligand interactions J Mol Biol2000295337ndash56

109 Neudert G Klebe G DSX a knowledge-based scoring func-tion for the assessment of protein-ligand complexes J ChemInf Model 2011512731ndash45

110 Muegge I PMF scoring revisited J Med Chem 2006495895ndash902

111 DeWitte RS Shakhnovich EI SMoG de novo design methodbased on simple fast and accurate free energy estimates 1Methodology and supporting evidence J Am Chem Soc 199611811733ndash44

112 Grinter SZ Zou X A Bayesian statistical approach of improv-ing knowledge-based scoring functions for protein-ligandinteractions J Comput Chem 201435932ndash43

113 Mooij WTM Verdonk ML General and targeted statisticalpotentials for protein-ligand interactions Proteins StructFunct Genet 200561272ndash87

114 Reulecke I Lange G Albrecht J et al Towards an integrateddescription of hydrogen bonding and dehydration decreas-ing false positives in virtual screening with the HYDE scor-ing function ChemMedChem 20083885ndash97

115 Guimaraes CRW Cardozo M MM-GBSA rescoring of dock-ing poses in structure-based lead optimization J Chem InfModel 200848958ndash70

116 Sgobba M Caporuscio F Anighoro A et al Application of apost-docking procedure based on MM-PBSA and MM-GBSAon single and multiple protein conformations Eur J MedChem 201258431ndash40

117 Kruger DM Evers A Comparison of structure- and ligand-based virtual screening protocols considering hit list com-plementarity and enrichment factors ChemMedChem 20105148ndash58

118 Huang N Shoichet BK Irwin JJ Benchmarking sets formolecular docking J Med Chem 2006496789ndash801

119 Rohrer SG Baumann K Maximum unbiased validation(MUV) data sets for virtual screening based on PubChem bio-activity data J Chem Inf Model 200949169ndash84

120 Shoichet BK Kuntz ID Bodian DL Molecular docking usingshape descriptors J Comput Chem 199213380ndash97

121 Norgan AP Coffman PK Kocher JPA et al Multilevel parallel-ization of autodock 42 J Cheminform 2011312

122 Collignon B Schulz R Smith JC et al Task-parallel mes-sage passing interface implementation of Autodock4for docking of very large databases of compounds usinghigh-performance super-computers J Comput Chem 2011321202ndash9

123 Korb O Stutzle T Exner TE Accelerating molecular dockingcalculations using graphics processing units J Chem InfModel 201151865ndash76

124 Pechan I Feher B Molecular docking on FPGA and GPU plat-forms Proc - 21st Int Conf F Program Log Appl FPL 20112011474ndash7

125 Pechan I Feher B Berces A FPGA-based acceleration of theAutoDock molecular docking software PhD ResMicroelectron Electron (PRIME) 2010 Conf 2010

126 Sastry GM Inakollu VSS Sherman W Boosting virtualscreening enrichments with data fusion coalescing hitsfrom two-dimensional fingerprints shape and dockingJ Chem Inf Model 2013531531ndash42

127 Miteva MA In Silico Lead Discovery Sharjah United ArabEmirates Bentham Science Publishers 2011

128 Bikker JA Brooijmans N Wissner A et al Kinase domainmutations in cancer implications for small molecule drugdesign strategies J Med Chem 2009521493ndash509

129 Longley DB Johnston PG Molecular mechanisms of drugresistance J Pathol 2005205275ndash92

130 Trott O Olson AJ AutoDock Vina J Comput Chem 201031445ndash61

131 Schnecke V Swanson CA Getzoff ED et al Screening apeptidyl database for potential ligands to proteinswith side-chain flexibility Proteins Struct Funct Genet 19983374ndash87

132 Zavodszky MI Rohatgi A Van Voorst JR et al Scoring ligandsimilarity in structure-based virtual screening J MolRecognit 200922280ndash92


at University of L



ownloaded from

133 Grosdidier A Zoete V Michielin O SwissDock a protein-small molecule docking web service based on EADock DSSNucleic Acids Res 201139W270ndash7

134 Grosdidier A Zoete V Michielin O Fast docking using theCHARMM force field with EADock DSS J Comput Chem 2011322149ndash59

135 Hsu K-C Chen Y-F Lin S-R et al iGEMDOCK a graphicalenvironment of enhancing GEMDOCK using pharmaco-logical interactions and post-screening analysis BMCBioinformatics 201112 (Suppl 1)S33

136 Ruiz-Carmona S Alvarez-Garcia D Foloppe N et al rDock afast versatile and open source program for docking ligandsto proteins and nucleic acids PLoS Comput Biol 2014101003571

137 Shin WH Kim JK Kim DS et al GalaxyDock2 protein-liganddocking using beta-complex and global optimizationJ Comput Chem 2013342647ndash56

138 Abagyan R Totrov M Kuznetsov D ICM - A new method forprotein modeling and design applications to docking andstructure prediction from the distorted native conformationJ Comput Chem 199415488ndash506

139 Bottegoni G Kufareva I Totrov M et al Four-dimensionaldocking a fast and accurate account of discrete receptorflexibility in ligand docking J Med Chem 200952397ndash406

140 Thomsen R Christensen MH MolDock a new techniquefor high-accuracy molecular docking J Med Chem 2006493315ndash21

141 Lie MA Thomsen R Pedersen CNS et al Molecular dockingwith ligand attached water molecules J Chem Inf Model 201151909ndash17

142 Weininger D SMILES a chemical language and informationsystem 1 Introduction to methodology and encoding rulesJ Chem Inf Comput Sci 19882831ndash6

143 Rarey M Dixon JS Feature trees a new molecular similaritymeasure based on tree matching J Comput Aided Mol Des199812471ndash90

144 Bravi G Gancia E Mascagni P et al MS-WHIM new 3D theor-etical descriptors derived from molecular surface propertiesa comparative 3D QSAR study in a series of steroidsJ Comput Aided Mol Des 19971179ndash92

145 Mundy JL Huang C Liu J et al MORSE a 3D object recogni-tion system based on geometric invariants Image UnderstWork 1994II1393ndash402

146 Cramer RD Patterson DE Bunce JD Comparativemolecular field analysis (CoMFA) 1 Effect of shape on bind-ing of steroids to carrier proteins J Am Chem Soc 19881105959ndash67

147 Klebe G Comparative Molecular Similarity Indices AnalysisCoMSIA Comp Gen Pharmacol 1998387ndash104

148 Tosco P Balle T Open3DQSAR a new open-source softwareaimed at high-throughput chemometric analysis of molecu-lar interaction fields J Mol Model 201117201ndash8

149 Mekenyan O Dynamic QSAR techniques applications indrug design and toxicology Curr Pharm Des 200281605ndash21

150 Todeschini R Consonni V Molecular descriptors for chemo-informatics Recent Adv QSAR Stud 201029ndash103

151 Chen X Reynolds CH Performance of similarity measures in2D fragment-based similarity searching comparison ofstructural descriptors and similarity coefficients J Chem InfComput Sci 2002421407ndash14

152 Vitanyi PMB Balbach FJ Cilibrasi RL et al Normalized infor-mation distance Inf Theory Stat Learn 200945ndash82

153 Melville JL Riley JF Hirst JD Similarity by compressionJ Chem Inf Model 20074725ndash33

154 Clement OO Mehl AT HipHop pharmacophores based onmultiple common-feature alignments In Pharmacophore per-ception Development and Use in Drug Design La JollaInternational University Line 2000 69ndash84

155 Mills JEJ de Esch IJP Perkins TDJ et al SLATE a method forthe superposition of flexible ligands J Comput Aided Mol Des20011581ndash96

156 Martin YC Bures MG Danaher EA et al A fast new approachto pharmacophore mapping and its application to dopamin-ergic and benzodiazepine agonists J Comput Aided Mol Des1993783ndash102

157 Jones G Willett P Glen RC A genetic algorithm for flexiblemolecular overlay and pharmacophore elucidation J ComputAided Mol Des 19959532ndash49

158 Richmond NJ Abrams CA Wolohan PRN et al GALAHAD 1Pharmacophore identification by hypermolecular alignmentof ligands in 3D J Comput Aided Mol Des 200620567ndash87

159 Jones G GAPE an improved genetic algorithm for pharma-cophore elucidation J Chem Inf Model 2010502001ndash18

160 Schneidman-Duhovny D Dror O Inbar Y et al PharmaGista webserver for ligand-based pharmacophore detectionNucleic Acids Res 200836W223ndash8

161 Lemmen C Lengauer T Klebe G FLEXS a method forfast flexible ligand superposition J Med Chem 1998414502ndash20

162 Pozzan A Molecular descriptors and methods for ligandbased virtual high throughput screening in drug discoveryCurr Pharm Des 2006122099ndash110

163 Gohlke H Klebe G Approaches to the description and pre-diction of the binding affinity of small-molecule ligands tomacromolecular receptors Angew Chemie Int Ed 2002412644ndash76

164 Wegner JK Frohlich H Zell A Feature selection for descrip-tor based classification models 1 Theory and GA-SEC algo-rithm J Chem Inf Comput Sci 200444921ndash30

165 Kubinyi H QSAR Hansch Analysis and Related ApproachesNew York Wiley Interscience 2008 1

166 Venkatraman V Perez-Nueno VI Mavridis L et alComprehensive comparison of ligand-based virtual screen-ing tools against the DUD data set reveals limitations of cur-rent 3D methods J Chem Inf Model 2010502079ndash93

167 Pan Y Li L Kim G et al Identification and validation of novelhuman pregnane X receptor activators among prescribeddrugs via ligand-based virtual screening Drug Metab Dispos201139337ndash44

168 Franke L Schwarz O Muller-Kuhrt L et al Identification ofnatural-product-derived inhibitors of 5-lipoxygenase activ-ity by ligand-based virtual screening J Med Chem 2007502640ndash6

169 Yamazaki K Kusunose N Fujita K et al Identification ofphosphodiesterase-1 and 5 dual inhibitors by a ligand-basedvirtual screening optimized for lead evolution BioorganicMed Chem Lett 2006161371ndash9

170 Swann SL Brown SP Muchmore SW et al A unified prob-abilistic framework for structure- and ligand-based virtualscreening J Med Chem 2011541223ndash32

171 Hou T Wang J Zhang W et al Recent advances in computa-tional prediction of drug absorption and permeability indrug discovery Curr Med Chem 2006132653ndash67

172 Varnek A Baskin I Machine learning methods for propertyprediction in chemoinformatics quo vadis J Chem Inf Model2012521413ndash37

173 Tropsha A Gramatica P Gombar VK The importance ofbeing earnest validation is the absolute essential for

14 | Glaab

at University of L



ownloaded from

successful application and interpretation of QSPR modelsQSAR Comb Sci 20032269ndash77

174 Testa B Balmat AL Long A et al Predicting drug metabolism- An evaluation of the expert system METEOR ChemBiodivers 20052872ndash85

175 Darvas F METABOLEXPERT an expert system for predictingmetabolism of substances QSAR Environ Toxicol 198771ndash81

176 Klopman G Tu M Talafous J META 3 A genetic algorithmfor metabolic transform priorities optimization J Chem InfComput Sci 199737329ndash34

177 Lewis DF V Ioannides C Parke D V COMPACT and molecu-lar structure in toxicity assessment a prospective evalu-ation of 30 chemicals currently being tested for rodentcarcinogenicity by the NCINTP Environ Health Perspect 19961041011ndash16

178 Benigni R Bossa C Alivernini S et al Assessment andValidation of US EPArsquos OncoLogicVR Expert System andAnalysis of Its Modulating Factors for Structural AlertsJ Environ Sci Health C Environ Carcinog Ecotoxicol Rev201230152ndash73

179 Saiakhov R Chakravarti S Klopman G Effectiveness ofCASE ultra expert system in evaluating adverse effects ofdrugs Mol Inform 20133287ndash97

180 Chakravarti SK Saiakhov RD Klopman G Optimizing pre-dictive performance of CASE ultra expert system modelsusing the applicability domains of individual toxicity alertsJ Chem Inf Model 2012522609ndash18

181 Saiakhov RD Klopman G MultiCASE Expert Systems andthe REACH Initiative Toxicol Mech Methods 200818159ndash75

182 Marchant CA Prediction of rodent carcinogenicity using theDEREK system for 30 chemicals currently being tested by theNational Toxicology Program Environ Health Perspect 19961041065ndash73

183 Venkatapathy R Moudgal CJ Bruce RM Assessment of theoral rat chronic lowest observed adverse effect level modelin TOPKAT a QSAR software package for toxicity predictionJ Chem Inf Comput Sci 2004441623ndash9

184 Dearden JC In silico prediction of drug toxicity J ComputAided Mol Des 200317119ndash27

185 Drwal MN Banerjee P Dunkel M et al ProTox a web serverfor the in silico prediction of rodent oral toxicity NucleicAcids Res 201442W53ndash8

186 Mombelli E Devillers J Evaluation of the OECD (Q)SARApplication Toolbox and Toxtree for predicting and profilingthe carcinogenic potential of chemicals SAR QSAR EnvironRes 201021731ndash52

187 Wang JC Chu PY Chen CM et al idTarget a web server foridentifying protein targets of small chemical molecules withrobust scoring functions and a divide-and-conquer dockingapproach Nucleic Acids Res 201240W393ndash9

188 Li H Gao Z Kang L et al TarFisDock a web server for iden-tifying drug targets with docking approach Nucleic Acids Res200634W219ndash24

189 Chen YZ Ung CY Prediction of potential toxicity and sideeffect protein targets of a small molecule by a ligand-proteininverse docking approach J Mol Graph Model 200120199ndash18

190 Kinnings SL Jackson RM ReverseScreen3D a structure-based ligand matching method to identify protein targetsJ Chem Inf Model 201151624ndash34

191 Liu X Ouyang S Yu B et al PharmMapper server aweb server for potential drug target identification usingpharmacophore mapping approach Nucleic Acids Res 201038W609ndash14

192 Keiser MJ Roth BL Armbruster BN et al Relating proteinpharmacology by ligand chemistry Nat Biotechnol 200725197ndash206

193 Gfeller D Grosdidier A Wirth M et al SwissTargetPredictiona web server for target prediction of bioactive small mol-ecules Nucleic Acids Res 201442W32ndash8

194 Dunkel M Gunther S Ahmed J et al SuperPred drug clas-sification and target prediction Nucleic Acids Res 200836W55ndash9

195 Mazanetz MP Marmon RJ Reisser CBT et al Drug discoveryapplications for KNIME an open source data mining plat-form Curr Top Med Chem 2012121965ndash79

196 Beisken S Meinl T Wiswedel B et al KNIME-CDK workflow-driven cheminformatics BMC Bioinformatics 201314257

197 Oinn T Addis M Ferris J et al Taverna a tool for the com-position and enactment of bioinformatics workflowsBioinformatics 2004203045ndash54

198 Lu Q Hao P Curcin V et al KDE Bioscience platform forbioinformatics analysis workflows J Biomed Inform 200639440ndash50

199 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

200 Ludascher B Altintas I Berkley C et al Scientific workflowmanagement and the Kepler system Concurr Comput PractExp 2006181039ndash65

201 Callahan SP Freire J Santos E et al VisTrails visualizationmeets data management In 2006 ACM SIGMOD InternationalConference Management Data New York ACM 2006 745ndash7

202 Sanner MF Stoffler D Olson AJ ViPEr a visual programmingenvironment for Python In Proceedings 10th InternationalPython Conference Austin N p (ISBN 1-930792-05-0) 2002103ndash15

203 Taylor I Shields M Wang I et al The triana workflow envir-onment architecture and applications Work e-Science SciWork Grids 2007320ndash39

204 Lehtovuori PT Nyronen TH SOMA - Workflow for smallmolecule property calculations on a multiplatform comput-ing grid J Chem Inf Model 200646620ndash5

205 Steinbeck C Han Y Kuhn S et al The ChemistryDevelopment Kit (CDK) an open-source Java library forchemo-and bioinformatics J Chem Inf Comput Sci 200343493ndash500

206 Spjuth O Helmus T Willighagen EL et al Bioclipse an opensource workbench for chemo- and bioinformatics BMCBioinformatics 2007859

207 Holbeck SL Update on NCI in vitro drug screen utilities Eur JCancer 200440785ndash93


at University of L



ownloaded from

Data collectionmolecular structure andinteraction databasesProtein structure databases

The availability of 3D structure data for a target protein of inter-est is a major benefit for virtual screening studies althoughpurely ligand-based screening methods may provide an alterna-tive if no suitable target structure can be obtained (see sectionon ligand-based screening below) An overview of the main pub-lic repositories for experimentally derived and in silico modelledprotein structures is given in Table 1 Among these the ProteinData Bank (PDB) [5] is the standard international archive forexperimental structural data of biological macromolecules cov-ering 107 000 structures as of March 2015 It provides access tothe most comprehensive collection of public X-ray crystal struc-tures and is the default resource to obtain protein structures forreceptor-based screening In spite of the rapid growth of thePDB almost doubling in size over the past six years many pro-tein families are still not covered by a representative structureand even in an ideal model scenario the coverage is notexpected to reach 80 before 2020 and 90 before 2027 [6] Asthe structures in the PDB are biased towards proteins that canbe purified and studied using X-ray crystallography nuclearmagnetic resonance (NMR) spectroscopy and electron micros-copy certain types of proteins including pharmacologically im-portant membrane proteins are underrepresented in thedatabase Importantly the quality of PDB structures is alsorestricted by limitations of the experimental methodologies eghydrogen atoms and flexible components cannot be resolvedvia X-ray diffraction and NMR techniques usually provide lowerresolutions than X-ray crystallography Often the experimentalmethods fail to determine the entire protein structure andmany PDB files have missing residues or atoms (see section onprotein structure pre-processing and quality control for guide-lines on how to deal with these and other potential shortcom-ings of PDB files)

If no suitable experimental structure for molecular dockingsimulations can be identified for a chosen target protein a bind-ing site structural model may alternatively be derived fromcomparative modelling if a template protein with close hom-ology to the target is available While the performance of dock-ing simulations using homology models will depend on thesequence similarity of the template(s) to the target protein thequality of the template structure(s) and the modelling approachthe analyses from a previous large-scale validation study byOshiro et al can provide a guideline on the results to be

expected in different scenarios [7] The authors assessed theperformance of docking into homology models using CDK2 andfactor VIIa screening data sets and found that when thesequence identity between the model and template near thebinding site is greater than 50 roughly 5 times more activecompounds are identified than by random chance (a perform-ance that was comparable with docking into crystal structuresaccording to their observations) Their publication providesa plot of the enrichment of true-positive discoveries versus thepercentage sequence identity between the template and targetwhich can serve as an orientation for future studies Large-scalecollections of existing protein structure models includingModBase [8] SWISS-MODEL [9] and PMP [10] are listed in Table1 as resources for proteins not covered by known experimentalstructures Alternatively new comparative models for spe-cific target proteins can be generated using dedicated homologymodelling tools reviewed in detail elsewhere [11] To preventspurious results due to low-quality models users can estimatethe accuracy of docking simulations based on homologymodels a priori via established indices for model quality assess-ment [12]

Small-molecule databases

Screening projects to identify new selective and potent inhibi-tors of a chosen target protein typically use large-scale com-pound libraries containing several thousands or millions ofsmall molecules to start the filtering process Depending on thegoal and type of the study (eg drug development toxin identifi-cation pesticide development) the compound library may con-tain already known drug substances for repositioning syntheticsubstances similar to lead or drug compounds for subsequentstructural optimization or other natural or xenobiotic com-pounds To design suitable compound libraries in terms of thetype number and commercial availability of the included mol-ecules access to large structured and well-annotated reposito-ries of small molecules is needed Some of the mostcomprehensive free databases include ZINC [14] (35 millioncompounds) PubChem [15] (64 million compounds) andChemSpider [16] While many of the largest databases(eg ChemNavigator [17] with 60 million compounds) are com-mercial and only provide restricted data access for academicresearch in recent years public initiatives and vendors ofsmall-molecule compounds have made several structured libra-ries publicly accessible When downloading structure files fromthese repositories users should note that they are usually not

Table 1 The main public repositories for experimentally derived and in silico modelled protein structures including details on content typeapproximate number of current entries and accessibility

Database Content type Approx no of entries Webpage

PDB [5] X-ray Solution NMR ElectronMicroscopy among others

107k (95k X-ray 11k SolutionNMR 700 Electron Microscopy300 others)

httpwwwrcsborg

ModBase [8] 3D protein models from compara-tive modelling

38 million models httpsalilaborgmodweb

SWISS-MODELRepository [9]

3D protein models from homologymodelling

32 million models httpswissmodelexpasyorgrepository

Protein Model Portal(PMP) [10]

Integration of modelled structuresfrom multiple servers

218 million models httpwwwproteinmodelportalorg

Structural BiologyKnowledgebase(SBKB) [13]

PDB structures and associatedhomology models

See PDB and PMP database httpsbkborg

2 | Glaab

at University of L



ownloaded from














at University of L



ownloaded from











4 | Glaab

at University of L



ownloaded from








at University of L



ownloaded from



























6 | Glaab

at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from














at University of L



ownloaded from











4 | Glaab

at University of L



ownloaded from








at University of L



ownloaded from



























6 | Glaab

at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from











4 | Glaab

at University of L



ownloaded from








at University of L



ownloaded from



























6 | Glaab

at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from








at University of L



ownloaded from



























6 | Glaab

at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from



























6 | Glaab

at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from









at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from














8 | Glaab

at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from






at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from






Conclusions


Key Points




Funding









10 | Glaab

at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from












































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from












































12 | Glaab

at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from











































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from










































14 | Glaab

at University of L



ownloaded from





































at University of L



ownloaded from





































at University of L



ownloaded from

building a virtual ligand screening pipeline using free

Documents