computational tools for protein-dna interactions

Upload: dr-muhammad-atif-attari

Post on 05-Apr-2018


  • 8/2/2019 Computational Tools for Protein-DNA Interactions


    Advanced Review

    Computational tools for protein–DNA interactions

    Christopher Kauffman and George Karypis

    Interactions between deoxyribonucleic acid (DNA) and proteins are central to living systems, and characterizing how and when they occur would greatly enhance our understanding of working genomes. We review the computational problems associated with protein–DNA interactions and the various methods used to solve them. A wide range of topics is covered including physics-based models for direct and indirect recognition, identification of transcription-factor-binding sites, and methods to predict DNA-binding proteins. Our goal is to introduce this important problem domain to data mining researchers by identifying the key issues and challenges inherent to the area as well as provide directions for fruitful future research. © 2011 Wiley Periodicals, Inc.

    How to cite this article:

    WIREs Data Mining Knowl Discov 2012, 2: 14–28 doi: 10.1002/widm.48

    INTRODUCTION

    Interactions between deoxyribonucleic acid (DNA) and proteins are widely recognized as central to living systems. These interactions come in a variety of forms including repair of damaged DNA and transcription of genes into RNA. More recently it has been found that, by binding to certain DNA segments, proteins can promote or repress the transcription of genes in the vicinity of the binding site. Proteins of this kind are referred to as transcription factors (TFs). The number of TFs in an organism appears to be related to the complexity of the underlying genome: as the number of genes increases, the number of TFs increases according to a power law.1 This many-fold increase of TFs appears to be required in order to manage transcription in higher organisms.

    Characterizing how and when protein–DNA interactions occur would greatly enhance our understanding of the genome at work. A full picture of the interactions will eventually allow characterization of which genes are transcribed at any given time in order for the organism to react dynamically to a changing environment. Protein–DNA interactions are studied both in the wet lab and computationally. Here a synergy exists: lab experiments provide data and problems for computational methods to solve while computation provides hypotheses which guide additional laboratory experiments.

    This article contains supplementary material available from the authors upon request or via the Internet at http://wileyonlinelibrary.com.

    Correspondence to: [email protected]

    Department of Computer Science, College of Science and Engineering, University of Minnesota, Minneapolis, MN, USA

    DOI: 10.1002/widm.48

    The goal of this paper is to review three major areas of interest for computational studies of protein–DNA interactions: (1) physics-based studies of protein–DNA interaction, (2) identification of TF-binding sites, and (3) identification of DNA-binding proteins.

    HOW MANY BINDING PROTEINS EXIST?

    Accounts of how many DNA-binding proteins exist vary through the literature. Attention is particularly focused on TFs. Older sources estimated that 2–3% of a prokaryotic genome and 6–7% of a eukaryotic genome encode DNA-binding proteins.2 This number was taken from the automatic gene annotation tool PEDANT.3 Though contemporary estimates of the number of TFs range as high as 10% of all mammalian genes,4 averaging across genomes in the DBD database5 classifies 4.65% of Metazoan (animal) genes as TFs (806 genes per animal genome).1

    According to gene ontology annotations in PEDANT, there are currently 1714 genes in the human genome identified as coding for DNA-binding proteins with 885 of them identified as TFs.a This is slightly smaller than the numbers currently in the AMIGO gene ontology browserb which are given in Table 1.

    Volume 2, January/February 2012 © 2011 John Wiley & Sons, Inc.


    WIREs Data Mining and Knowledge Discovery: Computational tools for protein–DNA interactions

    TABLE 1 Counts of Genes with DNA-Binding Annotations for Human in the AMIGO Gene Ontology Browser

    GO Term                          Count1    %all2    %func3
    All gene products                18,269    100.0    115.6
    Molecular function given         15,801     86.5    100.0
    DNA binding                        2375     13.0     15.0
    Transcription factor activity       969      5.3      6.1

    1Count is the raw number of genes.
    2%all is the percentage of all genes that have the given GO term.
    3%func is the percentage of genes with a molecular function which have the given GO annotation.
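    The two percentage columns of Table 1 follow directly from the raw counts. The short sketch below recomputes them; the dictionary keys and the function name `percentages` are illustrative choices, not part of the original article.

```python
# Counts as reported in Table 1 (human, AMIGO gene ontology browser).
counts = {
    "All gene products": 18269,
    "Molecular function given": 15801,
    "DNA binding": 2375,
    "Transcription factor activity": 969,
}


def percentages(counts):
    """Return {term: (%all, %func)} rounded to one decimal place."""
    total = counts["All gene products"]
    with_function = counts["Molecular function given"]
    return {
        term: (round(100.0 * n / total, 1), round(100.0 * n / with_function, 1))
        for term, n in counts.items()
    }


table = percentages(counts)
for term, (pct_all, pct_func) in table.items():
    print(f"{term:32s} {pct_all:6.1f} {pct_func:6.1f}")
```

    Note that %func for "All gene products" exceeds 100 because its denominator is the smaller set of genes with an annotated molecular function.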

    For researchers interested in DNA-binding protein structures, the protein data bank (PDB)6 currently holds structures for 2372 proteins with DNA-binding gene ontology terms while 1400 of these actually have DNA structural information present in them. However, many of these structure entries are redundant in that their sequences are nearly identical: the largest data set of nonredundant structures reported in the literature contains 179 proteins.7 See the discussion on available data sets later in this paper.

    Most proteins are composed of several independent units called domains. A domain which interacts with DNA is referred to as a DNA-binding domain and contains a structural motif that enables binding (see section 7.4 of Ref 8). A DNA-binding protein has a binding domain and possibly several other domains that determine its function. Multiple copies of DNA-binding domains may be present in a DNA-binding protein. This leads to some ambiguity in the literature, as "DNA-binding protein" may sometimes refer only to the binding domain or to the whole protein including both binding and nonbinding domains. Here we deal solely with interactions between binding domains and DNA. An overview of DNA-binding domains can be found in Ref 2.

    PHYSICAL MODELS AND ENERGETICS

    Insight can be gained about DNA–protein interactions by studying them using physics models. Approaches in the literature examine bound protein–DNA complexes and either apply existing software to obtain interaction energy or develop new energy functions. Both approaches make use of complex structures from the PDB. The goals of such studies are usually to establish why binding happens, to quantify energy changes between the bound and unbound states, and to understand how mutation in either protein or DNA may affect binding affinity. Basic understanding of binding physics guides both the development of TF-binding site models and the generation of protein and DNA features used in machine learning.

    Early Work

    An early review of the structure motifs used by TFs provided a number of principles used by the proteins to recognize DNA.9 Subsequent studies on protein–DNA interactions characterized binding based on the frequency of protein residue contacts with nucleic acids in crystallized complexes.10,11 The propensity for each type of contact to form was determined by comparing the observed number of contacts between each type of amino acid and nucleic acid to the number expected if contacts formed solely based on the frequency of each residue/nucleic acid type. The resulting propensities seemed to agree fairly well with the limited available experimental data: data from Ref 12 is fit in Ref 13 and in many cases the residue/base-pair combinations of zinc finger variants were predicted correctly. Stormo14 reviews some early developments of representations of binding preferences. Simple models of recognition such as the theory of a linear code were popular early on: structural motifs in proteins allow matching of specific amino acids to specific DNA base pairs. This idea is still relevant to certain families of TFs such as the zinc finger domains.15
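    The observed-versus-expected comparison behind contact propensities can be sketched as follows. This is a minimal illustration under stated assumptions: the contact counts and frequencies are invented toy values, and the log-ratio form is one common convention rather than the exact statistic used in the cited studies.

```python
import math


def contact_propensities(observed, aa_freq, base_freq):
    """Log-ratio of observed to expected contact counts.

    The expected count assumes contacts form purely according to the
    frequency of each amino acid type and each base type.
    """
    total = sum(observed.values())
    result = {}
    for (aa, base), n_obs in observed.items():
        expected = total * aa_freq[aa] * base_freq[base]
        result[(aa, base)] = math.log(n_obs / expected) if n_obs > 0 else float("-inf")
    return result


# Hypothetical toy data (not taken from any crystal structure survey).
observed = {("ARG", "G"): 30, ("ARG", "A"): 10, ("LEU", "G"): 5, ("LEU", "A"): 15}
aa_freq = {"ARG": 0.5, "LEU": 0.5}
base_freq = {"G": 0.5, "A": 0.5}

propensity = contact_propensities(observed, aa_freq, base_freq)
for pair, value in sorted(propensity.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

    A positive value marks a contact type seen more often than chance would suggest, and a negative value marks one seen less often.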

    Once a sufficient number of different DNA-binding protein families became available, it became apparent that various protein structures use diverse means of binding and achieving binding specificity to targeted sequences of DNA, calling for more complex modeling techniques.16 The current understanding is that binding is a stochastic process, making probabilistic models more appropriate for modeling protein–DNA interactions.13,17 In addition, contacts between the protein and DNA backbone, between the protein backbone and DNA, and the presence of different types of interactions (electrostatic, van der Waals, hydrogen bonds, water-mediated bonds) have led to more detailed consideration of the energy change involved in binding.

    Physics of Recognition Mechanisms

    Protein–DNA binding is thought to occur because the bound pair has lower free energy than the unbound molecules. A variety of factors governing free energy change are considered by Jayaram et al. in Ref 18 such as desolvation of DNA and protein and electrostatic and van der Waals interactions. Some of these factors affect all molecular interactions, but studies of protein–DNA binding have identified two categories of binding mechanisms which allow specificity to be


    Advanced Review wires.wiley.com/widm

    achieved. The first category involves energetically favorable interactions between atoms in the protein and DNA, sometimes called direct recognition or base recognition. The second category concerns the energy required to deform DNA to accommodate binding to the protein, referred to as indirect recognition or shape recognition. Both categories are described in detail in a review of recognition mechanisms16 as well as a more recent review of the subject.19 A few studies estimate both direct and indirect energies (e.g., Ref 18) while other work has studied direct20 or indirect21 recognition mechanisms separately. The review by Rohs et al.19 argues that these two recognition mechanisms are used together by almost all DNA-binding proteins, and that binding site specificity is achieved by combining direct and indirect effects.
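    The free-energy argument above can be written schematically as follows. The symbols are notation introduced here for illustration, and the additive split into direct and indirect terms is a simplifying assumption rather than a result stated by the cited studies.

```latex
\Delta G_{\mathrm{bind}}
  = G_{\mathrm{complex}} - \left( G_{\mathrm{protein}} + G_{\mathrm{DNA}} \right) < 0,
\qquad
\Delta G_{\mathrm{bind}} \approx \Delta G_{\mathrm{direct}} + \Delta G_{\mathrm{indirect}}
```

    Here the first relation restates that the bound complex has lower free energy than the separated molecules, and the second groups contributions from protein–DNA atomic contacts (direct) and from DNA deformation (indirect).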

    Specificity Tests: Mutating DNA and Protein Sequences

    A common use of binding energetics models is to study DNA mutations and their effects on binding energy. Determining which DNA sequences result in low-energy binding to the protein indicates the protein's likely binding sites on the genome.20,22 The models of binding energetics can be verified using experimental techniques which measure binding specificity of proteins. These are reviewed by Stormo and Zhao in Ref 23. Aside from pure physics-based approaches, machine learning has been employed in some cases to aid in this task. An interesting example is in Ref 24 where DNA-binding sequences for proteins were predicted by training a perceptron on deformation energy. This is in contrast to TF-binding site location, described in the next sections, which employs statistics and motif identification rather than physical models.
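    A mutation scan of this kind can be sketched in a few lines. The example below is only an illustration: the per-position energy table `ENERGY` is hypothetical and additive, standing in for a real physics-based energy function, and all function names are invented for this sketch.

```python
BASES = "ACGT"

# Hypothetical per-position binding energies (arbitrary units, lower = tighter).
# A real study would obtain energies from a physical model of the complex.
ENERGY = [
    {"A": 0.0, "C": 1.2, "G": 0.8, "T": 1.5},
    {"A": 1.0, "C": 0.0, "G": 1.4, "T": 0.9},
    {"A": 0.7, "C": 1.1, "G": 0.0, "T": 1.3},
    {"A": 1.6, "C": 0.5, "G": 1.0, "T": 0.0},
]


def binding_energy(site):
    """Sum the per-position energies of a DNA site (additivity is an assumption)."""
    if len(site) != len(ENERGY):
        raise ValueError("site length must match the energy table")
    return sum(column[base] for column, base in zip(ENERGY, site))


def single_mutants(site):
    """Yield (position, new_base, mutant_site) for every single-base substitution."""
    for i, ref in enumerate(site):
        for base in BASES:
            if base != ref:
                yield i, base, site[:i] + base + site[i + 1:]


def mutation_scan(site):
    """Rank single mutants by energy change relative to the reference site."""
    reference = binding_energy(site)
    rows = [(binding_energy(m) - reference, i, b, m) for i, b, m in single_mutants(site)]
    return sorted(rows)


scan = mutation_scan("ACGT")
for delta, position, base, mutant in scan:
    print(f"{mutant}  pos={position} -> {base}  ddE={delta:+.1f}")
```

    Mutants near the top of the ranking are predicted to be tolerated best, which is the kind of output that can be compared against experimentally measured specificity.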

    TRANSCRIPTION-FACTOR-BINDING SITE IDENTIFICATION

    TFs are DNA-binding proteins whose primary purpose is to regulate the transcription of genes. Though there are some exceptions, many TFs accomplish regulation by binding to DNA at specific sites. The presence of the bound TF will attract or obstruct RNA polymerase, thus promoting or repressing gene expression, respectively. TFs appear in greater abundance in eukaryotes and higher animals, allowing more complex regulatory control of how and when genes are transcribed.1

    In order to form a picture of the working genome, it has become important to identify the genes that TFs affect by finding the genomic locations to which they bind. Computational tools comprise an important part of this discovery process.

    Reviews of TF-Binding Site Discovery

    TF-binding site identification is a well-studied area and continues to develop rapidly. Here we mention a few good reviews of the area which are useful to understanding the data and tools available for analysis.

    Narlikar and Ovcharenko25 provide a good overview of the lab science behind TF-binding site identification and 184 citations to past literature. They include a brief section on computational tools to derive TF properties/models like position weight matrices. Computational methods for discovering other genomic regulatory elements are also discussed.

    Hannenhalli26 gives a review of current computational techniques for various representations of TF-binding sites and how they are derived. The review also briefly covers techniques for identification of other regulatory modules. Vingron et al.27 describe recent developments of computational techniques that expand on the capabilities of TF-binding site identification. An older but focused review from Bulyk is in Ref 28 while Elnitski et al.29 cover methods specific to TF-binding sites in mammals.

    Charoensawan et al.1 give a current review of the resources available for study of TFs including databases of TFs with known binding sites and the types of annotations available for the TFs.

    Finally, Das and Dai30 surveyed motif discovery algorithms, which may be of use in determining appropriate algorithms for a particular task. Supplementing this is a slightly older benchmark of motif discovery algorithms performed by Tompa et al.31 A blind evaluation of the algorithms was done on both synthetic and experimental data, which may still be used as a guide for algorithm selection.

    Motif Identification

    Typically, biologists are interested in which genes a TF regulates. This can be determined by identifying the genomic locations to which the TF binds. In motif identification, one starts with a collection of DNA sequences thought to contain TF-binding sites. The computational task is to identify the TF-binding site amongst these DNA sequences. Early approaches used simple models such as exact DNA motif sequences. These have largely been supplanted by position weight matrices [PWMs, alternatively referred to as position specific scoring matrices (PSSMs)] as they more accurately model the probabilistic nature of binding. Though the assumption of independent contributions from each position of PWM is not entirely


    realistic,32 PWM methods are sufficient for the purpose of motif identification33–35 especially when used in the context of locating entire regulatory modules.36 More sophisticated models explore interdependence of DNA positions37,38 and use prior probability models based on TF class.39 The newest models incorporate additional information specific to the experimental technique used to derive the DNA sequence collection.40
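    To make the PWM idea concrete, the sketch below builds a log-odds matrix from a handful of aligned sites and scans a longer sequence for the best-scoring window. The binding sites, the pseudocount, and the uniform background are illustrative assumptions, and the function names are invented for this example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    denom = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / denom) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Score a window by summing independent per-position contributions."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)


# Hypothetical aligned binding sites, used only for illustration.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
top_score, top_start = best_hit(pwm, "GGCCTATAATGCGC")
print(top_start, round(top_score, 2))
```

    The `score` function adds one term per position, which is exactly the position-independence assumption discussed above; models with interdependent positions would replace that sum.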

    An alternative to direct motif detection is phylogenetic footprinting. Homologous genomes are aligned to identify conserved noncoding regions which are likely to assume regulatory roles such as serving as TF-binding sites. A number of such approaches are reviewed in Refs 41 and 42.
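    The core of phylogenetic footprinting, finding stretches that stay identical across aligned homologous sequences, can be sketched as below. This is a deliberately simplified illustration: it assumes a precomputed gapped alignment, requires perfect identity in every column, and uses an invented toy alignment, whereas published methods use more tolerant conservation scores.

```python
def conserved_blocks(alignment, min_length=5):
    """Return (start, end) column ranges where every aligned sequence agrees.

    Columns containing a gap ('-') are treated as not conserved. The end
    index is exclusive, so alignment[0][start:end] is the conserved block.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    blocks, start = [], None
    for col in range(length + 1):  # one extra step to close a trailing block
        conserved = (
            col < length
            and alignment[0][col] != "-"
            and all(row[col] == alignment[0][col] for row in alignment)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    return blocks


# Hypothetical alignment of noncoding regions from three species.
alignment = [
    "ACGTTGACGTCATT-GA",
    "TCGATGACGTCAATCGA",
    "GCGCTGACGTCAGT-GT",
]
blocks = conserved_blocks(alignment)
for start, end in blocks:
    print(start, end, alignment[0][start:end])
```

    Blocks reported this way are candidate regulatory elements that could then be examined with a motif model.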

    The function of a new gene can be inferred from the TFs associated with it. Using a library of TF-binding sites, one can detect TF-binding sites in the noncoding region near a gene. Enrichment of a particular TF indicates the gene may share a function with other genes that the TF affects.43,44

    Obtaining DNA Sequences for Motif Identification: Experimental Methods

    Computational motif identification requires a collection of DNA sequences which contain a DNA-binding motif. Several wet lab techniques can provide such a collection by determining the approximate genomic location of TF-binding sites. Chromatin immunoprecipitation (ChIP) is a fundamental tool used in most wet lab TF-binding site identification techniques. ChIP allows an in vivo snapshot of the proteins bound to DNA to be obtained. Traditionally, ChIP was followed by microarray analysis, together called ChIP-chip.45 More recently, the ChIP-seq approach follows chromatin immunoprecipitation with sequencing of the DNA.46 Another wet lab technology directly measures in vitro binding affinities between DNA and proteins using protein-binding microarrays.47

    Alternatively, coregulated genes may be used as a source for approximate TF-binding sites. Genes that are up- and down-regulated together are typically affected by the same TFs. Thus, the noncoding regions near these genes constitute a collection of DNA sequences which are likely to contain binding sites for a TF.48

    IDENTIFICATION OF BINDING PROTEINS AND BINDING RESIDUES

    Although studies of TFs tend to focus on DNA motifs and binding locations in the genome, attributes specific to DNA-binding proteins are also of interest. After isolating a new protein, biologists frequently want to discern its function. Data mining may be used to distinguish DNA-binding proteins from other types. Once it is established that a given protein interacts with DNA, a biologist may be interested in which of the protein's residues are involved with binding. Computational methods are of service here again to perform binding residue identification.

    Both binding protein and binding residue identification may be addressed using techniques from supervised machine learning. The goal is to train a model which differentiates between the binding (positive) class and nonbinding (negative) class. The classes may represent either whole proteins or individual residues. The usual process for supervised learning is the following: establish a set of proteins as training examples, determine which features of the proteins will be given to the computational model as input, and then train the model to discriminate between binding and nonbinding classes. Predictive performance is evaluated on proteins which are excluded from the training process in order to judge the method's capabilities on future data.
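    The train, featurize, and evaluate steps described above can be sketched with a tiny nearest-centroid classifier. Everything here is an assumption made for illustration: the sequences are invented, the two features (fraction of positively and negatively charged residues) are chosen for simplicity, and published predictors use far richer features and learners.

```python
def features(seq):
    """Two toy features: fraction of K/R residues and fraction of D/E residues."""
    n = len(seq)
    positive = sum(seq.count(a) for a in "KR") / n
    negative = sum(seq.count(a) for a in "DE") / n
    return (positive, negative)


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return tuple(sum(v[i] for v in vectors) / len(vectors)
                 for i in range(len(vectors[0])))


def train(examples):
    """Fit one centroid per class label from (sequence, label) pairs."""
    by_label = {}
    for seq, label in examples:
        by_label.setdefault(label, []).append(features(seq))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, seq):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    x = features(seq)
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(x, model[label])))


# Hypothetical training and held-out proteins (sequences are invented).
TRAIN = [
    ("MKRKRSKGRKAL", "binding"),
    ("MRRKKLSRKRAG", "binding"),
    ("MDEELLDEAEGS", "nonbinding"),
    ("MSDEEVLDDLEA", "nonbinding"),
]
TEST = [
    ("MKKRRTGKRKSA", "binding"),
    ("MEDDLSEEDLAG", "nonbinding"),
]

model = train(TRAIN)
accuracy = sum(predict(model, seq) == label for seq, label in TEST) / len(TEST)
print(f"held-out accuracy: {accuracy:.2f}")
```

    The held-out proteins in `TEST` never influence the centroids, mirroring the requirement that performance be judged on examples excluded from training.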

    Whole Protein versus Residue-Level Predictions

    Most methods focus on predicting at either the whole protein level or residue level. Some methods accomplish both tasks simultaneously, but for the most part, addressing these two problems calls for different techniques.

    In the first case, the task is to identify DNA-binding proteins amongst proteins with other functions. This has increasing relevance as both sequencing and structural genomics projects have dramatically increased the number of proteins with unknown function. A variety of methods have been developed to accomplish this task.7,49–55

    Prediction of DNA-binding residues assumes that the protein under scrutiny binds DNA and predicts which residues are involved at the interface. Again, a wide array of approaches utilizing both sequence and structure features have been developed for residue-level prediction.50,56–68

    Although DNA-binding protein predictions are used primarily to elucidate the function of a new protein, there are several uses for DNA-binding residue predictions. They may be used to guide wet lab mutation experiments that affect binding affinity between protein and DNA. Rather than trying every residue in the protein, attention may be focused on mutating only residues which are predicted to play a role in


    FIGURE 1 | Identification tasks suitable for machine learning. (a) DNA binding-protein identification using sequence. (b) DNA binding-protein identification using structure. (c) DNA binding-residue identification using sequence. (d) DNA binding-residue identification using structure.

    binding. When structure is available but unbound, it may be possible to use predicted binding residues to help identify the geometric binding site on the protein, as has been done for small ligands,69 though no work has yet explored this approach for DNA–protein interactions.

    Prediction of DNA-Binding Function

    In the current literature, most methods approach binding protein and binding residue identification assuming either (1) the protein of interest has known structure, or (2) only the protein's sequence is available. Figure 1 illustrates these two assumptions in the context of binding protein and binding residue prediction. A third class of methods, known as homology modeling or threading, makes predictions by assessing the compatibility of a target protein with DNA-binding structures.

    Prediction from Structure

    Knowledge of the protein's structure can be very helpful in determining its DNA-binding status. The structure may come from several sources. Traditionally, a protein's structure has been determined experimentally due to specific interest in how it fulfills its role in a biological system. Thus X-ray diffraction is used to determine the structure of protein–DNA complexes and this information is deposited in structural databases, primarily the PDB. These database entries provide examples for learning predictive models as the protein's function is typically well characterized. In some cases, two structures of the protein are available: the bound complex which has DNA present (holo protein conformation) and the unbound protein with no DNA present (apo conformation).

    Though studies of single proteins have traditionally been the source for structure information, structural genomics projects are producing the structures of many new proteins for which no function information is available.70 DNA-binding proteins produced by structural genomics efforts are usually determined in their apo (unbound) conformation. Annotating the protein as DNA-binding would greatly illuminate its biological role.


    A very simple method of determining whether a protein is DNA-binding is to identify similar structures of known function using any of a number of structure alignment methods. However, the presence of a good structural match does not definitively establish the function of the protein, as proteins with similar structure but different function exist. DNA-binding residues can be inferred as those structurally aligned to known binding residues.

    Rather than rely directly on structural similarity to known DNA-binding proteins to classify the function of new proteins, there are several lines of research which exploit structure features for identification of DNA-binding proteins. Examples of these include direct use of structural motifs and electrostatics to predict function, or the encoding of structural information into features amenable to machine learning methods.51,53,54,61,71,72 Such structural features and techniques are also heavily employed in methods for binding residue prediction.50,57,58,61,68

    Prediction from Sequence

    Difficulty in determining a protein's structure has motivated the development of binding predictors which utilize only sequence information. Such methods predict whether a protein binds DNA and which residues are involved in the process without relying on the geometry of the protein.

    Aside from using standard sequence database searches such as BLAST and PSI-BLAST, few purely sequence-based methods are available for binding protein prediction.49 This is likely due to the difficulty of encoding information about an entire protein in the type of fixed-length feature vector required by most machine learners. Due to the simplicity of representing single-residue sequence information to a machine learner, more work has focused on methods for binding residue prediction from sequence.56,59,61,62,66,67,73–75

    There have been some claims that these template-free models, which do not consider structural aspects of the protein, give inferior performance to their structure-based counterparts.55 This may simply be due to sequence-based methods relinquishing potentially useful structure information in order to make predictions when it is not available. When a good structure model is available, binding predictions will likely be improved by employing it. However, no true head-to-head benchmark between structure and sequence methods has yet been executed to illustrate the superiority of one over the other.

    Homology Modeling and Threading

    A technique that has proved effective for DNA-binding prediction but does not constitute traditional machine learning is homology modeling and its relative, threading. In both techniques, a target protein with unknown structure is modeled by identifying a template protein of known structure. The target sequence is then mapped onto the known template structure and refined (e.g., Ref 76). The key to this process is identifying a good template with known structure, which is usually done via a combination of sequence similarity and energy calculations on the sequence-structure mapping. In that sense, homology modeling may be likened to a nearest neighbor computation with sequence-structure compatibility acting as a specialized similarity measure. However, mapping a target sequence onto the template to produce a model structure is above and beyond the typical nearest neighbor method.
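    The nearest-neighbor analogy can be illustrated with a short sketch. It is only an analogy under loud assumptions: Python's `difflib.SequenceMatcher` stands in for a real sequence-structure compatibility score, the template library `TEMPLATES` and its annotations are invented, and no structure is actually built or refined.

```python
from difflib import SequenceMatcher

# Hypothetical template library; sequences and annotations are invented.
TEMPLATES = {
    "tmplA": {"seq": "MKRGRPRKYSDE", "binds_dna": True, "binding_pos": {2, 4, 6}},
    "tmplB": {"seq": "MSLLDEVAAGLE", "binds_dna": False, "binding_pos": set()},
}


def similarity(target, template_name):
    """Stand-in similarity: matched fraction from a generic string matcher."""
    template_seq = TEMPLATES[template_name]["seq"]
    return SequenceMatcher(None, target, template_seq, autojunk=False).ratio()


def nearest_template(target):
    """Pick the template with the highest similarity to the target."""
    best = max(TEMPLATES, key=lambda name: similarity(target, name))
    return best, similarity(target, best)


def transfer_binding_positions(target, template_name):
    """Map annotated template positions onto the target via matched blocks."""
    template = TEMPLATES[template_name]
    matcher = SequenceMatcher(None, target, template["seq"], autojunk=False)
    mapped = set()
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            if block.b + offset in template["binding_pos"]:
                mapped.add(block.a + offset)
    return mapped


target = "AMKRGRPRKWSDE"  # invented target: one extra residue, one substitution
best_name, best_score = nearest_template(target)
predicted_positions = transfer_binding_positions(target, best_name)
print(best_name, round(best_score, 2), sorted(predicted_positions))
```

    The last step, carrying binding annotations from template to target through the alignment, is the part that goes beyond a plain nearest-neighbor label lookup.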

    Threading methods can handle both whole protein and residuewise binding prediction.7 Merely finding a good template match which is a DNA-binding protein is an indication that the target may also bind DNA, but it is not sufficient evidence to declare the target is DNA binding. Other nonbinding templates will likely score well, necessitating additional information such as interaction energy analysis between the target's model structure and DNA.55 After structurally mapping a target onto its template, binding residues may be identified based on the corresponding binding residues in the template.

    When no structure is available for a target protein, in some cases it may be possible to generate a full three-dimensional model using homology modeling or threading. In most cases, homology models are not entirely accurate, but for the purpose of determining whether the protein binds DNA, recent work has demonstrated that the use of homology models has promise.22,77 The second study also showed good prediction performance when training on both bound and unbound structures, making it viable for characterization of structural genomics targets.

    Producing a homology model of the protein's structure may fail for several reasons, most commonly because no suitable template is available. Dependence on a good structural template is the primary disadvantage of template-based methods.55 According to the literature, a missing template is a frequent problem: over 40% of DNA-binding proteins have no suitable template for homology modeling (see Ref 56, section 1.2).


    Machine Learning Features

    Numerous features have been employed in prediction schemes for binding proteins and binding residues. These are divided into structure and sequence features. There is mild overlap in some cases: for instance, secondary structure is available from the protein's structure or it may be predicted from sequence.

    Structural Features

    Electrostatic potentials: Molecular dynamics software is used to compute the charges for each atom, which are usually averaged to assign an electrostatic score to each residue.50–53 Software is also available for this specific task.78

    Dipole and quadrupole moments: Charge moments measure how widely distributed electric charge is across the protein. Fairly simple methods can calculate the electric dipole and quadrupole from structure and, according to the cited study, dipoles in combination with overall charge make a fairly discriminatory feature between binding and nonbinding proteins.79

    Structural motifs: Certain structural motifs (patterns) are known for interaction with DNA. Identifying such a motif in a novel protein can lend support to its classification as a DNA-binding protein.51

    Structural neighborhood: A simple representation of residue environment is to count the other amino acids inside a ball centered on the residue of interest.61

    Surface curvature: In order to accommodate bound DNA, proteins may exhibit certain curvature, at least locally at the binding site.52

    Secondary structure: Proteins assume local, repeated geometric patterns called secondary structure, which may be calculated from the protein's coordinates80 or may be predicted from sequence.81,82 Several studies have shown that secondary structure is not a particularly informative feature for DNA-binding identification.58,83

    Solvent accessible surface area (SASA): Binding residues are almost always well exposed to solvent to enable them to form contacts with DNA, making SASA a useful predictive feature. Like secondary structure, SASA can be calculated from the protein structure or predicted from sequence. Some studies limit their focus to only surface residues from the outset.57
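    Of the structural features listed above, the structural neighborhood is the simplest to sketch. The example below counts residue types inside a sphere around a chosen residue; representing each residue by a single coordinate (for instance its C-alpha atom), the 10 Å radius, and the toy coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter


def neighborhood_counts(residues, center_index, radius=10.0):
    """Count amino acid types within `radius` of the residue at center_index.

    `residues` is a list of (amino_acid, (x, y, z)) pairs with one
    representative coordinate per residue. The central residue is excluded.
    """
    _, center = residues[center_index]
    counts = Counter()
    for i, (amino_acid, xyz) in enumerate(residues):
        if i != center_index and math.dist(center, xyz) <= radius:
            counts[amino_acid] += 1
    return counts


# Hypothetical coordinates in angstroms, not taken from a real structure.
residues = [
    ("ARG", (0.0, 0.0, 0.0)),
    ("LYS", (3.0, 0.0, 0.0)),
    ("GLY", (0.0, 6.0, 0.0)),
    ("LYS", (0.0, 0.0, 9.0)),
    ("ASP", (12.0, 0.0, 0.0)),
    ("LEU", (8.0, 8.0, 8.0)),
]

counts = neighborhood_counts(residues, center_index=0)
print(dict(counts))
```

    Turning the counter into a fixed-length 20-dimensional vector (one slot per amino acid type) gives a feature that can be passed to a machine learner.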

    Sequence Features

    Amino acid sequence: The most common feature to any sequence-based predictor, the protein's amino acid sequence provides baseline information to the predictor. Each residue of the raw sequence is usually encoded as a 20-dimensional binary vector. Positively charged residues such as arginine are more likely to interact with the negatively charged backbone of DNA according to both physical and statistical studies.83

    Residue class/type: The 20 amino acids may be grouped according to physical properties such as charge and hydrophobicity; the group is then used as an additional sequence feature, as with the six classes in Ref 66.

    Sequence profiles: The majority of machine learning approaches to bioinformatics problems now employ sequence profiles rather than raw sequence as profiles are generally acknowledged to provide better information. Profiles are usually generated using PSI-BLAST84 and represent the probability of substituting a different amino acid for the one observed at a specific position. This is encoded as a 20-dimensional vector at each sequence position, with positive numbers indicating favorable substitutions and negative numbers denoting unfavorable substitutions.

    Global composition of AAs: When attempting to identify DNA-binding proteins, counts or frequencies of each type of amino acid are often used, typically as a 20-dimensional vector. Pairs of adjacent residues have also been used as a compositional feature.85

    Hydrophobicity: Measures of residue hydrophobicity, the degree to which the residue is repelled by water, are a commonly used feature. A typical example is the hydrophobicity scale in Ref 86 which assigns a fixed numerical value to each of the twenty amino acid types.

    Evolutionarily conserved residues: Residues that mediate interactions between proteins and DNA are usually conserved through evolution. Thus identifying conserved residues can yield a powerful feature. This may be done using only sequence or combined with structural information to yield collections of conserved residues which are proximal in


    TABLE 2 Commonly Used Data Sets for DNA-Binding Protein and DNA-Binding Residue Identification

    ID            Study  Notes
    DB179         7      179 DNA-binding proteins, almost entirely nonredundant at 40% sequence identity
    NB3797        7      3797 nonbinding proteins, significant redundancy at 35% sequence identity level (only 3482 independent clusters)
    APO/HOLO104   7      104 unbound/bound pairs of DNA-binding proteins, maximum 30% identity, 10 apo/holo pairs have less than 90% sequence identity.
    PD138         77     138 DNA-binding proteins, almost entirely nonredundant at 35% sequence identity, divided into seven structural classes
    APO/HOLO54    77     54 unbound/bound pairs of DNA-binding proteins, maximum BLAST e-value for pairs of 0.1, 100% identity between APO/HOLO pairs. A few homologous sequences are present.
    DISIS         56     78 DNA-binding proteins, close to nonredundant at 20% sequence identity
    PDNA62        58     62 DNA-binding proteins, 78 chains, 57 nonredundant sequences at 30% identity.
    NB110         58     110 nonbinding proteins, nonredundant at 30% sequence identity level, derived from the RS126 secondary structure data set90 by removing entries related to DNA.
    BIND54        53     Reported as 54 binding proteins, actually 58 chains, nonredundant at 30% sequence identity, original list of proteins was reported in Ref 2.
    NB250         53     250 nonbinding proteins, mostly nonredundant at 35% sequence identity
    DBP374        66     374 DNA-binding proteins, significant redundancy at 25% sequence identity level
    TS75          66     75 DNA-binding proteins, designed to be independent from DBP374 and PDNA62 but has some redundant entries in both at 35% sequence identity level

    space.54,87 However, both approaches assume a sufficient number of close homologs to the target are available.
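    Several of the sequence features listed above are simple enough to sketch directly. The following Python fragment is an illustrative sketch only, not the implementation of any cited method: it computes the 20-dimensional global composition, the adjacent-pair composition, and a per-residue hydropathy profile using the Kyte-Doolittle values of Ref 86. The function names and the choice to ignore nonstandard residue codes are our assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (the scale of Ref 86).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """20-dimensional amino acid frequency vector (nonstandard codes ignored)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional frequency vector of adjacent residue pairs."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs[k] for k in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [pairs[k] / total for k in keys]


def hydropathy_profile(seq):
    """Per-residue hydropathy; unknown residues get 0.0 (an assumption)."""
    return [KYTE_DOOLITTLE.get(a, 0.0) for a in seq]
```

    The composition vectors describe a whole protein and suit DNA-binding protein identification, whereas the hydropathy profile is a per-residue feature suited to binding residue prediction.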

    In addition, sequence features are commonly augmented via sliding windows to capture the local sequence environment of a residue. Features of residues immediately to the left and right are concatenated onto those of a central residue before being presented to the machine learner. Window sizes between one (only the central residue) and eleven (five residues on either side) are commonly used. Many of the features described above are used in sliding windows in the approaches that describe them.
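    The sliding-window construction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and padding positions beyond the termini with zeros is our choice, since the studies reviewed here differ in how they treat the ends of a chain.

```python
def window_features(per_residue, half_width=5, pad=None):
    """Concatenate each residue's features with those of its neighbors.

    per_residue: list of equal-length feature lists, one per residue.
    half_width:  residues taken on each side (5 gives a window of 11).
    pad:         feature list used beyond the termini (zeros by default,
                 which is an assumption of this sketch).
    """
    if not per_residue:
        return []
    dim = len(per_residue[0])
    if pad is None:
        pad = [0.0] * dim
    n = len(per_residue)
    windows = []
    for i in range(n):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(per_residue[j] if 0 <= j < n else pad)
        windows.append(row)
    return windows
```

    With a 20-dimensional profile per residue and half_width=5, each residue is described by an 11 x 20 = 220-dimensional vector.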

    Machine Learning Tools

    Most standard machine learning tools have been applied to DNA-binding protein and DNA-binding residue prediction. The short list includes support vector machines (SVMs),56,71 neural networks,53,57 decision trees,67 Bayesian inference,63 logistic regression,77 and random forests.54,66

    Comparisons between methods to determine an optimal approach are hindered by the different data used for evaluation and the variation of basic assumptions amongst studies. For example, both Refs 56 and 57 do binding residue prediction, but the former uses only sequence-based features while the latter uses structural information and evaluates performance only on surface residues. Direct comparison of their reported performance is therefore not particularly informative.

    Data Sets

    If possible, new studies of DNA-protein interactions should employ a data set that has already been used in the literature. This facilitates direct comparison to previous efforts. Some common data sets in use are listed in Table 2 with relevant properties. In these cases, we have checked the sequence independence of the data sets to verify whether they correspond to the levels reported in the literature. Of particular interest are the data sets used for DBD-Hunter. These are the largest and most sequence-independent data sets in the literature, making them a good place to start for new work. The data sets in Table 2 are available as supplemental information to this paper. In addition, there are several new databases devoted solely to protein-DNA interactions which aggregate and augment information available from the PDB.88,89

    For new data sets, authors should report the maximum level of sequence similarity amongst proteins in the set. The similarity level should be kept at or below 30-35% to be comparable to current methods. This can be accomplished using a sequence clustering program such as blastclust (available from NCBI) to group similar sequences and then select a single representative from each cluster. It is also important to eliminate proteins that are subsequences of other proteins in a data set, which can also be done with blastclust. For example, the following use of blastclust will cluster sequences at 35% identity and detect subsequences that are as little as 10% of the length of other sequences.

    blastclust -i seq.fa -o seq.bc -S 35 -L 0.1

    This is the mechanism that was used to analyze sequence redundancy of the data sets in Table 2. Another popular method of clustering is the PISCES web server for sequence culling.91
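    Selecting a single representative from each cluster can then be scripted. The sketch below assumes that the clustering output lists one cluster per line as whitespace-separated sequence identifiers, and it simply keeps the first identifier on each line; both the assumed file layout and the choice of representative should be checked against the output actually produced. The function name is hypothetical.

```python
def cluster_representatives(cluster_text):
    """Return one sequence identifier per cluster.

    cluster_text: contents of a cluster file, assumed to hold one cluster
    per line with whitespace-separated identifiers. The first identifier
    on each line is taken as the representative (an assumption).
    """
    representatives = []
    for line in cluster_text.splitlines():
        ids = line.split()
        if ids:  # skip blank lines
            representatives.append(ids[0])
    return representatives


def redundancy_summary(cluster_text):
    """Return (number of sequences, number of clusters) for a cluster file."""
    clusters = [line.split() for line in cluster_text.splitlines() if line.split()]
    return sum(len(c) for c in clusters), len(clusters)
```

    Comparing the two numbers returned by redundancy_summary gives the kind of redundancy statement made in Table 2, for example 3797 sequences falling into 3482 independent clusters.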

    When dividing the data set for cross-validation, ensure divisions are done at the protein level even for binding residue prediction: residues from the same protein should not appear in both training and testing sets. When reporting performance, a variety of measures should be included, particularly an ROC analysis92 and the Matthews correlation coefficient.
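    A protein-level split for residue prediction can be sketched as below. This is one simple way to do it, assuming each residue example carries the identifier of its source protein; the function name, the round-robin assignment of proteins to folds, and the fixed random seed are illustrative choices rather than a prescribed protocol.

```python
import random


def protein_level_folds(residue_protein_ids, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) over residue examples.

    residue_protein_ids: one protein identifier per residue example.
    Whole proteins are assigned to folds, so residues of one protein
    never appear in both the training and the testing set.
    """
    proteins = sorted(set(residue_protein_ids))
    random.Random(seed).shuffle(proteins)
    fold_of = {p: k % n_folds for k, p in enumerate(proteins)}
    for fold in range(n_folds):
        train, test = [], []
        for i, p in enumerate(residue_protein_ids):
            (test if fold_of[p] == fold else train).append(i)
        yield train, test
```

    Splitting on residues directly would place neighbors from the same chain, with nearly identical window features, on both sides of the split and inflate the estimated performance.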

    Current State of the Art

    The current crop of DNA-binding protein predictors provides good results when sequences homologous to the target protein are available. Table 3 and Table 4 compile results for DNA-binding protein and DNA-binding residue prediction, respectively. These tables should be interpreted carefully, keeping the following points in mind. The methods are grouped by row based on the data set which is used in evaluation, many of which appear in Table 2. Refer to it for details on the level of sequence redundancy of the data set: even moderate levels of sequence redundancy make it artificially easy to achieve good prediction rates. Within each data set, evaluation strategies vary between studies, as some use leave-one-out cross-validation (also referred to as jackknife evaluation), while others employ fivefold or 10-fold cross-validation. Where possible, we have included footnotes on the strategy, as these variations in the splits of training/testing sets may also affect the inferred performance. Finally, background information is used in various ways by the different methods. For example, many methods use a sequence database to generate PSI-BLAST profiles56,57,75 while the threading methods7,55,77 rely on large numbers of known structures for their template libraries. Changing background information can affect performance, as is noted for DBD-Hunter between Refs 7 and 55.

    DISIS provides a model of how machine learning methods can be applied to DNA-binding residue prediction.56 For researchers implementing new sequence-based methods, it serves as a good example of how to describe the features derived for proteins, machine learning tools employed, and the evaluation framework used to gauge performance. The general methodology is equally applicable to set up DNA-binding protein prediction experiments. The only exception is that future studies should report a variety of performance measures, particularly a Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC). The work of Langlois and Lu49 is an excellent example of how to compare new work to older studies.
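    The performance measures recommended here, which are also the columns of Tables 3 and 4, follow directly from the confusion matrix and from the ranking of prediction scores. The sketch below uses the standard definitions and the same scaling as the tables (percentages, with MCC on a -100 to 100 scale); the function names and the convention of returning 0 for an undefined MCC are our assumptions.

```python
import math


def binary_metrics(y_true, y_pred):
    """ACC, SEN, SPE (percent) and MCC (-100 to 100) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def pct(num, den):
        return 100.0 * num / den if den else float("nan")

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0 in that case is a common convention, assumed here.
    mcc = 100.0 * (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": pct(tp + tn, tp + tn + fp + fn),
        "SEN": pct(tp, tp + fn),
        "SPE": pct(tn, tn + fp),
        "MCC": mcc,
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve (0 to 100) by pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))
```

    Reporting MCC and ROC alongside accuracy matters because binding residues and binding proteins are usually the minority class, so a predictor that labels everything as nonbinding can reach high accuracy with zero sensitivity.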

    DBD-Threader provides a state-of-the-art threading approach which is likely amongst the best predictors when good templates are available for a target.55 New structure-based methods should compare against its performance, again with the addition of reporting performance in terms of ROC.

    Most current DNA-binding classification methods rely upon the availability of similar proteins, either explicitly in the case of threading methods, or implicitly through the similarity measures used in machine learning methods and sequence comparison. When homologs to the target protein are not available, the task of identifying DNA-binding proteins and residues is significantly more difficult. The work in Refs 22 and 77 finds that homology modeling will usually fail when no good template is found. For sequence-based methods, this situation can be simulated by leaving out an entire structural class while training. Testing on the left-out structural fold led to only a modest drop in prediction performance for a sequence-based machine learner according to a small-scale study in Ref 95. Thus, sequence-based methods may be the best approach when predictions for truly new proteins are required.

    The number of experimentally verified DNA-binding structures is likely to continue increasing, which will extend the capabilities of similarity-based methods. However, until homologs are available for all protein families, predicting DNA-binding attributes of new proteins is likely to remain a challenge.

    FUTURE DIRECTIONS

    Machine learning has already impacted the study of protein-DNA interactions, particularly the identification of DNA-binding proteins. These innovations are set to continue down a number of avenues. The


    TABLE 3 Summary of DNA-Binding Protein Prediction Results (see note a)

    Method | Type | Data Set | ACC SEN SPE MCC ROC
    BLAST in Ref 49 | Q | | 79.3 27.8 90.4 21.5 66.0
    Langlois and Lu49,b | Q | | 89.1 48.1 98.0 66.2 90.3
    Langlois and Lu, LOO49,c | Q | BIND54 and | 91.1
    Nimrod et al.54 | T | NB25053 | 87.0 94.0
    Stawiski et al.53 | T | | 92.0 81.0 94.4 74.0
    Szilagyi and Skolnick77 | T | | 89.0 85.0 73.0 93.0
    BLAST in Ref 49 | Q | | 81.4 80.8 81.8 70.4 90.5
    Langlois and Lu49 | Q | PDNA62,d | 89.9 84.6 93.6 84.9 97.1
    Ahmad and Sarai79 | T | NB11058 | 83.9 80.8 87.0 68.0
    Szilagyi and Skolnick77 | T | | 92.0 85.0 79.0 95.0
    BLAST in Ref 49 | Q | Bhardwaj71 | 82.4 75.2 86.1 70.2 90.3
    Langlois and Lu49,c | Q | | 94.7 88.4 97.9 88.8 96.7
    Bhardwaj et al. CV571,e | T | | 89.1 82.1 93.9
    Bhardwaj et al.71,e | T | Bhardwaj71 | 90.3 67.4 94.9
    Nimrod et al.54,e,f | T | Filteredg | 73.6 94.9
    BLAST in Ref 49 | Q | Langlois72 | 72.7 42.7 83.2 70.4 90.5
    Langlois and Lu49,b | Q | | 89.6 69.3 96.7 74.3 91.3
    AdaC4.5 in Langlois et al.72,e | T | | 88.5 66.7 96.3 88.7
    BLAST in Ref 49 | Q | PD138, | 71.8 79.7 61.8 45.1 80.1
    Langlois and Lu49 | Q | NB11077 | 85.9 89.9 80.9 74.8 93.4
    Szilagyi and Skolnick77 | T | | 74.0 93.0
    Nimrod et al.54 | T | | 90.0 90.0 80.0 96.0
    BLAST in Ref 49 | Q | LEAC3560 | 72.9 59.4 80.4 46.3 74.9
    Langlois and Lu49 | Q | | 84.0 68.8 92.4 69.5 92.3
    BLAST in Ref 49 | Q | LEAC2560 | 69.4 42.6 82.4 28.6 67.8
    Langlois and Lu49 | Q | | 84.7 64.8 94.4 66.2 91.5
    Szilagyi and Skolnick77 | T | APO5477 | 85.0 85.0 72.0 93.0
    DBD-Hunterg | T | | 84.0 66.0 93.0
    Szilagyi and Skolnick77 | T | HOLO5477 | 80.0 85.0 68.0 91.0
    DBD-Hunterg | T | | 89.0 68.0 93.0
    PSI-BLAST in Ref 7 | Q | | 44.0 99.3 56.0
    PSI-BLAST (Uniprot DB) in Ref 55 | Q | | 43.0 99.3 55.3
    DBD-Hunterg,h | T | DB179 and | 58.0 99.5 69.0
    DBD-Hunter55,h | T | NB3797g | 56.0 99.6 68.1
    DBD-Threader55 | Q | | 61.0 99.2 68.0
    PROSPECTOR93 | Q | | 53.0 99.1 60.9
    Ahmad et al.58 | Q | NRTF-91558 | 64.5 68.6 63.4

    a Columns are: Method with citation; Type of T for structure based and Q for sequence based; Data Set which was used in evaluation; ACC for accuracy; SEN for sensitivity; SPE for specificity; MCC for Matthews Correlation Coefficient, scaled from -100 to 100; ROC for area under receiver operating curve, scaled from 0 to 100.
    b Ten-fold cross-validation.
    c Leave-one-out cross-validation.
    d PDNA62 is referred to as PD78 in Ref 77.
    e Five-fold cross-validation.
    f Reported in Supplementary Text S1 of Ref 54.
    g Filtered the data set in Ref 71 to be nonredundant at 20% sequence identity.
    h The differing performance of DBD-Hunter between Refs 7 and 55 is due to an updated template library.

    capability of machine learning to identify binding residues in a protein may be used to guide physical simulations of protein-DNA interactions. This capability has been utilized in some studies of protein interactions with small molecules to guide ligand docking69 and improve models of the binding site.96 The same methodology may also be employed for DNA-binding proteins.

    Another avenue of pursuit is applying machine learning to identify the genomic binding sites for TFs.


    TABLE 4 Summary of DNA-Binding Residue Prediction Results (see note a)

    Method | Type | Data Set | ACC SEN SPE MCC ROC
    DBD-Hunter7 | T | DB1797,b | 90.0 72.0 93.0
    DBD-Threader55,c | Q | | 87.5 60.0 92.0 52.0
    DISPLAR57,d | T | DISPLAR57 | 76.4 60.1
    Ahmad et al.58 | T | PDNA6258 | 40.3 81.8 79.1
    DP-BIND Structure61 | T | | 78.1 79.2 77.2 49.0 84.0
    Ahmad and Sarai59 | Q | | 66.4 68.2 66.0
    DP-BIND Sequence61 | Q | | 76.0 76.9 74.7 45.0 83.6
    BindN62 | Q | | 70.3 69.4 70.5 75.2
    Ho et al.73 | Q | PDNA6258 | 80.2 80.1 80.2
    BindN-RF74 | Q | | 78.2 78.1 78.2 86.1
    BindN+75 | Q | | 79.0 77.3 79.3 44.0 85.9
    Carson et al.67 | Q | | 78.5 79.7 77.2 57.0 85.7
    DISIS56 | Q | DISIS56 | 89.0
    Carson et al.67 | Q | | 86.4 84.6 87.0 72.5 93.1
    DNABINDPROT68 | T | | 78.6 9.3 90.5
    DP-BIND61,94,e | T | Ozbek68 | 74.0 63.7 75.0
    DISPLAR57,e | T | | 80.1 45.1 86.2
    DBD-HUNTER7,e | T | | 95.2 43.6 44.5
    Random Forests, Wu et al.66 | T | DBP37466 | 91.4 76.6 73.2 70.0 91.3
    BindN62 | Q | TS7566 | 78.2
    SVM, Wu et al.66 | Q | 3.5 Å cutoff f | 84.3
    Random Forests, Wu et al.66 | Q | | 85.5
    Random Forests, Wu et al.66 | Q | TS7566 | 80.5 67.2 81.8 34.1
    DP-BIND61 | Q | 4.5 Å cutoff f | 78.0 67.8 79.0 31.6
    Random Forests, Wu et al.66 | Q | TS7566 | 78.2 51.4 84.6 34.1
    DISIS56 | Q | 6.0 Å cutoff f | 81.6 7.7 99.2 19.0

    a Columns are: Method with citation; Type of T for structure based and Q for sequence based; Data Set which was used in evaluation; ACC for accuracy; SEN for sensitivity; SPE for specificity; MCC for Matthews Correlation Coefficient, scaled from -100 to 100; ROC for area under receiver operating curve, scaled from 0 to 100.
    b Only did binding residue prediction on 103 proteins predicted as DNA-binding proteins by DBD-Hunter and DBD-Threader, respectively. Average per-protein statistics reported.
    c Estimated from Figure 3 of Ref 55.
    d Evaluation was done on surface residues only.
    e Results reported in Supplementary Table S2 of Ref 68.
    f Refers to the distance cutoff determining DNA-binding residues.

    There has already been some work done to develop models for various structural classes of TFs.39 Features of both the genome binding site (DNA sequence) and the protein are used to train a classifier for each TF family. An analogous problem in cheminformatics is to classify small molecules according to whether they activate a particular protein. Recent work which employs multitask learning97 to characterize active compounds for different proteins98 may carry over directly to the case of TF-binding site identification on multiple TFs.

    Finally, a true head-to-head comparison of the various methods for DNA-binding protein identification and DNA-binding residue prediction would guide further development in this area. Dividing a benchmark into sequence-based and structure-based predictions would elucidate how much inference capability is gained when a protein's structure is available.

    NOTES

    a http://pedant.gsf.de/pedant3htmlview/pedant3view?Method=analysis&Db=p3 p168 Hom sapie. (Accessed August 26, 2010).

    b http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi. (Accessed August 25, 2010).


    ACKNOWLEDGMENTS

    This work was supported in part by NSF (IIS-0905220, OCI-1048018, IOS-0820730), NIH (RLM008713A), and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

    REFERENCES

    1. Charoensawan V, Wilson D, Teichmann SA. Genomic repertoires of DNA-binding transcription factors across the tree of life. Nucleic Acids Res 2010.

    2. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol 2000, 1:REVIEWS001.1-REVIEWS001.37.

    3. Walter MC, Rattei T, Arnold R, Guldener U, Münsterkotter M, Nenova K, Kastenmüller G, Tischler P, Wulling A, Volz A, et al. Pedant covers all complete RefSeq genomes. Nucleic Acids Res 2009, 37:D408-D411.

    4. Kuznetsov VA, Singh O, Jenjaroenpun P. Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome. BMC Genomics 2010, 11:S12.

    5. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA. DBD--taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res 2008, 36:D88-D92.

    6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000, 28:235-242.

    7. Gao M, Skolnick J. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res 2008, 36:3978-3992.

    8. Brown TA. Genomes. 2nd ed. Oxford, UK: Wiley-Liss; 2002.

    9. Pabo CO, Sauer RT. Transcription factors: structural families and principles of DNA recognition. Annu Rev Biochem 1992, 61:1053-1095.

    10. Mandel-Gutfreund Y, Schueler O, Margalit H. Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: in search of common principles. J Mol Biol 1995, 253:370-382.

    11. Luscombe NM, Laskowski RA, Thornton JM. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res 2001, 29:2860-2874.

    12. Desjarlais JR, Berg JM. Length-encoded multiplex binding site determination: application to zinc finger proteins. Proc Natl Acad Sci USA 1994, 91:11099-11103.

    13. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nucleic Acids Res 1998, 26:2306-2312.

    14. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16-23.

    15. Klug A, Schwabe JW. Protein motifs 5. Zinc fingers. FASEB J 1995, 9:597-604.

    16. Sarai A, Kono H. Protein-DNA recognition patterns and predictions. Annu Rev Biophys Biomol Struct 2005, 34:379-398.

    17. Pabo CO, Nekludova L. Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? J Mol Biol 2000, 301:597-624.

    18. Jayaram B, McConnell K, Dixit SB, Das A, Beveridge DL. Free-energy component analysis of 40 protein-DNA complexes: a consensus view on the thermodynamics of binding at the molecular level. J Comput Chem 2002, 23:1-14.

    19. Rohs R, Jin X, West SM, Joshi R, Honig B, Mann RS. Origins of specificity in protein-DNA recognition. Annu Rev Biochem 2010, 79:233-269.

    20. Donald JE, Chen WW, Shakhnovich EI. Energetics of protein-DNA interactions. Nucleic Acids Res 2007, 35:1039-1047.

    21. Aeling KA, Opel ML, Steffen NR, Tretyachenko-Ladokhina V, Hatfield GW, Lathrop RH, Senear DF. Indirect recognition in sequence-specific DNA binding by Escherichia coli integration host factor: the role of DNA deformation energy. J Biol Chem 2006, 281:39236-39248.

    22. Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res 2005, 33:5781-5798.

    23. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet 2010, 11:751-760.

    24. Aeling KA, Steffen NR, Johnson M, Hatfield GW, Lathrop RH, Senear DF. DNA deformation energy as an indirect recognition mechanism in protein-DNA interactions. IEEE/ACM Trans Comput Biol Bioinform 2007, 4:117-125.


    25. Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. Brief Funct Genomic Proteomic 2009, 8:215-230.

    26. Hannenhalli S. Eukaryotic transcription factor binding sites--modeling and integrative search methods. Bioinformatics 2008, 24:1325-1331.

    27. Vingron M, Brazma A, Coulson R, van Helden J, Manke T, Palin K, Sand O, Ukkonen E. Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol 2009, 10:202.

    28. Bulyk ML. Computational prediction of transcription-factor binding site locations. Genome Biol 2003, 5:201.

    29. Elnitski L, Jin VX, Farnham PJ, Jones SJM. Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res 2006, 16:1455-1464.

    30. Das MK, Dai H-K. A survey of DNA motif finding algorithms. BMC Bioinform 2007, 8:S21.

    31. Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23:137-144.

    32. Bulyk ML, Johnson PLF, Church GM. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res 2002, 30:1255-1261.

    33. Benos PV, Bulyk ML, Stormo GD. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res 2002, 30:4442-4451.

    34. Berger MF, Bulyk ML. Protein binding microarrays (PBMS) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods Mol Biol 2006, 338:245-260.

    35. Kaplan T, Friedman N, Margalit H. Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput Biol 2005, 1:e1.

    36. Loo PV, Marynen P. Computational methods for the detection of Cis-regulatory modules. Brief Bioinform 2009, 10:509-524.

    37. Barash Y, Elidan G, Friedman N, Kaplan T. Modeling dependencies in protein-DNA binding sites. In: Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology, RECOMB '03. New York, USA: ACM; 2003, 28-37.

    38. Zhou Q, Liu JS. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 2004, 20:909-916.

    39. Narlikar L, Gordan R, Ohler U, Hartemink AJ. Informative priors based on transcription factor structural class improve de novo motif discovery. Bioinformatics 2006, 22:e384-e392.

    40. Hu M, Yu J, Taylor JMG, Chinnaiyan AM, Qin ZS. On the detection and refinement of transcription factor binding sites using ChIP-Seq data. Nucleic Acids Res 2010, 38:2154-2167.

    41. Zhu Z, Pilpel Y, Church GM. Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J Mol Biol 2002, 318:71-81.

    42. Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with expression. Nat Genet 2001, 27:167-174.

    43. Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z. Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res 2004, 32:1372-1381.

    44. Roider HG, Manke T, O'Keeffe S, Vingron M, Haas SA. PASTAA: identifying transcription factors associated with sets of co-regulated genes. Bioinformatics 2009, 25:435-442.

    45. Kim TH, Ren B. Genome-wide analysis of protein-DNA interactions. Annu Rev Genomics Hum Genet 2006, 7:81-102.

    46. Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009, 10:669-680.

    47. Bulyk ML. Analysis of sequence specificities of DNA-binding proteins with protein binding microarrays. Methods Enzymol 2006, 410:279-299.

    48. van Helden J, Rios AF, Collado-Vides J. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 2000, 28:1808-1818.

    49. Langlois RE, Lu H. Boosting the prediction and understanding of DNA-binding domains from sequence. Nucleic Acids Res 2010, 38:3149-3158.

    50. Jones S, Shanahan HP, Berman HM, Thornton JM. Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Res 2003, 31:7189-7198.

    51. Shanahan HP, Garcia MA, Jones S, Thornton JM. Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res 2004, 32:4732-4741.

    52. Tsuchiya Y, Kinoshita K, Nakamura H. Structure-based prediction of DNA-binding sites on proteins using the empirical preference of electrostatic potential and the shape of molecular surfaces. Proteins 2004, 55:885-894.

    53. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y. Annotating nucleic acid-binding function based on protein structure. J Mol Biol 2003, 326:1065-1079.

    54. Nimrod G, Szilagyi A, Leslie C, Ben-Tal N. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J Mol Biol 2009, 387:1040-1053.

    55. Gao M, Skolnick J. A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Comput Biol 2009, 5:e1000567.


    56. Ofran Y, Mysore V, Rost B. Prediction of DNA-binding residues from sequence. Bioinformatics 2007, 23:i347-i353.

    57. Tjong H, Zhou H-X. DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Res 2007, 35:1465-1477.

    58. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004, 20:477-486.

    59. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinform 2005, 6:33.

    60. Bhardwaj N, Langlois R, Zhao G, Lu H. Structure based prediction of binding residues on DNA-binding proteins. Conf Proc IEEE Eng Med Biol Soc 2005, 3:2611-2614.

    61. Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006, 64:19-27.

    62. Wang L, Brown SJ. BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res 2006, 34:W243-W248.

    63. Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V. Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinform 2006, 7:262.

    64. Bhardwaj N, Lu H. Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Lett 2007, 581:1058-1066.

    65. Andrabi M, Mizuguchi K, Sarai A, Ahmad S. Prediction of mono- and di-nucleotide specific DNA-binding sites in proteins using neural networks. BMC Struct Biol 2009, 9:30.

    66. Wu J, Liu H, Duan X, Ding Y, Wu H, Bai Y, Sun X. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics 2009, 25:30-35.

    67. Carson MB, Langlois R, Lu H. NAPS: a residue-level nucleic acid-binding prediction server. Nucleic Acids Res 2010, 38:W431-W435.

    68. Ozbek P, Soner S, Erman B, Haliloglu T. DNABINDPROT: fluctuation-based predictor of DNA-binding residues within a network of interacting residues. Nucleic Acids Res 2010, 38:W417-W423.

    69. Ghersi D, Sanchez R. Improving accuracy and efficiency of blind protein-ligand docking by focusing on predicted binding sites. Proteins 2009, 74:417-424.

    70. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006, 7:225-242.

    71. Bhardwaj N, Langlois RE, Zhao G, Lu H. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res 2005, 33:6486-6493.

    72. Langlois RE, Carson MB, Bhardwaj N, Lu H. Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins. Ann Biomed Eng 2007, 35:1043-1052.

    73. Ho S-Y, Yu F-C, Chang C-Y, Huang H-L. Design of accurate predictors for DNA-binding sites in proteins using hybrid SVM-PSSM method. Biosystems 2007, 90:234-241.

    74. Wang L, Yang MQ, Yang JY. Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics 2009, 10:S1.

    75. Wang L, Huang C, Yang MQ, Yang JY. BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst Biol 2010, 4:S3.

    76. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234:779-815.

    77. Szilagyi A, Skolnick J. Efficient prediction of nucleic acid binding function from low-resolution protein structures. J Mol Biol 2006, 358:922-933.

    78. Shazman S, Celniker G, Haber O, Glaser F, Mandel-Gutfreund Y. Patch finder plus (PFplus): a web server for extracting and displaying positive electrostatic patches on protein surfaces. Nucleic Acids Res 2007, 35:W526-W530.

    79. Ahmad S, Sarai A. Moment-based prediction of DNA-binding proteins. J Mol Biol 2004, 341:65-71.

    80. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22:2577-2637.

    81. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 2007, 23:538-544.

    82. Karypis G. YASSPP: better kernels and coding schemes lead to improvements in SVM-based secondary structure prediction. Proteins: Struct Funct Bioinform 2006, 64:575-586.

    83. Kauffman C, Karypis G. An analysis of information content present in protein-DNA interactions. In: Pacific Symposium on Biocomputing. Fairmont Orchid Hotel, Big Island of Hawaii; 2008, 477-488.

    84. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

    85. Kumar M, Gromiha MM, Raghava GPS. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinform 2007, 8:463.


    86. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157:105-132.

    87. Ahmad S, Keskin O, Sarai A, Nussinov R. Protein-DNA interactions: structural, thermodynamic and clustering patterns of conserved residues in DNA-binding proteins. Nucleic Acids Res 2008, 36:5922-5932.

    88. Norambuena T, Melo F. The protein-DNA interface database. BMC Bioinform 2010, 11:262.

    89. Contreras-Moreira B. 3D-footprint: a database for the structural analysis of protein-DNA complexes. Nucleic Acids Res 2010, 38:D91-D97.

    90. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232:584-599.

    91. Wang G, Dunbrack RL Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res 2005, 33:W94-W98.

    92. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett 2006, 27:861-874.

    93. Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR 3 threading algorithm. Proteins 2004, 56:502-518.

    94. Hwang S, Gou Z, Kuznetsov IB. DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 2007, 23:634-636.

    95. Gou Z, Kuznetsov IB. On the accuracy of sequence-based computational inference of protein residues involved in interactions with DNA. Trends Appl Sci Res 2008, 3:285-291.

    96. Kauffman C, Rangwala H, Karypis G. Improving homology models for protein-ligand binding sites. In: LSS Comput Syst Bioinformatics Conference, San Francisco, CA; 2008. Available at: http://www.cs.umn.edu/karypis. (Accessed December 10, 2009).

    97. Caruana R. Multitask learning: a knowledge-based source of inductive bias. In: ICML 1993, 41-48.

    98. Ning X, Rangwala H, Karypis G. Multi-assay-based structure-activity relationship models: improving structure-activity relationship models by incorporating activity information from related targets. J Chem Inf Model 2009, 49:2444-2456.
