challenges in proteome-scale prediction, annotation and assessment

22
Interpreting functional effects of coding variants: challenges in proteome-scale prediction, annotation and assessment Khader Shameer,* Lokesh P. Tripathi,* Krishna R. Kalari, Joel T. Dudley and Ramanathan Sowdhamini Corresponding author. Ramanathan Sowdhamini, National Centre for Biological Sciences, UAS-GKVK campus, Bellary Road, Bangalore 560 065, India. E-mail: [email protected] *Denotes equal contribution. Abstract Accurate assessment of genetic variation in human DNA sequencing studies remains a nontrivial challenge in clinical gen- omics and genome informatics. Ascribing functional roles and/or clinical significances to single nucleotide variants identified from a next-generation sequencing study is an important step in genome interpretation. Experimental characterization of all the observed functional variants is yet impractical; thus, the prediction of functional and/or regulatory impacts of the various Khader Shameer, PhD, obtained his MSc in Bioinformatics at Mahatma Gandhi University, Kerala and PhD from Manipal University, Karnataka and National Centre for Biological Sciences, Bangalore. He completed his postdoctoral training at the Mayo Clinic in the Departments of Biomedical Informatics and Statistics, Health Science Research and Cardiovascular Diseases and also led the genomic data analyses for MIGENES trial. Currently, Dr Khader is a senior biomedical and health care data scientist in Dr Joel Dudley’s laboratory at the Mount Sinai Health System and Harris Center for Precision Wellness. Dr Khader is working at the interface of biomedical informatics, drug repositioning and genomic medicine and pioneering efforts to translate the advances in translational bioinformatics to population health management and precision medicine. Lokesh P. Tripathi, PhD, obtained his PhD in Life Sciences (Computational Biology) from Tata Institute of Fundamental Research and National Centre for Biological Sciences, Bangalore, India. Dr Tripathi is presently a postdoctoral research scientist at the National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, Japan. His research interests include application of integrative systems biology approaches that leverage multiple biological data types to investigate cellular networks to better understand complex biological systems. His research also aims to unravel the causal mechanisms underly- ing disease outcomes that may elucidate biomarkers and be eventually harnessed for the development of better therapeutic strategies. Krishna R. Kalari, PhD, focuses on the development of computational methods in the context of cancer genomics and pharmacogenomics. Rani develops novel algorithms to integrate multilayer omics data to understand pathophysiological mechanisms of breast cancer disease or variation to drug response. As part of the Cancer Genome Atlas analysis working group, Dr Kalari’s group at the Mayo Clinic are analyzing breast cancer genomics data sets. In the course of our studies, we have also completed the analysis of various breast cancer cell lines, primary breast tumors and breast tumors after treatment. Dr Kalari also leads projects novel analytical pipelines for integrating our data with large-scale projects data related to mRNA abundance, germ line and som- atic mutations, gene copy number, alternative splicing and fusion gene transcripts. Joel T. Dudley, Ph.D. is an assistant professor of the department of genetics and genomics at Icahn Institute of Multiscale Biology, health policy at Icahn School of Medicine at Mount Sinai, NYC. He currently directs the biomedical informatics programs of Department of Genetics and Genomics, The Icahn Institute for Genomics and Multiscale Biology at Mount Sinai, Clinical and Translational Science Activities, Mount Sinai Health System and Harris Center for Precision Wellness. Dr. Dudley obtained his Ph.D. in Biomedical Informatics from Stanford University, CA. Dr. Dudley’s research focuses on the devel- opment and use of translational informatics approaches to address critical challenges in systems medicine. Dudley laboratory is involved in various proj- ects for identification and development of novel therapeutic and diagnostic approaches for human disease through integration and analysis of healthcare, biomedical and wellness data, and also perform research to develop and evaluate methods to incorporate multi-scale data including genomic sequencing data and wellness science into clinical practice. Ramanathan Sowdhamini, PhD, is a professor in bioinformatics. She leads the computational biology group at the National Centre for Biological Sciences (TIFR), Bangalore. Sowdhamini laboratory focuses on understanding of the functionally significant regions of proteins for the theoretical prediction of protein function. Sowdhamini leads several research themes to understand sequence, structure, interaction and function of various protein families, including drug tar- gets. Dr Sowdhamini developed and improved on a wide range of computational methods, prediction algorithms, open-access web servers, publicly databases and software systems for the analysis of genomes and proteomes. Dr Sowdhamini also leads a portfolio of research projects under the theme of the application of bioinformatics for agro-socioeconomic challenges and develops better strategies to perform sustainable breeding in experimental and crop plants. Submitted: 14 May 2015; Received (in revised form): 15 July 2015 V C The Author 2015. Published by Oxford University Press. For Permissions, please email: [email protected] 1 Briefings in Bioinformatics, 2015, 1–22 doi: 10.1093/bib/bbv084 Paper Briefings in Bioinformatics Advance Access published October 22, 2015 at New York University on February 16, 2016 http://bib.oxfordjournals.org/ Downloaded from

Upload: dokhanh

Post on 23-Dec-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: challenges in proteome-scale prediction, annotation and assessment

Interpreting functional effects of coding variants

challenges in proteome-scale prediction

annotation and assessmentKhader Shameer Lokesh P Tripathi Krishna R Kalari Joel T Dudley andRamanathan SowdhaminiCorresponding author Ramanathan Sowdhamini National Centre for Biological Sciences UAS-GKVK campus Bellary Road Bangalore 560 065 IndiaE-mail minincbsresinDenotes equal contribution

Abstract

Accurate assessment of genetic variation in human DNA sequencing studies remains a nontrivial challenge in clinical gen-omics and genome informatics Ascribing functional roles andor clinical significances to single nucleotide variants identifiedfrom a next-generation sequencing study is an important step in genome interpretation Experimental characterization of allthe observed functional variants is yet impractical thus the prediction of functional andor regulatory impacts of the various

Khader Shameer PhD obtained his MSc in Bioinformatics at Mahatma Gandhi University Kerala and PhD from Manipal University Karnataka andNational Centre for Biological Sciences Bangalore He completed his postdoctoral training at the Mayo Clinic in the Departments of BiomedicalInformatics and Statistics Health Science Research and Cardiovascular Diseases and also led the genomic data analyses for MIGENES trial Currently DrKhader is a senior biomedical and health care data scientist in Dr Joel Dudleyrsquos laboratory at the Mount Sinai Health System and Harris Center forPrecision Wellness Dr Khader is working at the interface of biomedical informatics drug repositioning and genomic medicine and pioneering efforts totranslate the advances in translational bioinformatics to population health management and precision medicineLokesh P Tripathi PhD obtained his PhD in Life Sciences (Computational Biology) from Tata Institute of Fundamental Research and National Centre forBiological Sciences Bangalore India Dr Tripathi is presently a postdoctoral research scientist at the National Institutes of Biomedical Innovation Healthand Nutrition Osaka Japan His research interests include application of integrative systems biology approaches that leverage multiple biological datatypes to investigate cellular networks to better understand complex biological systems His research also aims to unravel the causal mechanisms underly-ing disease outcomes that may elucidate biomarkers and be eventually harnessed for the development of better therapeutic strategiesKrishna R Kalari PhD focuses on the development of computational methods in the context of cancer genomics and pharmacogenomics Rani developsnovel algorithms to integrate multilayer omics data to understand pathophysiological mechanisms of breast cancer disease or variation to drug responseAs part of the Cancer Genome Atlas analysis working group Dr Kalarirsquos group at the Mayo Clinic are analyzing breast cancer genomics data sets In thecourse of our studies we have also completed the analysis of various breast cancer cell lines primary breast tumors and breast tumors after treatment DrKalari also leads projects novel analytical pipelines for integrating our data with large-scale projects data related to mRNA abundance germ line and som-atic mutations gene copy number alternative splicing and fusion gene transcriptsJoel T Dudley PhD is an assistant professor of the department of genetics and genomics at Icahn Institute of Multiscale Biology health policy at IcahnSchool of Medicine at Mount Sinai NYC He currently directs the biomedical informatics programs of Department of Genetics and Genomics The IcahnInstitute for Genomics and Multiscale Biology at Mount Sinai Clinical and Translational Science Activities Mount Sinai Health System and Harris Centerfor Precision Wellness Dr Dudley obtained his PhD in Biomedical Informatics from Stanford University CA Dr Dudleyrsquos research focuses on the devel-opment and use of translational informatics approaches to address critical challenges in systems medicine Dudley laboratory is involved in various proj-ects for identification and development of novel therapeutic and diagnostic approaches for human disease through integration and analysis ofhealthcare biomedical and wellness data and also perform research to develop and evaluate methods to incorporate multi-scale data including genomicsequencing data and wellness science into clinical practiceRamanathan Sowdhamini PhD is a professor in bioinformatics She leads the computational biology group at the National Centre for Biological Sciences(TIFR) Bangalore Sowdhamini laboratory focuses on understanding of the functionally significant regions of proteins for the theoretical prediction of proteinfunction Sowdhamini leads several research themes to understand sequence structure interaction and function of various protein families including drug tar-gets Dr Sowdhamini developed and improved on a wide range of computational methods prediction algorithms open-access web servers publicly databasesand software systems for the analysis of genomes and proteomes Dr Sowdhamini also leads a portfolio of research projects under the theme of the applicationof bioinformatics for agro-socioeconomic challenges and develops better strategies to perform sustainable breeding in experimental and crop plantsSubmitted 14 May 2015 Received (in revised form) 15 July 2015

VC The Author 2015 Published by Oxford University Press For Permissions please email journalspermissionsoupcom

1

Briefings in Bioinformatics 2015 1ndash22

doi 101093bibbbv084Paper

Briefings in Bioinformatics Advance Access published October 22 2015 at N

ew Y

ork University on February 16 2016

httpbiboxfordjournalsorgD

ownloaded from

mutations using in silico approaches is an important step toward the identification of functionally significant or clinicallyactionable variants The relationships between genotypes and the expressed phenotypes are multilayered and biologicallycomplex such relationships present numerous challenges and at the same time offer various opportunities for the design of insilico variant assessment strategies Over the past decade many bioinformatics algorithms have been developed to predictfunctional consequences of single nucleotide variants in the protein coding regions In this review we provide an overview ofthe bioinformatics resources for the prediction annotation and visualization of coding single nucleotide variants We discussthe currently available approaches and major challenges from the perspective of protein sequence structure function andinteractions that require consideration when interpreting the impact of putatively functional variants We also discuss therelevance of incorporating integrated workflows for predicting the biomedical impact of the functionally important variationsencoded in a genome exome or transcriptome Finally we propose a framework to classify variant assessment approachesand strategies for incorporation of variant assessment within electronic health records

Key words human genome functional variant human variation variant interpretation mutation human proteome non-synonymous mutations prediction algorithms functional genomics sequence analysis structure analysis

Introduction

Genomic technologies are redefining the understanding of geno-typendashphenotype relationships In the early 2000s array-basedgenomic technologies enabled gene expression analysis usingmicroarrays followed by single nucleotide polymorphism (SNPgenetic variation observed in a population) genotyping platformsin mid-2000s and subsequently by low-cost high-throughputmassively parallel sequencing platforms in the late 2000s [1ndash3]Genomic sequencing data offer insights into the relationships be-tween the human genomic variations and various molecular anddisease phenotypes Sequencing a biological or clinical sample tocharacterize genomic exomic or transcriptomic variations usingnext-generation sequencing (NGS) technologies helps to identifygenomic variations underlying complex diseases Moreoverapproaches such as targeted sequencing of disease-susceptiblegenomic regions whole exome sequencing (WES) whole genomesequencing (WGS) and RNA sequencing (RNA-Seq) providedeeper insights into the genetic bases of familial diseases and abetter understanding of the biological processes underlying dis-ease phenotypes such as tumors

Over the past decade the genetic basis of several complex dis-eases and clinically relevant quantitative traits were examinedusing SNP genotyping array-based genome-wide associationstudies (GWAS) Subsequently sequencing [4ndash6] studies have im-proved the understanding of the associations between SNPs withclinically relevant traits and human diseases (See httpwwwebiacukfgptgwas) SNP arrays have uncovered the more com-mon variations while sequencing has unravelled much of therare variants spectrum The frequency of SNPs in the humangenome is approximately 1300 base pair (bp) SNPs are generallycategorized as common variants [in which the minor allele fre-quency (MAF)frac14 1ndash5] or rare variants (MAFlt 1) according totheir population frequencies Both GWAS and targeted WES andWGS studies have expanded the catalog of genotypendashphenotypeassociations and offered insights into the roles of previouslyuncharacterized genetic regions in complex diseases [7ndash12]

The rapid increase in the high-quality sequencing data gener-ation using low-cost NGS experimental platforms coupled withspeedy bioinformatics algorithms have enhanced the identifica-tion of a large number of sequence and structural variants vari-ations (characterized by genomic DNAgt 1 kb in size) SNPs areassociated with medically relevant phenotypes as well as dis-eases [13ndash15] and are often observed on a population scalewhereas single nucleotide variations (SNVs) are specific poly-morphisms observed in an individual Recent studies have high-lighted the roles of the different types of structural variants (copy

number variations insertions and deletions (indels) inversionstranslocations linking anchored split mapping gainloss) in thegenetic bases of several complex diseases [16ndash19] Furthermoresequencing studies have helped to identify rare personal variantsand variants of unknown significance (VUS)

Public repositories of genomic sequencing and variation datahave experienced an exponential growth in the past decade Forexample Single Nucleotide Polymorphism database (dbSNP) [20]and Ensembl Variation database [21] that archive short geneticvariants and structural variants and Database of GenomicVariants [22] that archives a catalog of curated large-scale gen-omic structural variants in the human genome are expandingSince the first GWAS study reported in 2005 so far genetic basisof 1251 traits were discovered These investigations also ledphenotypic annotations for 15 396 SNPs A large number of clin-ical-grade genomes exomes or transcriptomes sequenced forindividualized medicine [23ndash25] and population-scale sequencing[26] projects such as 1000 genomes [27] Genome10K (httpwwwgenome10korg) and UK10K (httpwwwuk10korg) will furtheradd to the size of variant-centric databases in the futureAnalysis of the data from the sequencing experiments can bebroadly divided into four major tasks (i) quality assessment ofthe sequencing reads (ii) alignment of the sequencing reads withthe reference genome (iii) variant calling and (iv) functional andor clinical assessment and prioritization of variants

The variants identified from the NGS studies present severaldata interpretation challenges in bioinformatics The initial datainference is a key filtering step in the identification and prioritiza-tion of a subset of variants for functionally important cues andvalidation studies The following sections describe a simpleframework to classify the available and widely used bioinfor-matics resources for prediction annotation and interpretation ofcoding SNVs We also discuss 10 different analytical themes thatcan potentially be investigated from a bioinformatics perspectiveto obtain a better understanding of coding SNVs

Landscape of genetic variants

The mutation spectrum of the human genome is complex andhas been classified based on diverse criteria such as the mode ofinheritance heterozygosity pattern impact on chromosome or al-leles impact on protein sequence structure and function impacton the population or evolutionary role and penetrance (Figure 1)On the other hand SNVs may be broadly classified as codingSNVs (functional variations) that perturb the function of a tran-script or a protein or variants may be classified as noncoding

2 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

SNVs (regulatory variations) that are located in the regulatoryelements [promoters untranslated regions (UTRs) enhancershuman accelerated regions noncoding RNA genes transcriptionfactor binding sites (TFBS)] that regulate the expression levels of atranscript or a protein Regulatory variants may also perturb geneor protein functions via complex cis or trans-activation mechan-isms and may thus play important roles in influencing the ex-pression and functions of other genes An example of a functionalvariant (P33S) in the ribonucleoside-diphosphate reductase sub-unit M2 B (RRM2B) gene associated with autosomal recessive pro-gressive external ophthalmoplegia is provided in Figure 2(i) and2(ii) Another example of a missense variant R363H (rs11556924) inzinc finger C3HC-type containing 1 (ZC3HC1) gene is associatedwith coronary heart disease [28] P33S in RRM2B is part of a con-served haribonucleotide reductase domain The variant R363H inZC3HC1 does not lie within a known functional domain and thusthe variation is located in an unassigned region of the proteinThe domain architectures of RRM2B and ZC3HC1 along with thelocation of functional variants are highlighted in Figure 2(ii)Regulatory variants may influence the expression levels of cis- ortrans-acting genes through gene regulation networks For ex-ample an intergenic variant (rs10811661) in the 9p21 locus whichis harbored near CDKN2A and CDKN2B on chromosome 9 wasassociated with myocardial infarction [29] and coronary heart dis-ease [9] A targeted deletion study in mice that investigated the9p21 region revealed a regulatory role of the variant in perturbingthe expression of two genes (Cdkn2a and Cdkn2b) via a cis-acting

mechanism [30] Another regulatory variant rs342293 (intergenicvariant between FLJ36031-PIK3CG) which influences a quantitativetrait (mean platelet volume) [31] perturbed a TFBS of ecotropicviral integration site-1 (EVI1) and influenced the expressionlevels of PIK3CG [32] The genomic locations of the regulatory vari-ants rs10811661 and rs342293 that were generated using theEnsembl Genome Browser are highlighted in Figure 2(iii)Regulatory variants too impact the biological function by alteringthe level of transcription that may influence protein levels as wellThe majority of regulatory variants are not within the protein cod-ing regions but may indirectly influence geneprotein function viaalternative splicing mechanisms [33] See the recent reviewsthat summarized the impact of regulatory variation of protein ex-pression and function for a detailed account of regulatory variants[34 35]

Prediction annotation and visualizationof coding SNVs

Scientific literature often refers to variants using diverse sets ofterms including recommended nomenclature [36] or standardterms from Sequence Ontology (SO) [37] In this review we broadlyclassify genomic variants as lsquofunctional variantsrsquo and lsquoregulatoryvariantsrsquo and we primarily focus on coding SNVs and computa-tional approaches for the prediction annotation visualization andinterpretation of coding SNVs The different terms used to define

Mu

tatio

n

Chromosomalallelic

SequenceStructure

Function

Population

Loss of Heterozygozity

Inheritance

Germ line mutation

Somatic mutation

Passenger mutation

Driver mutation

Harmful mutation

Benign mutation

Neutral mutation

Deleterious mutation

Advantageous mutation

Nearly neutral mutation

Point mutation

Frameshift mutation

Nonsense mutation

Missense mutation

Synonymous mutation

Non-synonymous mutation

Silent mutation

Transitions

Transversions

Deletions

Chromosomal inversions

Amplifications

Deletions

Chromosomal translocations

Loss of Heterozygozity

Interstitial deletions

Loss-of-function mutation

Lethal mutation

Gain-of-function mutation

Gain-of-function mutation

Dominant negative mutations

Allele frequency

Common variant

Rare variant

Private variant

Penetrance

High penetrance

Low penetrance

Recessive

Dominant

Homozygous

Heterozygous

Spontaneous mutation

Induced mutation

Figure 1 The human mutation spectrum

Interpreting functional effects of coding variants | 3

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

variants in the Ensembl Variation resources and dbSNP and theircorresponding SO identifiers and descriptions are summarized inTable 1 The interpretation of variants can be broadly divided intothree steps lsquoannotationrsquo lsquopredictionrsquo and lsquovisualizationrsquo

A given set of variant(s) identified from a sequencing studywill be segregated into various classes of genomic variants in thefirst step in variant annotation (Table 1) Cross-referencing thevariants with reference databases and clinical annotation data-bases like ClinVar [38] can be used to assess the novelty of gen-omic variants Fully automated pipelines can be used to annotatevariants based on coordinate specific mapping to genomic re-gions using proprietary andor public databases and various fea-tures associated with variant can be derived from suchannotation mapping and variant-specific layering approach Forexample a variant can be mapped to a protein-coding regionjunction regions or noncoding regions of the genome

Based on the location of the variant in the protein-coding ornoncoding regions variants (non-synonymous missense non-sense or frameshift variants) can be further examined to under-stand their impact on protein functions Tools like CombinedAnnotation-Dependent Depletion (CADD) [39] SIFT (as toleratedor damaging variants) [40] PolyPhen2 (as benign possibly

damaging or probably damaging variants) [41] or Condel (meta-predictor that combines prediction scores from multiple tools)[42] can be leveraged for predictive assessment of genetic vari-ants Annotation primarily provides localization of a genetic vari-ant using genome coordinates prediction aims to hypothesizethe probable impact of a particular genetic variant on function orregulation The combination of annotation and prediction pro-vides an integrated view of genomic variants Tools such assnpEff or the Ensembl Variant Predictor can be used to predict theimpact of a variation in comparison with the reference sequenceThe impact of a sequence variant with respect to the evolutionaryconservation can also be predicted or derived from GRANTHAMscore [43] genomic evolutionary rate profiling (GERP) score [44]phylogenetic P-value (PhyloP) score [45] and PhastCons score [46]The development of substitution matrices in the early 1970s wasinstrumental in fueling the design and implementation of com-putational approaches to infer the functional impact of genomicvariants [43 47] Matrices such as Point Accepted Mutations(PAM1 matrix substitution probabilities for sequences with a mu-tation rate of 1100 amino acids PAM250 matrix 250 mutations100 amino acids) BLOCK SUbstitution Matrix (BLOSUM) matrices[48] and sequence search algorithms designed to identify the

Figure 2 Examples of the coding (functional) and noncoding (regulatory) variants (i) Functional variant (Pro33Ser) in RRM2B associated with autosomal recessive progres-

sive external ophthalmoplegia visualized on a protein structure (PDB ID 2vux Quaternary assembly is generated using PISAPDBe) Functional variant (Pro33Ser) is high-

lighted in red color inside the green circle on chains A (part of loop) and B (part of helix) Visualization was created using UCSF Chimera (wwwcglucsfeduchimera) (ii)

Protein domain architectures and functional variants mapped to ii (a) RRM2B and ii (b) ZC3HC1 Ribonuc_red_smfrac14Ribonucleotide reductase domain zffrac14C3HC zinc finger-

like domain Rsm1frac14Rsm1-like domain Functional variants are highlighted using red vertical line Figure was generated using MyDomains (httpprositeexpasyorgmydo-

mains) (iii) Genomic localization of the regulatory variants (a) rs10811661 and (b) rs342293 The location of the variants are highlighted using a vertical red line Regulatory

variant rs10811661 regulated the expression of nearby genes CDKN2A and CDKN2B (highlighted in red boxes) Intergenic variant rs342293 located between FLJ36031

(CCDC71L)-PIK3CG is located in a TFBS of EVI1 that regulates the expression (repress) of PIK3CG Genomic regions were visualized using Ensembl Genome Browser v 67

4 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

similarity between any two sequences (pairwise sequence align-ment) and the evolutionary relationships between two or moresequences (multiple sequence alignments) have spawned the de-velopment of improved heuristic homology search tools [49 50]The early 2000s witnessed the development of several predictivemethods based on sequence conservation amino acid substitu-tions and perturbation of the local structural environment to as-sess the impact of functional variants in proteins [51ndash57] The kaks ratio (or dNdS) test which estimates the ratio of the totalnumber of non-synonymous substitutions per non-synonymoussites (dN) to the total number of synonymous substitutions to

synonymous sites (dS) is widely used to quantify the selectionpressure on functional genes [52] Furthermore bioinformaticstools databases and statistical methods for the identification an-notation and analysis of SNPs from genotyping and sequencingdata have been amply surveyed in literature [58ndash66]

Integrative approaches can provide additional annotationsfor variants such as the location of the variants within the con-served protein sequencestructure domains (contiguous unit inprotein sequence or structure with evidence of functional anno-tation from experimental or computational function associationmethods) within known functional sites Gene Ontology (GO)

Table 1 The naming convention used to describe the sequence variants in the Ensembl Variation resources and dbSNP

Ensembl variantconsequences

dbSNPfunctionalclasses

SO ID SO Definition

Essential splice site splice-3 SO0001574SO0001575

A splice variant that changes the two base regions at the 30 end of anintron a splice variant that changes the two base regions at the 50

end of an intronStop gained splice-5 SO0001587 A sequence variant whereby at least one base of a codon is changed

resulting in a premature stop codon leading to a shortenedtranscript

Stop lost nonsense SO0001578 A sequence variant where at least one base of the terminator codon(stop) is changed resulting in an elongated transcript

Complex indel NA SO0001577 A transcript variant with a complex INDELmdashInsertion or deletion thatspans an exonintron border or a coding sequenceUTR border

Frameshift coding frameshift SO0001589 A sequence variant which causes a disruption of the translationalreading frame because the number of nucleotides inserted ordeleted is not a multiple of three

Non-synonymouscoding

missense SO0001582SO0001652SO0001651SO0001583

A codon variant that changes at least one base of the first codon of atranscript an inframe non-synonymous variant that deletes basesfrom the coding sequence an inframe non-synonymous variantthat inserts bases into in the coding sequence a sequence variantwhere the change may be longer than three bases and at least onebase of a codon is changed resulting in a codon that encodes for adifferent amino acid

Splice site NA SO0001630 A sequence variant in which a change has occurred within the regionof the splice site either within 1ndash3 bases of the exon or 3ndash8 bases ofthe intron

Partial codon NA SO0001626 A sequence variant where at least one base of the final codon of an in-completely annotated transcript is changed

Synonymous coding cds-synon SO0001567SO0001588

A sequence variant where at least one base in the terminator codon ischanged but the terminator remains a sequence variant wherethere is no resulting change to the encoded amino acid

Coding unknown NA SO0001580 A sequence variant that changes the coding sequenceWithin mature miRNA NA SO0001620 A transcript variant located with the sequence of the mature miRNA50 UTR untranslated_5

UTR-5SO0001623 A UTR variant of the 50 UTR

30 UTR untranslated_3UTR-3

SO0001624 A UTR variant of the 30 UTR

Intronic intron SO0001627 A transcript variant occurring within an intronNMD transcript SO0001621 A variant in a transcript that is the target of NMDWithin non-coding

genencRNA SO0001619 A transcript variant of a non-coding RNA gene

Upstream near-gene-5 SO0001636SO0001635

A sequence variant located within 2 KB 50 of a gene a sequence variantlocated within 5 KB 50 of a gene

Downstream near-gene-3 SO0001634SO0001633

A sequence variant located within a half KB of the end of a gene a se-quence variant located within 5 KB of the end of a gene

Regulatory region NA SO0001566 A sequence variant located within a regulatory regionTranscription factor

binding motifNA SO0001782 A sequence variant located within a transcription factor-binding site

Intergenic intergenic SO0001628 A sequence variant located in the intergenic region between genes

SO identifiers and descriptions were obtained from MISO (wwwsequenceontologyorgmiso)

Interpreting functional effects of coding variants | 5

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 2: challenges in proteome-scale prediction, annotation and assessment

mutations using in silico approaches is an important step toward the identification of functionally significant or clinicallyactionable variants The relationships between genotypes and the expressed phenotypes are multilayered and biologicallycomplex such relationships present numerous challenges and at the same time offer various opportunities for the design of insilico variant assessment strategies Over the past decade many bioinformatics algorithms have been developed to predictfunctional consequences of single nucleotide variants in the protein coding regions In this review we provide an overview ofthe bioinformatics resources for the prediction annotation and visualization of coding single nucleotide variants We discussthe currently available approaches and major challenges from the perspective of protein sequence structure function andinteractions that require consideration when interpreting the impact of putatively functional variants We also discuss therelevance of incorporating integrated workflows for predicting the biomedical impact of the functionally important variationsencoded in a genome exome or transcriptome Finally we propose a framework to classify variant assessment approachesand strategies for incorporation of variant assessment within electronic health records

Key words human genome functional variant human variation variant interpretation mutation human proteome non-synonymous mutations prediction algorithms functional genomics sequence analysis structure analysis

Introduction

Genomic technologies are redefining the understanding of geno-typendashphenotype relationships In the early 2000s array-basedgenomic technologies enabled gene expression analysis usingmicroarrays followed by single nucleotide polymorphism (SNPgenetic variation observed in a population) genotyping platformsin mid-2000s and subsequently by low-cost high-throughputmassively parallel sequencing platforms in the late 2000s [1ndash3]Genomic sequencing data offer insights into the relationships be-tween the human genomic variations and various molecular anddisease phenotypes Sequencing a biological or clinical sample tocharacterize genomic exomic or transcriptomic variations usingnext-generation sequencing (NGS) technologies helps to identifygenomic variations underlying complex diseases Moreoverapproaches such as targeted sequencing of disease-susceptiblegenomic regions whole exome sequencing (WES) whole genomesequencing (WGS) and RNA sequencing (RNA-Seq) providedeeper insights into the genetic bases of familial diseases and abetter understanding of the biological processes underlying dis-ease phenotypes such as tumors

Over the past decade the genetic basis of several complex dis-eases and clinically relevant quantitative traits were examinedusing SNP genotyping array-based genome-wide associationstudies (GWAS) Subsequently sequencing [4ndash6] studies have im-proved the understanding of the associations between SNPs withclinically relevant traits and human diseases (See httpwwwebiacukfgptgwas) SNP arrays have uncovered the more com-mon variations while sequencing has unravelled much of therare variants spectrum The frequency of SNPs in the humangenome is approximately 1300 base pair (bp) SNPs are generallycategorized as common variants [in which the minor allele fre-quency (MAF)frac14 1ndash5] or rare variants (MAFlt 1) according totheir population frequencies Both GWAS and targeted WES andWGS studies have expanded the catalog of genotypendashphenotypeassociations and offered insights into the roles of previouslyuncharacterized genetic regions in complex diseases [7ndash12]

The rapid increase in the high-quality sequencing data gener-ation using low-cost NGS experimental platforms coupled withspeedy bioinformatics algorithms have enhanced the identifica-tion of a large number of sequence and structural variants vari-ations (characterized by genomic DNAgt 1 kb in size) SNPs areassociated with medically relevant phenotypes as well as dis-eases [13ndash15] and are often observed on a population scalewhereas single nucleotide variations (SNVs) are specific poly-morphisms observed in an individual Recent studies have high-lighted the roles of the different types of structural variants (copy

number variations insertions and deletions (indels) inversionstranslocations linking anchored split mapping gainloss) in thegenetic bases of several complex diseases [16ndash19] Furthermoresequencing studies have helped to identify rare personal variantsand variants of unknown significance (VUS)

Public repositories of genomic sequencing and variation datahave experienced an exponential growth in the past decade Forexample Single Nucleotide Polymorphism database (dbSNP) [20]and Ensembl Variation database [21] that archive short geneticvariants and structural variants and Database of GenomicVariants [22] that archives a catalog of curated large-scale gen-omic structural variants in the human genome are expandingSince the first GWAS study reported in 2005 so far genetic basisof 1251 traits were discovered These investigations also ledphenotypic annotations for 15 396 SNPs A large number of clin-ical-grade genomes exomes or transcriptomes sequenced forindividualized medicine [23ndash25] and population-scale sequencing[26] projects such as 1000 genomes [27] Genome10K (httpwwwgenome10korg) and UK10K (httpwwwuk10korg) will furtheradd to the size of variant-centric databases in the futureAnalysis of the data from the sequencing experiments can bebroadly divided into four major tasks (i) quality assessment ofthe sequencing reads (ii) alignment of the sequencing reads withthe reference genome (iii) variant calling and (iv) functional andor clinical assessment and prioritization of variants

The variants identified from the NGS studies present severaldata interpretation challenges in bioinformatics The initial datainference is a key filtering step in the identification and prioritiza-tion of a subset of variants for functionally important cues andvalidation studies The following sections describe a simpleframework to classify the available and widely used bioinfor-matics resources for prediction annotation and interpretation ofcoding SNVs We also discuss 10 different analytical themes thatcan potentially be investigated from a bioinformatics perspectiveto obtain a better understanding of coding SNVs

Landscape of genetic variants

The mutation spectrum of the human genome is complex andhas been classified based on diverse criteria such as the mode ofinheritance heterozygosity pattern impact on chromosome or al-leles impact on protein sequence structure and function impacton the population or evolutionary role and penetrance (Figure 1)On the other hand SNVs may be broadly classified as codingSNVs (functional variations) that perturb the function of a tran-script or a protein or variants may be classified as noncoding

2 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

SNVs (regulatory variations) that are located in the regulatoryelements [promoters untranslated regions (UTRs) enhancershuman accelerated regions noncoding RNA genes transcriptionfactor binding sites (TFBS)] that regulate the expression levels of atranscript or a protein Regulatory variants may also perturb geneor protein functions via complex cis or trans-activation mechan-isms and may thus play important roles in influencing the ex-pression and functions of other genes An example of a functionalvariant (P33S) in the ribonucleoside-diphosphate reductase sub-unit M2 B (RRM2B) gene associated with autosomal recessive pro-gressive external ophthalmoplegia is provided in Figure 2(i) and2(ii) Another example of a missense variant R363H (rs11556924) inzinc finger C3HC-type containing 1 (ZC3HC1) gene is associatedwith coronary heart disease [28] P33S in RRM2B is part of a con-served haribonucleotide reductase domain The variant R363H inZC3HC1 does not lie within a known functional domain and thusthe variation is located in an unassigned region of the proteinThe domain architectures of RRM2B and ZC3HC1 along with thelocation of functional variants are highlighted in Figure 2(ii)Regulatory variants may influence the expression levels of cis- ortrans-acting genes through gene regulation networks For ex-ample an intergenic variant (rs10811661) in the 9p21 locus whichis harbored near CDKN2A and CDKN2B on chromosome 9 wasassociated with myocardial infarction [29] and coronary heart dis-ease [9] A targeted deletion study in mice that investigated the9p21 region revealed a regulatory role of the variant in perturbingthe expression of two genes (Cdkn2a and Cdkn2b) via a cis-acting

mechanism [30] Another regulatory variant rs342293 (intergenicvariant between FLJ36031-PIK3CG) which influences a quantitativetrait (mean platelet volume) [31] perturbed a TFBS of ecotropicviral integration site-1 (EVI1) and influenced the expressionlevels of PIK3CG [32] The genomic locations of the regulatory vari-ants rs10811661 and rs342293 that were generated using theEnsembl Genome Browser are highlighted in Figure 2(iii)Regulatory variants too impact the biological function by alteringthe level of transcription that may influence protein levels as wellThe majority of regulatory variants are not within the protein cod-ing regions but may indirectly influence geneprotein function viaalternative splicing mechanisms [33] See the recent reviewsthat summarized the impact of regulatory variation of protein ex-pression and function for a detailed account of regulatory variants[34 35]

Prediction annotation and visualizationof coding SNVs

Scientific literature often refers to variants using diverse sets ofterms including recommended nomenclature [36] or standardterms from Sequence Ontology (SO) [37] In this review we broadlyclassify genomic variants as lsquofunctional variantsrsquo and lsquoregulatoryvariantsrsquo and we primarily focus on coding SNVs and computa-tional approaches for the prediction annotation visualization andinterpretation of coding SNVs The different terms used to define

Mu

tatio

n

Chromosomalallelic

SequenceStructure

Function

Population

Loss of Heterozygozity

Inheritance

Germ line mutation

Somatic mutation

Passenger mutation

Driver mutation

Harmful mutation

Benign mutation

Neutral mutation

Deleterious mutation

Advantageous mutation

Nearly neutral mutation

Point mutation

Frameshift mutation

Nonsense mutation

Missense mutation

Synonymous mutation

Non-synonymous mutation

Silent mutation

Transitions

Transversions

Deletions

Chromosomal inversions

Amplifications

Deletions

Chromosomal translocations

Loss of Heterozygozity

Interstitial deletions

Loss-of-function mutation

Lethal mutation

Gain-of-function mutation

Gain-of-function mutation

Dominant negative mutations

Allele frequency

Common variant

Rare variant

Private variant

Penetrance

High penetrance

Low penetrance

Recessive

Dominant

Homozygous

Heterozygous

Spontaneous mutation

Induced mutation

Figure 1 The human mutation spectrum

Interpreting functional effects of coding variants | 3

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

variants in the Ensembl Variation resources and dbSNP and theircorresponding SO identifiers and descriptions are summarized inTable 1 The interpretation of variants can be broadly divided intothree steps lsquoannotationrsquo lsquopredictionrsquo and lsquovisualizationrsquo

A given set of variant(s) identified from a sequencing studywill be segregated into various classes of genomic variants in thefirst step in variant annotation (Table 1) Cross-referencing thevariants with reference databases and clinical annotation data-bases like ClinVar [38] can be used to assess the novelty of gen-omic variants Fully automated pipelines can be used to annotatevariants based on coordinate specific mapping to genomic re-gions using proprietary andor public databases and various fea-tures associated with variant can be derived from suchannotation mapping and variant-specific layering approach Forexample a variant can be mapped to a protein-coding regionjunction regions or noncoding regions of the genome

Based on the location of the variant in the protein-coding ornoncoding regions variants (non-synonymous missense non-sense or frameshift variants) can be further examined to under-stand their impact on protein functions Tools like CombinedAnnotation-Dependent Depletion (CADD) [39] SIFT (as toleratedor damaging variants) [40] PolyPhen2 (as benign possibly

damaging or probably damaging variants) [41] or Condel (meta-predictor that combines prediction scores from multiple tools)[42] can be leveraged for predictive assessment of genetic vari-ants Annotation primarily provides localization of a genetic vari-ant using genome coordinates prediction aims to hypothesizethe probable impact of a particular genetic variant on function orregulation The combination of annotation and prediction pro-vides an integrated view of genomic variants Tools such assnpEff or the Ensembl Variant Predictor can be used to predict theimpact of a variation in comparison with the reference sequenceThe impact of a sequence variant with respect to the evolutionaryconservation can also be predicted or derived from GRANTHAMscore [43] genomic evolutionary rate profiling (GERP) score [44]phylogenetic P-value (PhyloP) score [45] and PhastCons score [46]The development of substitution matrices in the early 1970s wasinstrumental in fueling the design and implementation of com-putational approaches to infer the functional impact of genomicvariants [43 47] Matrices such as Point Accepted Mutations(PAM1 matrix substitution probabilities for sequences with a mu-tation rate of 1100 amino acids PAM250 matrix 250 mutations100 amino acids) BLOCK SUbstitution Matrix (BLOSUM) matrices[48] and sequence search algorithms designed to identify the

Figure 2 Examples of the coding (functional) and noncoding (regulatory) variants (i) Functional variant (Pro33Ser) in RRM2B associated with autosomal recessive progres-

sive external ophthalmoplegia visualized on a protein structure (PDB ID 2vux Quaternary assembly is generated using PISAPDBe) Functional variant (Pro33Ser) is high-

lighted in red color inside the green circle on chains A (part of loop) and B (part of helix) Visualization was created using UCSF Chimera (wwwcglucsfeduchimera) (ii)

Protein domain architectures and functional variants mapped to ii (a) RRM2B and ii (b) ZC3HC1 Ribonuc_red_smfrac14Ribonucleotide reductase domain zffrac14C3HC zinc finger-

like domain Rsm1frac14Rsm1-like domain Functional variants are highlighted using red vertical line Figure was generated using MyDomains (httpprositeexpasyorgmydo-

mains) (iii) Genomic localization of the regulatory variants (a) rs10811661 and (b) rs342293 The location of the variants are highlighted using a vertical red line Regulatory

variant rs10811661 regulated the expression of nearby genes CDKN2A and CDKN2B (highlighted in red boxes) Intergenic variant rs342293 located between FLJ36031

(CCDC71L)-PIK3CG is located in a TFBS of EVI1 that regulates the expression (repress) of PIK3CG Genomic regions were visualized using Ensembl Genome Browser v 67

4 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

similarity between any two sequences (pairwise sequence align-ment) and the evolutionary relationships between two or moresequences (multiple sequence alignments) have spawned the de-velopment of improved heuristic homology search tools [49 50]The early 2000s witnessed the development of several predictivemethods based on sequence conservation amino acid substitu-tions and perturbation of the local structural environment to as-sess the impact of functional variants in proteins [51ndash57] The kaks ratio (or dNdS) test which estimates the ratio of the totalnumber of non-synonymous substitutions per non-synonymoussites (dN) to the total number of synonymous substitutions to

synonymous sites (dS) is widely used to quantify the selectionpressure on functional genes [52] Furthermore bioinformaticstools databases and statistical methods for the identification an-notation and analysis of SNPs from genotyping and sequencingdata have been amply surveyed in literature [58ndash66]

Integrative approaches can provide additional annotationsfor variants such as the location of the variants within the con-served protein sequencestructure domains (contiguous unit inprotein sequence or structure with evidence of functional anno-tation from experimental or computational function associationmethods) within known functional sites Gene Ontology (GO)

Table 1 The naming convention used to describe the sequence variants in the Ensembl Variation resources and dbSNP

Ensembl variantconsequences

dbSNPfunctionalclasses

SO ID SO Definition

Essential splice site splice-3 SO0001574SO0001575

A splice variant that changes the two base regions at the 30 end of anintron a splice variant that changes the two base regions at the 50

end of an intronStop gained splice-5 SO0001587 A sequence variant whereby at least one base of a codon is changed

resulting in a premature stop codon leading to a shortenedtranscript

Stop lost nonsense SO0001578 A sequence variant where at least one base of the terminator codon(stop) is changed resulting in an elongated transcript

Complex indel NA SO0001577 A transcript variant with a complex INDELmdashInsertion or deletion thatspans an exonintron border or a coding sequenceUTR border

Frameshift coding frameshift SO0001589 A sequence variant which causes a disruption of the translationalreading frame because the number of nucleotides inserted ordeleted is not a multiple of three

Non-synonymouscoding

missense SO0001582SO0001652SO0001651SO0001583

A codon variant that changes at least one base of the first codon of atranscript an inframe non-synonymous variant that deletes basesfrom the coding sequence an inframe non-synonymous variantthat inserts bases into in the coding sequence a sequence variantwhere the change may be longer than three bases and at least onebase of a codon is changed resulting in a codon that encodes for adifferent amino acid

Splice site NA SO0001630 A sequence variant in which a change has occurred within the regionof the splice site either within 1ndash3 bases of the exon or 3ndash8 bases ofthe intron

Partial codon NA SO0001626 A sequence variant where at least one base of the final codon of an in-completely annotated transcript is changed

Synonymous coding cds-synon SO0001567SO0001588

A sequence variant where at least one base in the terminator codon ischanged but the terminator remains a sequence variant wherethere is no resulting change to the encoded amino acid

Coding unknown NA SO0001580 A sequence variant that changes the coding sequenceWithin mature miRNA NA SO0001620 A transcript variant located with the sequence of the mature miRNA50 UTR untranslated_5

UTR-5SO0001623 A UTR variant of the 50 UTR

30 UTR untranslated_3UTR-3

SO0001624 A UTR variant of the 30 UTR

Intronic intron SO0001627 A transcript variant occurring within an intronNMD transcript SO0001621 A variant in a transcript that is the target of NMDWithin non-coding

genencRNA SO0001619 A transcript variant of a non-coding RNA gene

Upstream near-gene-5 SO0001636SO0001635

A sequence variant located within 2 KB 50 of a gene a sequence variantlocated within 5 KB 50 of a gene

Downstream near-gene-3 SO0001634SO0001633

A sequence variant located within a half KB of the end of a gene a se-quence variant located within 5 KB of the end of a gene

Regulatory region NA SO0001566 A sequence variant located within a regulatory regionTranscription factor

binding motifNA SO0001782 A sequence variant located within a transcription factor-binding site

Intergenic intergenic SO0001628 A sequence variant located in the intergenic region between genes

SO identifiers and descriptions were obtained from MISO (wwwsequenceontologyorgmiso)

Interpreting functional effects of coding variants | 5

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 3: challenges in proteome-scale prediction, annotation and assessment

SNVs (regulatory variations) that are located in the regulatoryelements [promoters untranslated regions (UTRs) enhancershuman accelerated regions noncoding RNA genes transcriptionfactor binding sites (TFBS)] that regulate the expression levels of atranscript or a protein Regulatory variants may also perturb geneor protein functions via complex cis or trans-activation mechan-isms and may thus play important roles in influencing the ex-pression and functions of other genes An example of a functionalvariant (P33S) in the ribonucleoside-diphosphate reductase sub-unit M2 B (RRM2B) gene associated with autosomal recessive pro-gressive external ophthalmoplegia is provided in Figure 2(i) and2(ii) Another example of a missense variant R363H (rs11556924) inzinc finger C3HC-type containing 1 (ZC3HC1) gene is associatedwith coronary heart disease [28] P33S in RRM2B is part of a con-served haribonucleotide reductase domain The variant R363H inZC3HC1 does not lie within a known functional domain and thusthe variation is located in an unassigned region of the proteinThe domain architectures of RRM2B and ZC3HC1 along with thelocation of functional variants are highlighted in Figure 2(ii)Regulatory variants may influence the expression levels of cis- ortrans-acting genes through gene regulation networks For ex-ample an intergenic variant (rs10811661) in the 9p21 locus whichis harbored near CDKN2A and CDKN2B on chromosome 9 wasassociated with myocardial infarction [29] and coronary heart dis-ease [9] A targeted deletion study in mice that investigated the9p21 region revealed a regulatory role of the variant in perturbingthe expression of two genes (Cdkn2a and Cdkn2b) via a cis-acting

mechanism [30] Another regulatory variant rs342293 (intergenicvariant between FLJ36031-PIK3CG) which influences a quantitativetrait (mean platelet volume) [31] perturbed a TFBS of ecotropicviral integration site-1 (EVI1) and influenced the expressionlevels of PIK3CG [32] The genomic locations of the regulatory vari-ants rs10811661 and rs342293 that were generated using theEnsembl Genome Browser are highlighted in Figure 2(iii)Regulatory variants too impact the biological function by alteringthe level of transcription that may influence protein levels as wellThe majority of regulatory variants are not within the protein cod-ing regions but may indirectly influence geneprotein function viaalternative splicing mechanisms [33] See the recent reviewsthat summarized the impact of regulatory variation of protein ex-pression and function for a detailed account of regulatory variants[34 35]

Prediction annotation and visualizationof coding SNVs

Scientific literature often refers to variants using diverse sets ofterms including recommended nomenclature [36] or standardterms from Sequence Ontology (SO) [37] In this review we broadlyclassify genomic variants as lsquofunctional variantsrsquo and lsquoregulatoryvariantsrsquo and we primarily focus on coding SNVs and computa-tional approaches for the prediction annotation visualization andinterpretation of coding SNVs The different terms used to define

Mu

tatio

n

Chromosomalallelic

SequenceStructure

Function

Population

Loss of Heterozygozity

Inheritance

Germ line mutation

Somatic mutation

Passenger mutation

Driver mutation

Harmful mutation

Benign mutation

Neutral mutation

Deleterious mutation

Advantageous mutation

Nearly neutral mutation

Point mutation

Frameshift mutation

Nonsense mutation

Missense mutation

Synonymous mutation

Non-synonymous mutation

Silent mutation

Transitions

Transversions

Deletions

Chromosomal inversions

Amplifications

Deletions

Chromosomal translocations

Loss of Heterozygozity

Interstitial deletions

Loss-of-function mutation

Lethal mutation

Gain-of-function mutation

Gain-of-function mutation

Dominant negative mutations

Allele frequency

Common variant

Rare variant

Private variant

Penetrance

High penetrance

Low penetrance

Recessive

Dominant

Homozygous

Heterozygous

Spontaneous mutation

Induced mutation

Figure 1 The human mutation spectrum

Interpreting functional effects of coding variants | 3

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

variants in the Ensembl Variation resources and dbSNP and theircorresponding SO identifiers and descriptions are summarized inTable 1 The interpretation of variants can be broadly divided intothree steps lsquoannotationrsquo lsquopredictionrsquo and lsquovisualizationrsquo

A given set of variant(s) identified from a sequencing studywill be segregated into various classes of genomic variants in thefirst step in variant annotation (Table 1) Cross-referencing thevariants with reference databases and clinical annotation data-bases like ClinVar [38] can be used to assess the novelty of gen-omic variants Fully automated pipelines can be used to annotatevariants based on coordinate specific mapping to genomic re-gions using proprietary andor public databases and various fea-tures associated with variant can be derived from suchannotation mapping and variant-specific layering approach Forexample a variant can be mapped to a protein-coding regionjunction regions or noncoding regions of the genome

Based on the location of the variant in the protein-coding ornoncoding regions variants (non-synonymous missense non-sense or frameshift variants) can be further examined to under-stand their impact on protein functions Tools like CombinedAnnotation-Dependent Depletion (CADD) [39] SIFT (as toleratedor damaging variants) [40] PolyPhen2 (as benign possibly

damaging or probably damaging variants) [41] or Condel (meta-predictor that combines prediction scores from multiple tools)[42] can be leveraged for predictive assessment of genetic vari-ants Annotation primarily provides localization of a genetic vari-ant using genome coordinates prediction aims to hypothesizethe probable impact of a particular genetic variant on function orregulation The combination of annotation and prediction pro-vides an integrated view of genomic variants Tools such assnpEff or the Ensembl Variant Predictor can be used to predict theimpact of a variation in comparison with the reference sequenceThe impact of a sequence variant with respect to the evolutionaryconservation can also be predicted or derived from GRANTHAMscore [43] genomic evolutionary rate profiling (GERP) score [44]phylogenetic P-value (PhyloP) score [45] and PhastCons score [46]The development of substitution matrices in the early 1970s wasinstrumental in fueling the design and implementation of com-putational approaches to infer the functional impact of genomicvariants [43 47] Matrices such as Point Accepted Mutations(PAM1 matrix substitution probabilities for sequences with a mu-tation rate of 1100 amino acids PAM250 matrix 250 mutations100 amino acids) BLOCK SUbstitution Matrix (BLOSUM) matrices[48] and sequence search algorithms designed to identify the

Figure 2 Examples of the coding (functional) and noncoding (regulatory) variants (i) Functional variant (Pro33Ser) in RRM2B associated with autosomal recessive progres-

sive external ophthalmoplegia visualized on a protein structure (PDB ID 2vux Quaternary assembly is generated using PISAPDBe) Functional variant (Pro33Ser) is high-

lighted in red color inside the green circle on chains A (part of loop) and B (part of helix) Visualization was created using UCSF Chimera (wwwcglucsfeduchimera) (ii)

Protein domain architectures and functional variants mapped to ii (a) RRM2B and ii (b) ZC3HC1 Ribonuc_red_smfrac14Ribonucleotide reductase domain zffrac14C3HC zinc finger-

like domain Rsm1frac14Rsm1-like domain Functional variants are highlighted using red vertical line Figure was generated using MyDomains (httpprositeexpasyorgmydo-

mains) (iii) Genomic localization of the regulatory variants (a) rs10811661 and (b) rs342293 The location of the variants are highlighted using a vertical red line Regulatory

variant rs10811661 regulated the expression of nearby genes CDKN2A and CDKN2B (highlighted in red boxes) Intergenic variant rs342293 located between FLJ36031

(CCDC71L)-PIK3CG is located in a TFBS of EVI1 that regulates the expression (repress) of PIK3CG Genomic regions were visualized using Ensembl Genome Browser v 67

4 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

similarity between any two sequences (pairwise sequence align-ment) and the evolutionary relationships between two or moresequences (multiple sequence alignments) have spawned the de-velopment of improved heuristic homology search tools [49 50]The early 2000s witnessed the development of several predictivemethods based on sequence conservation amino acid substitu-tions and perturbation of the local structural environment to as-sess the impact of functional variants in proteins [51ndash57] The kaks ratio (or dNdS) test which estimates the ratio of the totalnumber of non-synonymous substitutions per non-synonymoussites (dN) to the total number of synonymous substitutions to

synonymous sites (dS) is widely used to quantify the selectionpressure on functional genes [52] Furthermore bioinformaticstools databases and statistical methods for the identification an-notation and analysis of SNPs from genotyping and sequencingdata have been amply surveyed in literature [58ndash66]

Integrative approaches can provide additional annotationsfor variants such as the location of the variants within the con-served protein sequencestructure domains (contiguous unit inprotein sequence or structure with evidence of functional anno-tation from experimental or computational function associationmethods) within known functional sites Gene Ontology (GO)

Table 1 The naming convention used to describe the sequence variants in the Ensembl Variation resources and dbSNP

Ensembl variantconsequences

dbSNPfunctionalclasses

SO ID SO Definition

Essential splice site splice-3 SO0001574SO0001575

A splice variant that changes the two base regions at the 30 end of anintron a splice variant that changes the two base regions at the 50

end of an intronStop gained splice-5 SO0001587 A sequence variant whereby at least one base of a codon is changed

resulting in a premature stop codon leading to a shortenedtranscript

Stop lost nonsense SO0001578 A sequence variant where at least one base of the terminator codon(stop) is changed resulting in an elongated transcript

Complex indel NA SO0001577 A transcript variant with a complex INDELmdashInsertion or deletion thatspans an exonintron border or a coding sequenceUTR border

Frameshift coding frameshift SO0001589 A sequence variant which causes a disruption of the translationalreading frame because the number of nucleotides inserted ordeleted is not a multiple of three

Non-synonymouscoding

missense SO0001582SO0001652SO0001651SO0001583

A codon variant that changes at least one base of the first codon of atranscript an inframe non-synonymous variant that deletes basesfrom the coding sequence an inframe non-synonymous variantthat inserts bases into in the coding sequence a sequence variantwhere the change may be longer than three bases and at least onebase of a codon is changed resulting in a codon that encodes for adifferent amino acid

Splice site NA SO0001630 A sequence variant in which a change has occurred within the regionof the splice site either within 1ndash3 bases of the exon or 3ndash8 bases ofthe intron

Partial codon NA SO0001626 A sequence variant where at least one base of the final codon of an in-completely annotated transcript is changed

Synonymous coding cds-synon SO0001567SO0001588

A sequence variant where at least one base in the terminator codon ischanged but the terminator remains a sequence variant wherethere is no resulting change to the encoded amino acid

Coding unknown NA SO0001580 A sequence variant that changes the coding sequenceWithin mature miRNA NA SO0001620 A transcript variant located with the sequence of the mature miRNA50 UTR untranslated_5

UTR-5SO0001623 A UTR variant of the 50 UTR

30 UTR untranslated_3UTR-3

SO0001624 A UTR variant of the 30 UTR

Intronic intron SO0001627 A transcript variant occurring within an intronNMD transcript SO0001621 A variant in a transcript that is the target of NMDWithin non-coding

genencRNA SO0001619 A transcript variant of a non-coding RNA gene

Upstream near-gene-5 SO0001636SO0001635

A sequence variant located within 2 KB 50 of a gene a sequence variantlocated within 5 KB 50 of a gene

Downstream near-gene-3 SO0001634SO0001633

A sequence variant located within a half KB of the end of a gene a se-quence variant located within 5 KB of the end of a gene

Regulatory region NA SO0001566 A sequence variant located within a regulatory regionTranscription factor

binding motifNA SO0001782 A sequence variant located within a transcription factor-binding site

Intergenic intergenic SO0001628 A sequence variant located in the intergenic region between genes

SO identifiers and descriptions were obtained from MISO (wwwsequenceontologyorgmiso)

Interpreting functional effects of coding variants | 5

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 4: challenges in proteome-scale prediction, annotation and assessment

variants in the Ensembl Variation resources and dbSNP and theircorresponding SO identifiers and descriptions are summarized inTable 1 The interpretation of variants can be broadly divided intothree steps lsquoannotationrsquo lsquopredictionrsquo and lsquovisualizationrsquo

A given set of variant(s) identified from a sequencing studywill be segregated into various classes of genomic variants in thefirst step in variant annotation (Table 1) Cross-referencing thevariants with reference databases and clinical annotation data-bases like ClinVar [38] can be used to assess the novelty of gen-omic variants Fully automated pipelines can be used to annotatevariants based on coordinate specific mapping to genomic re-gions using proprietary andor public databases and various fea-tures associated with variant can be derived from suchannotation mapping and variant-specific layering approach Forexample a variant can be mapped to a protein-coding regionjunction regions or noncoding regions of the genome

Based on the location of the variant in the protein-coding ornoncoding regions variants (non-synonymous missense non-sense or frameshift variants) can be further examined to under-stand their impact on protein functions Tools like CombinedAnnotation-Dependent Depletion (CADD) [39] SIFT (as toleratedor damaging variants) [40] PolyPhen2 (as benign possibly

damaging or probably damaging variants) [41] or Condel (meta-predictor that combines prediction scores from multiple tools)[42] can be leveraged for predictive assessment of genetic vari-ants Annotation primarily provides localization of a genetic vari-ant using genome coordinates prediction aims to hypothesizethe probable impact of a particular genetic variant on function orregulation The combination of annotation and prediction pro-vides an integrated view of genomic variants Tools such assnpEff or the Ensembl Variant Predictor can be used to predict theimpact of a variation in comparison with the reference sequenceThe impact of a sequence variant with respect to the evolutionaryconservation can also be predicted or derived from GRANTHAMscore [43] genomic evolutionary rate profiling (GERP) score [44]phylogenetic P-value (PhyloP) score [45] and PhastCons score [46]The development of substitution matrices in the early 1970s wasinstrumental in fueling the design and implementation of com-putational approaches to infer the functional impact of genomicvariants [43 47] Matrices such as Point Accepted Mutations(PAM1 matrix substitution probabilities for sequences with a mu-tation rate of 1100 amino acids PAM250 matrix 250 mutations100 amino acids) BLOCK SUbstitution Matrix (BLOSUM) matrices[48] and sequence search algorithms designed to identify the

Figure 2 Examples of the coding (functional) and noncoding (regulatory) variants (i) Functional variant (Pro33Ser) in RRM2B associated with autosomal recessive progres-

sive external ophthalmoplegia visualized on a protein structure (PDB ID 2vux Quaternary assembly is generated using PISAPDBe) Functional variant (Pro33Ser) is high-

lighted in red color inside the green circle on chains A (part of loop) and B (part of helix) Visualization was created using UCSF Chimera (wwwcglucsfeduchimera) (ii)

Protein domain architectures and functional variants mapped to ii (a) RRM2B and ii (b) ZC3HC1 Ribonuc_red_smfrac14Ribonucleotide reductase domain zffrac14C3HC zinc finger-

like domain Rsm1frac14Rsm1-like domain Functional variants are highlighted using red vertical line Figure was generated using MyDomains (httpprositeexpasyorgmydo-

mains) (iii) Genomic localization of the regulatory variants (a) rs10811661 and (b) rs342293 The location of the variants are highlighted using a vertical red line Regulatory

variant rs10811661 regulated the expression of nearby genes CDKN2A and CDKN2B (highlighted in red boxes) Intergenic variant rs342293 located between FLJ36031

(CCDC71L)-PIK3CG is located in a TFBS of EVI1 that regulates the expression (repress) of PIK3CG Genomic regions were visualized using Ensembl Genome Browser v 67

4 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

similarity between any two sequences (pairwise sequence align-ment) and the evolutionary relationships between two or moresequences (multiple sequence alignments) have spawned the de-velopment of improved heuristic homology search tools [49 50]The early 2000s witnessed the development of several predictivemethods based on sequence conservation amino acid substitu-tions and perturbation of the local structural environment to as-sess the impact of functional variants in proteins [51ndash57] The kaks ratio (or dNdS) test which estimates the ratio of the totalnumber of non-synonymous substitutions per non-synonymoussites (dN) to the total number of synonymous substitutions to

synonymous sites (dS) is widely used to quantify the selectionpressure on functional genes [52] Furthermore bioinformaticstools databases and statistical methods for the identification an-notation and analysis of SNPs from genotyping and sequencingdata have been amply surveyed in literature [58ndash66]

Integrative approaches can provide additional annotationsfor variants such as the location of the variants within the con-served protein sequencestructure domains (contiguous unit inprotein sequence or structure with evidence of functional anno-tation from experimental or computational function associationmethods) within known functional sites Gene Ontology (GO)

Table 1 The naming convention used to describe the sequence variants in the Ensembl Variation resources and dbSNP

Ensembl variantconsequences

dbSNPfunctionalclasses

SO ID SO Definition

Essential splice site splice-3 SO0001574SO0001575

A splice variant that changes the two base regions at the 30 end of anintron a splice variant that changes the two base regions at the 50

end of an intronStop gained splice-5 SO0001587 A sequence variant whereby at least one base of a codon is changed

resulting in a premature stop codon leading to a shortenedtranscript

Stop lost nonsense SO0001578 A sequence variant where at least one base of the terminator codon(stop) is changed resulting in an elongated transcript

Complex indel NA SO0001577 A transcript variant with a complex INDELmdashInsertion or deletion thatspans an exonintron border or a coding sequenceUTR border

Frameshift coding frameshift SO0001589 A sequence variant which causes a disruption of the translationalreading frame because the number of nucleotides inserted ordeleted is not a multiple of three

Non-synonymouscoding

missense SO0001582SO0001652SO0001651SO0001583

A codon variant that changes at least one base of the first codon of atranscript an inframe non-synonymous variant that deletes basesfrom the coding sequence an inframe non-synonymous variantthat inserts bases into in the coding sequence a sequence variantwhere the change may be longer than three bases and at least onebase of a codon is changed resulting in a codon that encodes for adifferent amino acid

Splice site NA SO0001630 A sequence variant in which a change has occurred within the regionof the splice site either within 1ndash3 bases of the exon or 3ndash8 bases ofthe intron

Partial codon NA SO0001626 A sequence variant where at least one base of the final codon of an in-completely annotated transcript is changed

Synonymous coding cds-synon SO0001567SO0001588

A sequence variant where at least one base in the terminator codon ischanged but the terminator remains a sequence variant wherethere is no resulting change to the encoded amino acid

Coding unknown NA SO0001580 A sequence variant that changes the coding sequenceWithin mature miRNA NA SO0001620 A transcript variant located with the sequence of the mature miRNA50 UTR untranslated_5

UTR-5SO0001623 A UTR variant of the 50 UTR

30 UTR untranslated_3UTR-3

SO0001624 A UTR variant of the 30 UTR

Intronic intron SO0001627 A transcript variant occurring within an intronNMD transcript SO0001621 A variant in a transcript that is the target of NMDWithin non-coding

genencRNA SO0001619 A transcript variant of a non-coding RNA gene

Upstream near-gene-5 SO0001636SO0001635

A sequence variant located within 2 KB 50 of a gene a sequence variantlocated within 5 KB 50 of a gene

Downstream near-gene-3 SO0001634SO0001633

A sequence variant located within a half KB of the end of a gene a se-quence variant located within 5 KB of the end of a gene

Regulatory region NA SO0001566 A sequence variant located within a regulatory regionTranscription factor

binding motifNA SO0001782 A sequence variant located within a transcription factor-binding site

Intergenic intergenic SO0001628 A sequence variant located in the intergenic region between genes

SO identifiers and descriptions were obtained from MISO (wwwsequenceontologyorgmiso)

Interpreting functional effects of coding variants | 5

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 5: challenges in proteome-scale prediction, annotation and assessment

similarity between any two sequences (pairwise sequence align-ment) and the evolutionary relationships between two or moresequences (multiple sequence alignments) have spawned the de-velopment of improved heuristic homology search tools [49 50]The early 2000s witnessed the development of several predictivemethods based on sequence conservation amino acid substitu-tions and perturbation of the local structural environment to as-sess the impact of functional variants in proteins [51ndash57] The kaks ratio (or dNdS) test which estimates the ratio of the totalnumber of non-synonymous substitutions per non-synonymoussites (dN) to the total number of synonymous substitutions to

synonymous sites (dS) is widely used to quantify the selectionpressure on functional genes [52] Furthermore bioinformaticstools databases and statistical methods for the identification an-notation and analysis of SNPs from genotyping and sequencingdata have been amply surveyed in literature [58ndash66]

Integrative approaches can provide additional annotationsfor variants such as the location of the variants within the con-served protein sequencestructure domains (contiguous unit inprotein sequence or structure with evidence of functional anno-tation from experimental or computational function associationmethods) within known functional sites Gene Ontology (GO)

Table 1 The naming convention used to describe the sequence variants in the Ensembl Variation resources and dbSNP

Ensembl variantconsequences

dbSNPfunctionalclasses

SO ID SO Definition

Essential splice site splice-3 SO0001574SO0001575

A splice variant that changes the two base regions at the 30 end of anintron a splice variant that changes the two base regions at the 50

end of an intronStop gained splice-5 SO0001587 A sequence variant whereby at least one base of a codon is changed

resulting in a premature stop codon leading to a shortenedtranscript

Stop lost nonsense SO0001578 A sequence variant where at least one base of the terminator codon(stop) is changed resulting in an elongated transcript

Complex indel NA SO0001577 A transcript variant with a complex INDELmdashInsertion or deletion thatspans an exonintron border or a coding sequenceUTR border

Frameshift coding frameshift SO0001589 A sequence variant which causes a disruption of the translationalreading frame because the number of nucleotides inserted ordeleted is not a multiple of three

Non-synonymouscoding

missense SO0001582SO0001652SO0001651SO0001583

A codon variant that changes at least one base of the first codon of atranscript an inframe non-synonymous variant that deletes basesfrom the coding sequence an inframe non-synonymous variantthat inserts bases into in the coding sequence a sequence variantwhere the change may be longer than three bases and at least onebase of a codon is changed resulting in a codon that encodes for adifferent amino acid

Splice site NA SO0001630 A sequence variant in which a change has occurred within the regionof the splice site either within 1ndash3 bases of the exon or 3ndash8 bases ofthe intron

Partial codon NA SO0001626 A sequence variant where at least one base of the final codon of an in-completely annotated transcript is changed

Synonymous coding cds-synon SO0001567SO0001588

A sequence variant where at least one base in the terminator codon ischanged but the terminator remains a sequence variant wherethere is no resulting change to the encoded amino acid

Coding unknown NA SO0001580 A sequence variant that changes the coding sequenceWithin mature miRNA NA SO0001620 A transcript variant located with the sequence of the mature miRNA50 UTR untranslated_5

UTR-5SO0001623 A UTR variant of the 50 UTR

30 UTR untranslated_3UTR-3

SO0001624 A UTR variant of the 30 UTR

Intronic intron SO0001627 A transcript variant occurring within an intronNMD transcript SO0001621 A variant in a transcript that is the target of NMDWithin non-coding

genencRNA SO0001619 A transcript variant of a non-coding RNA gene

Upstream near-gene-5 SO0001636SO0001635

A sequence variant located within 2 KB 50 of a gene a sequence variantlocated within 5 KB 50 of a gene

Downstream near-gene-3 SO0001634SO0001633

A sequence variant located within a half KB of the end of a gene a se-quence variant located within 5 KB of the end of a gene

Regulatory region NA SO0001566 A sequence variant located within a regulatory regionTranscription factor

binding motifNA SO0001782 A sequence variant located within a transcription factor-binding site

Intergenic intergenic SO0001628 A sequence variant located in the intergenic region between genes

SO identifiers and descriptions were obtained from MISO (wwwsequenceontologyorgmiso)

Interpreting functional effects of coding variants | 5

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 6: challenges in proteome-scale prediction, annotation and assessment

terms and pathways associated with the variant-containinggenes associations mapped within genetic databases such asdbSNP Ensembl Online Mendelian Inheritance in Man [67]Human Gene Mutation Database [68] Catalogue of SomaticMutations in Cancer (COSMIC) [69] and other clinically relevantgenomic regions also enable enhanced variant annotationTogether integrative data analysis platforms (such asTargetMine [70]) and integrated annotation tools such asANNOVAR [71] or webservers that can handle Variant CallFormat (VCF) [72] files facilitate the annotations for a largenumber of sequence variants in a small amount of timeAdopting a standard set of terms from the SO to define the typeof variants would also help in streamlining the interoperabilityof results from the different prediction tools A comprehensivelist of tools for rapid variant annotation is provided in Table 2

The visualization of a variant in the context of multiplelayers of biological information is helpful to interpret and priori-tize variants for functional studies A growing number of gen-ome browsers and data visualization libraries enable theinteractive and static visualizations of variants in the context ofthe human genome or transcriptome with biological clinicaland population scale annotation data compiled from multipleresources Genome browsers [Integrative Genomics Viewer(IGV) UCSC Genome Browsers NCBI Sequence Viewer etc] en-able the visualization of variants in the context of a referencegenome Resources such as Distributed Annotation System(DAS) and DASTY (a protein-centric DAS client) can be leveragedfor interactive visualization of coding variants in a protein withrich annotations A list of genome browsers and visualizationlibraries capable of visualizing variants and multiple annotationcomponents is provided in Table 3

Text mining and natural language processingfor knowledge aggregation for SNVs

Biomedical knowledge about relationships between genes dis-eases phenotypes and genetic variations are scattered across alarge number of unstructured literature databases Applicationof natural language processing and text-mining thereforeoffers a useful approach for function assignment of codingSNVs Further text mining of large biomedical literature data-bases like PubMed and Medline helps to provide clues for fur-ther investigations and leads to hypothesis [73] For examplePhenGen [74] offers links to a variety of literature evidence tosupport genotypendashphenotype connections Integrated databases

like T-HOD database [75] and PolySearch [76] provide text-mining tools to derive meaningful biological inferences tointerpret coding SNVs

Emerging challenges in annotation andinterpretation of coding SNVs

The growing number of tools for the prediction annotation andvisualization of coding SNVs can address several gaps in thecurrent state of knowledge on variant interpretation The devel-opment of new algorithms for variant interpretation could beconsidered for several emerging themes of the protein se-quence-structure-function paradigm Several tools [See Tables4 5 and 6] are currently available to assess sequence and struc-ture-based features yet a reliable interpretation on how a gen-omic variant could perturb a protein or a protein network isoften a challenging task

VUS (also known as incidental variations or secondary vari-ations) are a class of variants hitherto unknown in known dis-ease genes but the influence of these variants on the diseasephenotype is largely unknown Pilot data from 1000 Genomesproject on exon capture sequencing of 1092 individuals selectedfrom 14 populations across Europe East Asia sub-SaharanAfrica and the Americas reported an observed frequency of 2500non-synonymous variants at conserved positions 20ndash40 vari-ants identified as damaging 24 at conserved sites and about 150loss-of-function (LOF) variants that includes stop-gains frame-shift insertions and deletions (indels) in a coding sequence anddisruptions to essential splice sites Emerging evidence suggeststhat rare variants may have a higher collective impact on dis-ease incidence rate than common variants and considering theallele frequency as a metric to assess the clinical impact of thevariant may help to assess the clinical impact of VUS Majorityof the variants identified in individuals in 1000 Genomes projectare common with gt5 of MAF or low frequency (MAF 05ndash5)than rare variants (MAF lt05) Rare variant frequencies esti-mated as 130ndash400 non-synonymous variants per individual thatincludes 10ndash20 LOF variants 2ndash5 damaging mutations and 1ndash2variants identified from cancer genome sequencing projects[77ndash79] As an ever-increasing number of VUS are being charac-terized their annotation and interpretation is becoming morechallenging With the advent of WGS and WES in clinical set-tings the repertoire of VUS associated with complex chronicand rare diseases is rapidly expanding We surveyed theHuman polymorphisms and disease mutations index (humsa-vartxt) to gather unclassified variants reported in UniProtKB acurated database of functional information on proteins Thecurrent release (2014_05) of the humsavar lists 6564 variants aslsquoUnclassified variantsrsquo these variants were mapped to 1910 pro-tein coding genes A manually curated annotation of the vari-ants and disease ontology-based disease term-gene enrichmentanalyses indicated that the genes were largely from cancer pa-tients (322 genes tested Plt 005) We noted that 497 genesencoded polymorphism disease and unclassified variants sug-gesting that VUS-labeled variants may directly influence thegenes that are relevant to human diseases and clinically rele-vant phenotypes A proportional Venn diagram of genesmapped to variants labeled as lsquoPolymorphismrsquo lsquoDiseasersquo andlsquoUnclassifiedrsquo is illustrated in Figure 3 The top 10 disease ontol-ogy terms enriched among the genes within the three classes ofvariants are summarized in Table 2 We observed that certaindiseases were represented by genes that share multiple classesof variants thereby suggesting that unclassified variants may

Table 2 Top-10 terms from a gene-disease enrichment analyses per-formed using list of genes with shared polymorphism disease andunclassified variants using disease ontology

DO term No ofgenes

P-values(Bonferronicorrected)

Congenital abnormality 38 913E-34Cancer 64 7222E-33Diabetes mellitus 39 8913E-23Breast cancer 36 356E-17Atherosclerosis 26 1478E-16Retinal disease 16 2044E-16Adenovirus infection 15 8816E-14Hypertension 21 2072E-13Alzheimerrsquos disease 22 7548E-13

6 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 7: challenges in proteome-scale prediction, annotation and assessment

play a key role in certain clinical conditions such as cancerschronic diseases (diabetes atherosclerosis hypertension andkidney disease) neurodegenerative disease (Alzheimerrsquos dis-ease) and infection (Plt 005 Bonferroni corrected) Annotatingfunctional coding SNVs including VUS validated using an or-thogonal method also requires detailed sequence structure andinteraction-based analyses In this context we discuss some ofthe predictive features and analytical approaches that onceincorporated in coding variant annotation algorithms may po-tentially enhance the interpretation of the functional variantson a proteome-wide scale

Sequence and structural propertiesperturbed by coding SNVs

A protein performs its defined function after attaining a specifictertiary or quaternary structure [80 81] This is often mediatedby cross-links of inter-chain and intra-chain amino-acid residueinteractions within a protein These interactions (hydrogenbond disulphide bonds salt bridges ionic interactions electro-static interactions hydrophobic interactions etc) stabilize the

fold of individual protein chains and the overall quaternary as-sembly Individual amino acids within a protein are under theinfluence of the varied elements of the sequence-structure-function paradigm that modulate their biochemical role [82]Primarily a residue may be located within an evolutionarilyconserved compact globular domain [83] or in an unassigned re-gion with no known protein domain association [84] A residuecan be a part of a motif which can be functional [85] that of alinear motif [86] propeptide [87] signaling peptide [88] or a partof structural [89] and spatially interacting sites which partici-pate in higher order interactions [90] catalytic sites [91] ligandbinding sites [92] and allosteric sites [93] associated with exten-sive inter-chain and intra-chain interactions [90] based onthe oligomeric property of the proteins Amino acids can also bepart of specific sub-cellular localization signals that directthe proteins to specific locations in the cell [94] The lengths ofthe proteins [62] sequential localization (N-terminalC-terminal or other region of protein sequence) and the locationof a coding SNV with respect to the surfacemdashinterior or inter-face of the protein structuremdashcould influence disease manifest-ation [95]

Table 3 Rapid variant annotation tools

Name Description URL

ANNOVAR Efficient software tool to use up-to-date information tofunctionally annotate genetic variants detected fromdiverse genomes

httpwwwopenbioinformaticsorgannovar

AnnTools Comprehensive and versatile annotation toolkit for gen-omic variants

httpanntoolssourceforgenet

dbNSFP A lightweight database of human non-synonymousSNPs and their functional predictions

httpssitesgooglecomsitejpopgendbNSFP

EVA An efficient and versatile tool for filtering strategies inmedical genomics

httpplateforme-genomique-iribuniv-rouenfrEVAindexphp

Exome VariantServer

Provides different calculated values (GERP GRANTHAMetc) and annotations for SNPs

httpevsgswashingtoneduEVS

gSearch gSearch compares sequence variants in the GenomeVariation Format (GVF) or VCF with a pre-compiledannotation or with variants in other genomes

httpmlssuackrgSearchindexhtml

HugeSeq A pipeline for detection and annotation of geneticvariations

httphugeseqsnyderlaborg

MuSiC Comprehensive mutational analysis pipeline to segre-gate passenger and driver mutations from cancergenomes

httpgmtgenomewustledugenome-musiccurrent

NGS-SNP Collection of command-line scripts for providing richannotations for SNPs

httpstothardafnsualbertacadownloadsNGS-SNP

SeattleSeq Annotation Provides annotation of known and novel SNPs httpsnpgswashingtoneduSeattleSeqAnnotationsnpEff Variant annotation and effect prediction tool httpsnpeffsourceforgenetSVA Software system designed for annotation and visualiza-

tion of genetic variantshttpwwwsvaprojectorg

STORMSeq Cloud computing solution for read mapping read clean-ing and variant calling and annotation

httpsgithubcomkonradjkstormseq

TREAT Targeted RE-sequencing Annotation Tool httpndcmayoedumayoresearchbiostatstand-alone-packagescfm

VAAST Probabilistic search tool for identifying damaged genesand their disease-causing variants in personal gen-ome sequences

httpwwwyandell-laborgsoftwarevaasthtml

VARIANT VARIANT (VARIant ANalysis Tool) can report the func-tional properties of any variant in all the humanmouse or rat genes

httpvariantbioinfocipfes

Variant Reporter Generate a report of known variants and functionalconsequences

wwwncbinlmnihgovvariationtoolsreporter

Variant Tools Tool for the annotation selection and analysis of vari-ants in the context of next-gen sequencing analysis

httpvarianttoolssourceforgenet

Interpreting functional effects of coding variants | 7

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 8: challenges in proteome-scale prediction, annotation and assessment

Table 4 Genome browsers and biological data visualization libraries

Name Description URL

1000 Genomesbrowser

Genome browser to access data from 1000 Genomesproject

httpwwwncbinlmnihgovvariationtools1000genomes

AnnoJ Web 20 application designed for visualizing sequencingand annotation data

httpwwwannojorg

Artemis Genome Browser and Annotation Tool httpwwwsangeracukresourcessoftwareartemis

BioGraphics Perl modules for biological data visualization httpsearchcpanorgdistBio-GraphicsBioGraphics (Ruby) Ruby library for drawing overviews of genomic regions httpbio-graphicsrubyforgeorgBluejay Java-based integrated computational environment for the

exploration of genomic datahttpbluejayucalgaryca

CGView Circular Genome Viewer httpwishartbiologyualbertacacgviewCircos Circos is a software package for visualizing genomic data

and annotations in circular layouthttpcircosca

Dalliance Interactive genome viewer which runs directly in a modernweb browser

httpwwwbiodallianceorg

DAS DAS is an integrated visualization toolkit to share and col-late genomic annotation information

wwwbiodasorg

DASTY Web client for visualizing protein sequence feature infor-mation using DAS

httpwwwebiacukdasty

DNAPlotter DNAPlotter can be used to generate images of circular andlinear DNA maps to display regions and annotations ofinterest

httpwwwsangeracukresourcessoftwarednaplotter

Ensembl GenomeBrowser

Ensembl Genome Browser enables the visualization of gen-omic and transcriptomic sequence and related informa-tion for several vertebrate and non-vertebrate species

httpuseastensemblorgindexhtml

GASV Geometric Analysis of Structural Variants httpcompbiocsbrownedusoftwarehtmlGbrowse Generic Genome Browser (GBrowse) is a genome viewer de-

veloped as part of Generic Model Organism Database(GMOD) project

httpgmodorgwikiGBrowse

GENBOREE Customizable genome browser httpgenboreeorgjava-binloginjspGeneViTo JAVA-based workbench for genome-wide analysis through

visual interactionhttpathinabioluoagrbioinformaticsGENEVITO

GenomeGraphs R-based interface to plot genomic information fromEnsembl

httpwwwbioconductororgpackagesreleasebiochtmlGenomeGraphshtml

GenomePixelizer Tool to generate custom images of genomes out of thegiven set of genes

httpwwwatgcorgGenomePixelizer

GenomeTools Versatile genome analysis software httpgenometoolsorgGenomeView Next-generation stand-alone genome browser and editor httpgenomevieworgGremlin Interactive visualization model for analyzing genomic

rearrangementshttpcompbiocsbrownedusoftwarehtml

IGB Integrated genome browser httpbiovizorgigbindexhtmlIGV Integrative Genomics Viewer httpwwwbroadinstituteorgigvJbrowse Genome browser with a fully dynamic AJAX interface httpgmodorgwikiJBrowsejsDAS JavaScript client library for the DAS httpwwwebiacukdastyebihtmljsdashtmlMagicViewer Integrated solution for NGS data visualization and genetic

variation detection and annotationhttpbioinformaticszjcnmagicviewerindexphp

NCBI GraphicalSequence Viewer

Graphical display for the Nucleotide and Protein sequences httpwwwncbinlmnihgovprojectssviewer

Rover Genome browser framework to build custom genomic tools httpchmille4githubcomRoversitehomehtmlRviewer Interactive online tool for comparing and prioritizing gen-

omic regionshttprviewerlblgovrviewer

Savant Genome Browser for high-throughput sequencing data httpgenomesavantcomScribl HTML5 Canvas-based graphics library for visualization of

genomic data and annotationshttpchmille4githubcomScribl

UCSC GenomeBrowser

Interactive genome browser that provide access to se-quence data from different species integrated with alarge collection of layered annotations from experimentsand prediction algorithms

httpgenomeucsceducgi-binhgGateway

VISTA Browser Visualization of pairwise and multiple alignments of wholegenome assemblies

httppipelinelblgovcgi-bingateway2selectorfrac14vista

Whole GenomerVISTA

Visualization of TFBS that are conserved between speciesand overrepresented in upstream regions of groups ofgenes

httpgenomelblgovvistaindexshtml

8 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 9: challenges in proteome-scale prediction, annotation and assessment

Previously studies have established (Table 4) that functionalvariants may impact the overall structural conformation andfunction of proteins (ligand binding substrate specificity pro-teinndashprotein interaction transcription factor-target bindingetc) Residue-specific interactions can play a key role in suchvariations Sequence-based investigations will further help toidentify linear sequence motifs that confer diversity in the splicevariants [6] but the identification and analysis of the impact ofpoint mutations on spatially distributed structural motifs mayrequire structural data [55] However the limited availability ofstructural data compared with sequence data may pose add-itional challenges for incorporating structural analysis of thefunctional variants [96] Moreover additional challenges ariseowing to the dynamic protein conformations and allosteric orother long-distance effects (including higher order interactions)of few mutations on the activity of the protein

UniProtKB provides extensive information about residueproperties and curated information about globular domains (viaannotations derived from Pfam SMART or InterPro)posttranslational modifications (PTMs) and several bioinfor-matics tools and webservers are available to investigate residueproperties and structural properties (See Table 5 for a list oftools to predict various sequence features See URL httpbioinformaticsca links_directorycategoryprotein for list of toolsavailable for different protein-centric analyses)

Occasionally multiple sequence or structural features canbe annotated onto a single variant for example a residue couldbe part of a functional motif andor a structural motif and cod-ing SNVs may perturb either or both features Quantifying mul-tiple features perturbed by variants using an objective-scoringschema will help to capture the entire set of features perturbedby a functional variant Comparative analysis of the sequenceand structural features of the wild type and the variant se-quences will also enhance the understanding of the impact ofmultiple coding variants within a protein

Gain and loss of PTMs owing to SNVs

Amino acids in a mature peptide chain are targets of variedPTM [97] events PTMs include phosphorylation methylationacetylation amidation hydroxylation sulfation lipidation

glycosylation and palmitoylation [98] Several studies have nowcompiled and assessed the impact of phosphorylation in cancer[97] and have reported mutational landscapes of various PTMevents including gain or loss of particular PTMs owing to SNVs(for example gain of glycosylation [99] gain and loss of phos-phorylation in cancer [100]) Pan-cancer [101] and proteomewide studies [102] have also assessed the impact of variety ofPTMs Several tools and databases are currently available toassess impact of individual PTMs owing to SNVs [refs]However tools that can provide complete in silico profiling ofmutations will help the researchers to identify PTM-relevantmutations and use the information to assess therapeuticstratifications

Coding SNVs in unassigned regions of proteins

An unassigned region refers to the segments in proteins withno known functional domain assignments [84] Human proteinshave a variable degree of unassigned regions and small-un-assigned regions are often defined as the linker regions betweentwo domains (Figure 2i) We surveyed the human proteomeusing Pfam-based domain annotations to understand the un-assigned regions in the human proteome We computed as-signed and unassigned regions (in percentage) for 20 242 proteinsequences from the SwissProt database (reviewed sequences)Pfam-A-based domain assignments were retrieved for 20 137 se-quences A subset of 105 proteins was excluded from the ana-lysis owing to overlapping domain assignments Of the 20 137sequences that were analyzed (Figure 4A) one protein (haloaciddehalogenase-like hydrolase domain-containing protein 2) wasassigned with a conserved domain over its entire length 3234sequences were completely unassigned and the rest of the pro-teins had varying segments of unassigned regions (mean5387 SD63094) Current approaches in variant annotationand interpretation are primarily focused on highly conservedglobular domains Given that domains are presently assigned toonly 50 of the human proteome analytical methods such asPrediction of Unassigned Regions (PURE) that use intermediatesequence search techniques for domain assignments [103] orsimilar approaches that use sensitive sequence search protocols

Table 5 Examples of the functional variants located in sequence features perturbing diverse functional and structural effects in proteins

Variant location Description URL

Protein domain DMDM A database that compiles domain mapping of dis-ease mutations have information about 202 507 muta-tions associated with 10 919 domains (compiled fromCDD Pfam COG and SMART databases)

httpbioinfumbcedudmdm

Phosphorylation site Mutation of an AKT phosphorylation site of human B-raf httpwwwncbinlmnihgovpubmed15791648Propeptide Mutation in the von Willebrand factor (VWF) propeptide

affects the oligomerizationhttpwwwncbinlmnihgovpubmed20335223

Signal peptide Mutation in signal peptide of ADAMTS10 influence secre-tion of full-length enzyme

httpwwwncbinlmnihgovpubmed18567016

Active site Mutation in the active site of human deoxycytidine kinaseaffects the substrate specificity

httpwwwncbinlmnihgovpubmed18361501

Linear motif Linear motifs mediate functional diversity of transcriptvariants

httpwwwncbinlmnihgovpubmed22638587

Structural motif Heterozygous missense mutation of a spatially distributedstructural motif in human connexin (GJB3) gene causeerythrokeratodermia variabilis

httpwwwncbinlmnihgovpubmed9843209

Subcellular localization Missense mutations in the NPHS2 gene altering the traffick-ing of nephrin to the plasma membrane

wwwncbinlmnihgovpubmed15496146

Interpreting functional effects of coding variants | 9

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 10: challenges in proteome-scale prediction, annotation and assessment

Table 6 Tools for predicting various sequence and structural features

Name Description URL

3dswap-pred Classify a protein sequence as domain-swapping ornon-domain swapping using an SVM model

httpcapsncbsresin3dswap-predindexhtml

AAIndex An amino acid index is a set of 20 numerical values rep-resenting various physico-chemical and biochemicalproperties of amino acids

httpwwwgenomejpaaindex

Bioinformatics LinkDirectory (Protein)

Extensive list of tools for prediction of protein sequencefeatures structure features and function

httpbioinformaticsca links_directorycategoryprotein

dbPTM Comprehensive resource for protein PTMs httpdbptmmbcnctuedutwDISOPRED Dynamically disordered protein chains do not have sta-

ble secondary structures and have high flexibility insolution Disordered regions also play critical roles inprotein function

httpbioinfcsuclacukdisopred

ELM Eukaryotic Linear Motif server httpelmeuorgEris Eris server computes the change of the protein stability

induced by mutations using structural datahttpdokhlabuncedutoolserisindexhtml

FoldAmyloid Method for predicting of amyloidogenic regions fromprotein sequence

httpbioinfoprotresrufold-amyloidogacgi

FoldX FoldX can be used to find interactions contributing tothe stability of proteins and protein complexes usingstructural data

httpfoldxcrges

Globplot Globplot can predict disordered regions in proteinsequence

httpglobplotemblde

H-Predictor Predict hinge regions involved in protein oligomeriza-tion via the domain-swapping mechanism fromstructural data

httptrollmeduncedudokhlabindexphpSpecialHpredictor

Harmony Substitution and propensity score-based protein struc-ture assessment algorithm

httpcapsncbsresinharmony

HORI Webserver for prediction of higher order residue inter-actions in protein structures

httpcapsncbsresinhori

InterPro Integrated database of predictive protein signaturesused for the classification and automatic annotationof proteins and genomes

httpwwwebiacukinterpro

IUPred Prediction of intrinsically unstructured proteins httpiupredenzimhuLIMBO Predicts the amylogenic regions in a protein httplimbovibbeMUPRO Prediction of protein stability changes for single-site

mutations from sequenceshttpmuproproteomicsicsuciedu

NCBI-CDD Extensive protein domain and family annotationdatabase

httpwwwncbinlmnihgovStructurecddcddshtml

Pfam Database of conserved protein domain families httppfamsangeracukPFILT Program to filter various sequence regions including

low-complexity regionshttpbioinfadmincsuclacukdownloadspfilt

PIC Protein interactions calculator httppicmbuiiscernetinProtParam Compute biochemical features like Molecular Weight

Theoretical pI Grand Average of Hydropathy(GRAVY) instability index etc

httpwebexpasyorgprotparam

PSIPRED Secondary structure prediction httpbioinfcsuclacukpsipredPURE Prediction of unassigned regions in proteins httpcapsncbsresinpureSABBLE Relative solvent accessibility prediction httpsablecchmcorgScanProsite Report the functional motifspatterns encoded in the se-

quence Helps to assess the gainloss of functionalsites owing to mutation

httpprositeexpasyorgscanprosite

SignalP Predicts the presence and location of signal peptidecleavage sites in amino-acid sequences

httpwwwcbsdtudkservicesSignalP

SMART Simple modular architecture research tool for assigningdomains to protein

httpsmartembl-heidelbergde

TANGO Predicts the aggregation-prone regions in a protein httptangocrgesTargetP Predicts the subcellular location of eukaryotic proteins httpwwwcbsdtudkservicesTargetPUniProtKB Catalog of information on proteins httpwwwuniprotorgWALTZ Predicts the aggregation-prone regions in a protein httpwaltzvubacbe

10 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 11: challenges in proteome-scale prediction, annotation and assessment

(Table 5) may potentially help investigate the evolutionary andfunctional roles of variants in unassigned regions

Impact of coding SNVs in low-complexityregions

Low-complexity regions (LCRs) in the protein universe refer to astretch of amino acids with low Shannon entropy (leucine-richdomains or poly-alanine tracts) Unlike linear motifs whichhave a specific function and sequence signature the individualfunctions of LCRs are poorly characterized LCRs do not adopt adefinite secondary structure but may exist as solvent-exposedamino acids in the coiled or disordered regions in proteins LCRsare observed in functionally diverse proteins and in both eu-karyotes and prokaryotes The predominant functions of LCRsinclude promoting mRNA stability and mediating a diverse set

of proteinndashprotein interactions [104] We scanned the referencehuman proteome (reviewed subset of 20 237 sequences) usingPFILT [105] and observed that 1432 (nfrac14 2899) protein se-quences have LCRs with a median length of 16 amino acids(Figure 4B) A recent study showed that a functional variantlocalized to an LCR in Nance-Horan Syndrome (NHS) gene wasassociated with clinical features of NHS including cardiacanomaly [106] In the absence of direct functional informationto interpret the impact of functional variants scanning proteinsequences with an LCR prediction tool like PFILT is highly rec-ommended to investigate the probable gain or loss of LCRsowing to the functional variations

Coding SNVs and intrinsically disorderedregions in proteins

Proteins are believed to be functional when the structure attainsits definite globular fold [107] Recent studies however suggestthat proteins may perform their functions even when not in afully folded state Such proteins or regions within proteins thatexist in a stable conformation without attaining a definite struc-tural conformation are generally referred to as intrinsically dis-ordered proteins (IDPs) or proteins with intrinsically disorderedregions (IDRs) or simply disordered proteins [108 109] Disprotis a database that catalogs a curated list of disordered proteinsthat includes 248 experimentally validated human proteinswith disordered regions (httpwwwdisprotorgactionsearchphpkeywordfrac14humanampcriterionfrac14organism) Prediction modelssuggest that 30ndash40 of human proteins are considered to beIDPs or have IDRs and approximately 25 of eukaryotic proteinsare predicted to be fully disordered [108 109]

Regions of the intrinsically disordered segments in proteinshave been found to be functionally important (Figure 5)Disordered regions mediate various functional roles includingprotein binding and proteinndashprotein interactions Previous stud-ies have shown that IDPs can attain a definite structure on bind-ing to their interacting partner and may thus exist in anintermediate stage of disordered (unfolded) to ordered (folded)stage [110 111] Irrespective of the structural plasticity recent

Figure 3 Proportional Venn diagram of genes with coding variants annotated as

polymorphism (Polymorphism nfrac1411 527) disease (Disease nfrac142105) and un-

classified (Unclassified nfrac141910) in Human polymorphisms and disease muta-

tions index Percentages of genes shared between the three groups are provided

Unassigned regions in human proteome

Percentage of unassigned regions in human proteins

Fre

qu

en

cy in

hu

ma

n p

rote

om

e (

Sw

issp

rot

re

vie

we

d s

eq

ue

nce

s)

0 20 40 60 80 100

01

00

02

00

03

00

04

00

0

Low-complexity regions in human proteome

Percentage of LCRs in human proteins

Fre

quency in h

um

an p

rote

om

e (

Sw

isspro

t r

evie

wed s

equences)

0 20 40 60 80 100

0500

1000

1500

2000

2500

3000(a) (b)

Figure 4 Histograms depicting the distribution (in percentage) of (A) unassigned regions and (B) LCRs in the human proteome

Interpreting functional effects of coding variants | 11

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 12: challenges in proteome-scale prediction, annotation and assessment

studies have shown the presence of evolutionarily conservedfunctional motifs within IDPs [112 113]

Recently the influence of disease mutations in the dis-ordered regions was extensively surveyed and the crucialroles of disorder-promoting amino acids in imparting vari-ations in the protein structures were proposed [114 115]Disorder-promoting amino acids are a potential cause ofvariations in protein structures and structural modelsDisorder-to-order transitions are enriched among diseasemutations compared with neutral polymorphisms [116]Currently protein modeling or variant effect prediction al-gorithms do not take the intrinsic disorder characteristicsof a protein into consideration Integrating tools such asDisopred [117] for disorder region prediction into the vari-ant annotation pipelines will improve our understanding ofthe impact of coding SNVs in the disordered regions of aprotein

Influence of coding SNVs on proteinmisfolding domain swapping aggregationmacromolecular crowding and degradation

Diseases have multiple environmental as well as geneticcauses and structural biology is an active area of research tounderstand how changes in individual protein structures orprotein complexes may play a key role in disease manifest-ations Understanding the structural bases of human diseases

would help to identify better targets and design better ligandsfor more effective therapeutic interventions

Protein misfolding and aggregation

Protein folding pathways play a crucial role in mediating cellu-lar homeostasis Defective folding of a protein product is themechanistic basis of several disease phenotypes likeAlzheimerrsquos disease and Parkinsonrsquos disease [118] When non-synonymous mutations lead to the production of misfoldedproteins with aberrant function several gate-keeping pathwaysact on such misfolded proteins to clear them from the cellularenvironment [119] Misfolded proteins targeted for ubiquityla-tion may also lead to protein aggregation Protein aggregation[120ndash123] is defined as a molecular phenomenon where a pro-tein is not cleared from the cellular environment by the normalpathways (Figure 6A) This leads to the aggregation of proteinsin the cellular environment which in turn leads to cellular tox-icity and is considered to be the mechanistic basis of varioushuman diseases such as prion diseases [123 124]

Domain swapping

3D domain swapping (Figure 6B) is a phenomenon observed in asubset of proteins where intermolecular interactions arereplaced by intramolecular interactions [125 126] Domainswapping is also recognized as a mechanism for forming pro-tein aggregates via open-ended mode Domain swapping is

Figure 5 Examples of IDRs in human proteins (A) Breast cancer type 1 susceptibility protein (BRCA1) (B) Cellular tumor antigen p53 (Oncoprotein p53) (C) T-cell sur-

face glycoprotein CD4 (Disprot) (D) Secondary structure information (UniProtKB) and (E) mutation histogram (COSMIC) of BRCA1 is provided to illustrate mutations in

the unstructured regions

12 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 13: challenges in proteome-scale prediction, annotation and assessment

observed in a variety of therapeutically important proteins andis considered to be the mechanism mediating deposition dis-eases such as neurodegenerative diseases and Alzheimerrsquos dis-ease that are caused by conformational perturbations [127 128]Missense mutations in the human phenylalanine hydroxylasehave been shown to infleunce aggregation and degradationproperties [129] Functional mutations are also considered to bea causative factor for proteins to adopt swapped conformationfrom monomers to higher oligomers [130ndash132] A recent study[128] that surveyed a subset of human proteins involved in 3Ddomain swapping suggested that swapping is not only confinedto conformational diseases it is also associated with severalkey biological pathways and also plays a role in mediating di-verse diseases in humans Another study [133] that investigatedthe structural properties of proteins involved in 3D domainswapping suggested that 10 of protein folds and 5 of proteinfamilies include domain-swapped structures

Macromolecular crowding

Proteins interact with a variety of small molecules and othermacromolecules to perform a specific functional task The qua-ternary assembly of a protein is influenced by the importantyet understudied phenomenon of macromolecular crowding

(117 118) Protein function is driven by interactions mediated byinter- and intra-chain amino acids and mutations in the sur-face amino acids may affect macromolecular homeostasisMutagenesis studies have shown the impact of functional mu-tations on proteinndashprotein interfaces (119) oligomerization (120)and stability (121) Mutations that influence the intra-chaininteractions may influence the crowding effect in the cellularenvironment Probing the impact of functional mutations onsuch effects using theoretical models and molecular dynamicsimulation studies is likely to enhance the understanding of therelationship between coding SNVs and macromolecular crowd-ing A list of selected tools and databases available for the pre-diction of misfolding protein aggregation and domainswapping is summarized in Table 5

Protein degradation

Targeted biochemical studies have revealed that the proteindegradation pathway plays an important role in the clearanceof the mutated proteins to reduce cellular toxicity For examplemutant Cu Zn-superoxide dismutase associated with amyo-trophic lateral sclerosis was cleared by macroautophagy path-way that includes proteasomal cleavage phenylalaninehydroxylase (109) and NAD(P)H quinone oxidoreductase 1 (156)

(a) (b)

Figure 6 Schematic representation of the impact of functional mutation on protein misfolding folding aggregation domain swapping macromolecular crowding and

protein degradation pathways Structure of misfolded rat CD2 structure (PDB ID 1A6P) and normal CD2 (PDB ID 1HNG) is used for representing misfolded and folded

structures Functional variant is represented using red asterisk

Interpreting functional effects of coding variants | 13

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 14: challenges in proteome-scale prediction, annotation and assessment

were also cleared by similar mechanisms Functional variantsplay a crucial role in aggregation and protein degration path-ways [134 135] Computational approaches that can predict thedegradation properties of mutated proteins using sequence orstructural information will be useful for rapid characterizationof functionally active proteins Integrated models that combinefolding misfolding aggregation 3D domain swapping macro-molecular crowding and degradation pathways (Figure 6) in asystems biology approach are likely to provide additional in-sights into the regulation of these important mechanisms andthe role of coding SNVs in mediating such mechanisms

Coding SNVs and metamorphic proteins

The molecular paradigm of sequence-structure-function sug-gests that a diverse set of sequences could attain a similarstructural fold that may lead to a functional convergence Whilethis is generally true for the protein universe occasionally devi-ations are observed Metamorphic proteins refer to a relativelynew class of proteins in which a given sequence has beenshown to attain different folded conformations under nativeconditions while performing distinct functions [136ndash138]Metamorphic proteins such as human chemokine lymphotactinhave been shown to influence evolutionary transitions of struc-ture and coding SNVs which could play an important role inswitching between two or more folds and functions for a singlesequence [139] Computational approaches to catalog and pre-dict the metamorphic properties and the impact of coding SNVson metamorphic proteins would help us understand how muta-tions drive functional plasticity at the proteome level

Impact of coding SNVs on the transcriptomicdiversity

Dynamic features of the human proteome are driven by tran-script diversity and the average number of characterized tran-scripts per gene is rapidly expanding The recent adoption ofRNA-Seq as a tool for expression profiling has led to the charac-terization of a large number of novel transcripts includingfusion transcripts [140] Several novel tissue-specific orcell-type-specific transcripts were reported from RNA-Seqexperiments The cellular compartment-based functional rolesof such transcripts are determined by alternative splicingevents [141] RNA editing is a phenomenon where a pre-mRNAmolecule is altered through a chemical change in the basemakeup thereby adding to the diversity of the transcriptomeRNA editing events occur via two distinct mechanisms of sub-stitution editing and insertiondeletion editing leading to func-tional diversity Fusion transcripts [142] and RNA editing eventsare also associated with various diseases including prostatecancer and amyotrophic lateral sclerosis [143] Computationalapproaches are available for the identification of fusion tran-scripts [144] and RNA editing events [145ndash147] from sequencingdata Understanding the precise roles and the impact of variantson such a diverse set of transcripts using computationalapproaches is a challenging task Ascribing the functional roleand assessing the impact of coding SNVs for mediating noveland fusion transcripts using bioinformatics approaches is anemerging problem and being addressed in recent studies [148]Recently we designed an integrated method to simultaneousanalyses of genome and transcriptome from RNA-seq data TheeSNV-Detect method can precisely capture genetic variation(genotypes) from RNA-seq data and helps to design cost-

effective and sustainable experimental strategies [149] Analyticand interpretation strategies that rely on multiple data-typeswould provide greater confidence for variant calling annotationand interpretation and thus actionability for clinicalinterventions

Functional impact of synonymous variations

Synonymous mutations are defined as mutations that result ina variation at the DNA level that code for the same amino acidin the protein level owing to codon degeneracy Synonymousvariants may either be coding or noncoding and the functionalroles of such variants are emerging A recent study [150] sug-gests that introns are involved in functional mechanisms andintrons with positional conservation across eukaryotic lineageare classified as functional introns A comparative analysis ofsynonymous and non-synonymous variants associated withcomplex diseases has shown similar likelihood and effect sizewith disease association [151] Synonymous mutations mayalso influence the introns regulating gene expression [152] A re-cent report also suggests that the coding exons function as tis-sue-specific enhancers and synonymous variations in suchenhancer sites may influence the expression level of certaingenes [153] Synonymous variants could influence the expres-sion of intronic noncoding RNAs [154] perturb the transientproteinndashDNA [155 156] and proteinndashRNA interactions in the cellSuch perturbations can lead to diseases such as cancer neuro-logical disorders and cardiac disease [157] Developing bioinfor-matics approaches to explore the putative impact ofsynonymous mutations on different layers of the coding andnoncoding genome and their relationships will be an importantaspect in the detailed interpretation of the genomic variantsThe detailed structural exploration of proteinndashDNA [158] andproteinndashRNA interactions [159] will help to precisely map themechanistic bases of the loss or gain of interactions owing tosynonymous variations

Impact of coding SNVs on the interactomeof a protein

The individual interactome of proteins varies to a great extent[160] Proteins often interact with nucleotides (proteinndashDNA orproteinndashRNA interactions) proteins (protein complexes obligateor non-obligate) [161] and proteinndashprotein interactions (transientor permanent) [107]) and small molecules (molecular reactionsenzymatic reactions metabolic pathways) [162 163] CodingSNVs play an important role in mediating such interactions Aninteractome of a protein can be defined using information gath-ered from high-throughput experiments that systematicallyidentify interactants of proteins Public protein-protein databaseslike BioGRID [164] and STRING [165] provide large data sets forderiving a protein interactome Network-level investigations tounderstand the category of interactors that are perturbed owingto coding SNVs will help delineate the impact of such mutationson the protein interactomes Using network measures (centralitydegree stress betweenness closeness cliques radiality transi-tivity reciprocity assortativity structural equivalence networkheterogeneity network density clustering coefficients neighbor-hood connectivity shared neighbors network topology etc) asquantitative parameters in variant annotation pipelines will helpthe researchers gain better insights into the impact of codingSNVs on the protein interactome Biologically relevant networkproperties such as network modularity [166] network fragility

14 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 15: challenges in proteome-scale prediction, annotation and assessment

[166] and lethality [166] can also be derived from an interactomeIncorporating a comparative network analysis framework thatcompares wild-type and variant interactomes will help quantifythe impact of coding SNVs on a network scale Methods such asVCF2Networks employ genotype networks (ie all genotypesassociated with a single phenotype) to understand the relation-ships between the genotype space and clinical or biological phe-notypes [167] Table 7 summarizes the available tools toinvestigate gene lists and to derive global functional trends andnetwork properties

Pathway-level impact of coding SNVs

Multiple functional genomics studies have investigated the per-turbations of pathways as a result of mutations in diseasessuch as adenocarcinoma (MAPK signaling p53 signaling Wntsignaling cell cycle and mammalian target of rapamycin path-ways) [168] childhood acute lymphoblastic leukemia (RAS path-way) [169] colorectal carcinoma (WNT signaling pathway) [170]etc Several SNP-centric approaches for pathway-level inferencehave been designed for the interpretation of SNPs identifiedfrom GWA studies or differentially expressed genes from gene-expression studies [171ndash175] Coding SNVs may also influencethe cross talk between various signaling pathways Currentvariant interpretation algorithms strive to identify pathways

associated with genes harboring coding SNVs yet a detailedunderstanding of the impact of variants on biological pathwaysand pathway cross talks [176] is often lacking Incorporatinganalytical routines to quantify the effect of functional muta-tions on pathways and pathway cross talk will be useful in in-terpreting functional variants from a biological perspective

A classification schema for annotatingcoding SNVs

Identifying the entire spectrum of molecular perturbationsowing to coding variations at the level of sequence structureinteraction and function of proteins is considered the basis ofvariant interpretation Multiple automated prediction tools thatcan assess the functional effect of mutations are currentlyavailable [See Table 6] Several variant annotation tools focuson the task of predicting the type or effect of mutations provideextended annotations from precomputed databases or lookuptables and map a coding variant to its corresponding gene pro-tein functional domain or signaling pathway Theseapproaches have several limitations because additional layersof protein-centric information that can be derived from the pre-diction or computation of sequence-based features with codingvariants are lacking To deal with this challenge and to designbroad-spectrum tools for deep variant interpretation we

Table 7 Software libraries and tools for biological network analysis

Name Description URL

Bio4J Bio4j is a bioinformatics graph-based database httpwwwbio4jcomBioConductor (Graphs

and Networks view)Collection of BioConductor modules for biological net-

work analysis and visualizationhttpwwwbioconductororgpackagesrelease

BiocViewshtml___GraphsAndNetworksCytoscape Open source platform for complex network analysis and

visualization with a large collection of plug-ins forbiological network analysis

httpwwwcytoscapeorg

DAVID Integrated functional annotation tool httpdavidabccncifcrfgovhomejspFunDO Functional Disease Ontology server httpdjangonubicnorthwesternedufundogene2pathway R package for prediction of KEGG pathway membership

for individual genes based on InterPro domainsignatures

httpwwwbioconductororgpackagesreleasebiochtmlgene2pathwayhtml

GeneAnswers R package for biological or medical interpretation of thegiven one or more groups of genes by means of statis-tical test

httpwwwbioconductororgpackagesreleasebiochtmlGeneAnswershtml

GO Tools Tools for analysis of GO T term enrichment statisticalanalysis semantic similarity and functional similar-ity using GO terms derived from gene lists

httpwwwgeneontologyorgGOtoolsshtml

Gephi Open-source graph visualization and analysis software httpwwwgephiorgGremlin Graph-based programming language httpsgithubcomtinkerpopgremlinwikiiGraph Network analysis and visualization library in C Also

available R package and a Python extensionhttpigraphsourceforgenet

KEGGgraph R package for analysis of KEGG pathways httpwwwbioconductororgpackages210biochtmlKEGGgraphhtml

LEDA A broad-spectrum Cthornthorn class library for efficient datatypes and algorithms including large-scale networkanalysis

httpwwwalgorithmic-solutionscomledaaboutindexhtm

Neat Web-based network analysis tools httpgephiorgOntology Analysis

plugins for Cytoscapehttpchiantiucsdeducyto_webplugins Plugins for functional enrichment analysis using

network dataPANTHER Classification of genes and proteins httppantherdborgReactome Curated knowledgebase of molecular events and

pathwayshttpwwwreactomeorg

TargetMine Integrates different types of biological data and enableflexible queries export results and analyze lists ofdata

httptargetminemizuguchilaborg

Interpreting functional effects of coding variants | 15

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 16: challenges in proteome-scale prediction, annotation and assessment

recommend a three-level annotation schema for the interpret-ation of coding variants (Figure 7) Primary level annotation pro-vides an overview of the types of mutations and associationannotations Secondary level annotation enables the systematicinvestigation of multiple variants in a single gene Tertiary-levelannotations help to find the global characteristics of genes har-boring coding variants and various properties in a network-scale

Level 1 Primary annotation of coding SNVs

Tools that use position information from VCF files to derive thetype of mutation (synonymous or non-synonymous) the loca-tion of the corresponding gene and its mapping onto several an-notation resources using positional data may be viewed aslsquoprimary annotationsrsquo See Table 2 for a list of tools that useminimum input data to gather a diverse set of annotationsusing database lookup tables and identifier mapping

Level 2 Secondary annotation of coding SNVs- Impactof multiple variants in a single gene

Sequence and structural explorations of coding variants arereferred to as secondary annotations VCF files can have mul-tiple variants of the same gene associated with a given pheno-type or disease A systematic investigation of multiple variantsin a single gene and an assessment of the relative impact of cod-ing variants could help in classifying multiple variants in a geneand help to delineate the variants as pathogenic moderatelypathogenic or VUS Comparative analysis of wild-type and vari-ant sequences can be performed and quantified using the totalnumber of gains or losses in the sequence features Based on the

availability of the structural data a given variant can be modeledin a protein structure using in-silico mutagenesis experimentsIn the absence of experimental protein structural data a proteinstructure model can be obtained from the homology model data-base or a new model can be built using homology modelingapproaches Once the wild type and mutant type structures areobtained the impact of variation can be rapidly computed usingstructural feature analyses or using computationally expensivemolecular dynamics simulations Table 5 summarizes key se-quence and structure feature prediction tools

Level 3 Tertiary annotation of coding SNVs- Impact ofcoding SNVs on multiple genes pathways andinteractome

The global impact of multiple coding variants on a genome orexome can be assessed using a combination of both knowledge-based enrichment or depletion analysis and network or interac-tome analyses Functional profiling using GO terms have beenused as an effective approach for characterizing collective func-tional characteristics of a perturbed set of genes [177 178]Gene-list-based analyses can provide biologically relevant infor-mation about the variants Enrichment analysis is not just con-fined to a priori defined gene sets or GO annotationsenrichment can be performed using several types of annota-tions and could provide insights into the probable functional as-sociations of genes with coding variants Enrichment analysisusing protein annotations would help to identify functionallysignificant protein domains protein classes families (mem-brane proteins or kinases etc) protein superfamilies (angioten-sin receptors or G-protein coupled receptors) or protein folds(Rossman fold beta-propeller) associated with proteins

Figure 7 Three-level schema for annotation of functional variants

16 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 17: challenges in proteome-scale prediction, annotation and assessment

harboring coding variants Furthermore it is possible to identifyenriched molecular events mediated by genes enriched TFBS inthe upstream of genes and biological pathways mediated by thegenes using annotations from Reactome UCSC and KEGG repo-sitories Disease ontology can be used to find known genendashdisease associations Following the knowledge-based analysis anetwork-level analysis of genes harboring coding SNVs can beperformed using publicly available network analysis tools suchas Cytoscape and Gephi

Future directions

The rapidly increasing availability of sequence structure func-tional and interaction information offers an attractive means toobtain a detailed characterization of coding variants Howeverthe use of large data sets to systematically explore the functionalvariants is currently limited owing to the gaps in the availablesequencendashstructurendashfunction interaction data Sequence dataare available for the entire human proteome but the availabilityof protein structural data experimentally verified functional as-sociations and biomolecular interaction data is limited Theavailability of structural data of homologous protein structuresis a prerequisite for structural investigations of functional vari-ants and its impact on the structural environment Availabilityof high-quality interaction data will also help analyze the cellu-lar networks modulated by proteins harboring coding variantsThe expansion of structural data sets using structural genomics[179] homology modeling [180] and the application of computa-tional approaches such as genomics-aided structure prediction[181] is likely to lower the increasing chasm between sequencestructure function and interaction data and may help to ascer-tain the impact of variants at the biomolecular level [163 182183] Improving the annotation databases using high-qualitycurated data and integration with cloud-based or stand-alonepipelines [184ndash189] can also help facilitate such efforts

The current version of the Bioinformatics Link Directory en-lists gt900 bioinformatics tools that are capable of processing dif-ferent types of protein data (sequence structure interactionquantification and annotation) Unifying multiple resources andenabling programmatic access via application program interfacesand web services for rapid integration will significantly enhancethe efficacy of variant annotation pipelines The development ofnovel statistical methods that can quantify various residue-specific properties would enable a comparative analysis of wild-type and variant sequences and can thereby help to facilitate theidentification and prioritization of functional mutations The in-clusion of a diverse set of sequence structure and network fea-tures into a graphically depicted database and an assessment ofthe impact of various data types (sequence structure functionand interaction) using probabilistic or machine learning modelswould also help automate variant annotation interpretation

From a sequence perspective understanding the precisefunctional role of LCRs unassigned regions and disordered re-gions as well as the relationship between these features and cod-ing SNVs is a key step in variant annotation and interpretationIdentifying deviations in protein structural space that are likelyto lead to misfolding domain swapping aggregation degrad-ation or deviations that are metamorphic in nature owing to theimpact of coding SNVs is an emerging challenge Several algo-rithms are available to predict sequence or structural domainsassign distantly related domains to unassigned regions func-tional motifs structural motifs and structural features proteindisorder and aggregation propensities However algorithms andtools to predict 3D domain swapping or to analyze metamorphic

properties degradation pathways and the network-level impactof coding SNVs are less abundant This may be attributed to thefact that these concepts are continually evolving in the proteinuniverse and concerted efforts are needed to understand the im-pact of coding variants of such unique features from both experi-mental and computational biologists alike

This review summarizes the major bioinformatics chal-lenges involved in gaining a deeper understanding of codingSNVs Coding SNVs may impart a molecular or disease pheno-type by interacting with noncoding regulatory parts of the gen-ome and the structural variants may provide an importantcontribution to this phenomenon Analyzing functional andregulatory variants in a single analytical framework may furtherenhance the interpretation of the genomic variants

Conclusions

A rapid decline in sequencing costs using NGS technologies hasled to an exponential increase in the frequency of the sequenc-ing projects over the past decade A large number of personalgenomes and exomes clinical samples and cancer genomes arebeing sequenced as part of large-scale collaborative projectsThe identification of a plethora of sequence variants associatedwith diverse molecular and disease phenotypes Scalable com-putational approaches that integrate annotations using se-quence structure functional and interaction data are necessaryfor the rapid interpretation of coding SNVs We have discussedthe bioinformatics resources available for the prediction anno-tation and visualization of coding SNVs summarized the majorbioinformatics challenges into 10 different themes for a deeperinterpretation of coding SNVs and recommended a three-levelschema to assess the phenotypic impact of functional vari-ations on individual protein sequences structures differentfunctional categories biological pathways and interactomesWe envisage that addressing the key challenges discussed inthis review and adopting a comprehensive annotation schemafor variant annotation could improve genomic reports that aregenerated as part of genomic medicine investigations and ex-perimental studies to better understand the variations impli-cated in rare common and complex disease manifestations

Key Points

bull Interpretations of variants identified from next-gener-ation sequencing pose several challenges

bull Systems-level experimental investigation of the func-tional variants is expensive and time-consumingefficient computational techniques are required toidentify the impact of functional variants

bull We recommended an integrated approach that com-bines multiple data types and tools for the predictionannotation and visualization of functional variantsand we have proposed a systematic approach for func-tional variant annotation and interpretation

bull Significant challenges that need to be negotiated dur-ing the interpretation of coding single nucleotide vari-ants are presented with the help of various examples

bull A three-level annotation approach that combines theinformation at the level of an individual variant mul-tiple variants in a single protein and global trends ofmultiple genes harboring multiple variants is proposedfor an effective interpretation of coding single nucleo-tide variants

Interpreting functional effects of coding variants | 17

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 18: challenges in proteome-scale prediction, annotation and assessment

Acknowledgements

KS and JTD would like to thank Icahn Institute forGenomics and Multiscale Biology (httpicahnmssmedudepartments-and-institutesgenomics) Mount Sinai HealthSystem for infrastructural support KS would like to ac-knowledge Michael J Zimmermann and Jean-Pierre Kocher(Mayo Clinic Rochester) for useful discussions The MayoClinic Bioinformatics Core provided infrastructure and ana-lytical support to KRK RS would like to thank NationalCentre for Biological Sciences (TIFR) for infrastructuralsupport

Funding

KS and JTD was supported by grant R01-DK098242-03from the National Institute of Diabetes and Digestive andKidney Diseases KRK is funded in part by the Mayo ClinicCenter for Individualized Medicine the PharmacogenomicsResearch Network (PGRN) and the Mayo Clinic BreastSpecialized Program of Research Excellence (SPORE)

References1 Lander ES Initial impact of the sequencing of the human

genome Nature 2011470187ndash972 Mardis ER A decadersquos perspective on DNA sequencing tech-

nology Nature 2011470198ndash2033 Green ED Guyer MS Charting a course for genomic medi-

cine from base pairs to bedside Nature 2011470204ndash134 Maxmen A Exome sequencing deciphers rare diseases Cell

2011144635ndash75 Wheeler DA Srinivasan M Egholm M et al The complete

genome of an individual by massively parallel DNAsequencing Nature 2008452872ndash6

6 Weatheritt RJ Davey NE Gibson TJ Linear motifs conferfunctional diversity onto splice variants Nucleic Acids Res201240123ndash31

7 Goldstein DB Common genetic variation and human traitsN Engl J Med 20093601696ndash8

8 Manolio TA Collins FS Cox NJ et al Finding the missing her-itability of complex diseases Nature 2009461747ndash53

9 Genome-wide association study of 14 000 cases of sevencommon diseases and 3 000 shared controls Nature2007447661ndash78

10 Kullo IJ Ding K Shameer K et al Complement receptor 1gene variants are associated with erythrocyte sedimenta-tion rate Am J Hum Genet 201189131ndash8

11 Welter D MacArthur J Morales J et al The NHGRI GWASCatalog a curated resource of SNP-trait associations NucleicAcids Res 201442D1001ndash6

12 Leslie R OrsquoDonnell CJ Johnson AD GRASP analysis of geno-type-phenotype results from 1390 genome-wide associationstudies and corresponding open access databaseBioinformatics 201430i185ndash94

13 Nelson MR Wegmann D Ehm MG et al An abundance ofrare functional variants in 202 drug target genes sequencedin 14 002 people Science 2012337100ndash4

14 MacArthur DG Balasubramanian S Frankish A et al A sys-tematic survey of loss-of-function variants in human pro-tein-coding genes Science 2012335823ndash8

15 Raychaudhuri S Mapping rare and common causal allelesfor complex human diseases Cell 201114757ndash69

16 Stankiewicz P Lupski JR Structural variation in the humangenome and its role in disease Annu Rev Med201061437ndash455

17 McCarroll SA Altshuler DM Copy-number variation and as-sociation studies of human disease Nat Genet200739S37ndash42

18 Zhang F Gu W Hurles ME et al Copy number variation inhuman health disease and evolution Annu Rev GenomicsHum Genet 200910451ndash81

19 Medvedev P Stanciu M Brudno M Computational methodsfor discovering structural variation with next-generationsequencing Nat Methods 20096S13ndash20

20 Sherry ST Ward MH Kholodov M et al dbSNP the NCBI data-base of genetic variation Nucleic Acids Res 200129308ndash11

21 Rios D McLaren WM Chen Y et al A database and API forvariation dense genotyping and resequencing data BMCBioinformatics 201011238

22 Iafrate AJ Feuk L Rivera MN et al Detection of large-scale variation in the human genome Nat Genet200436949ndash51

23 Ashley EA Butte AJ Wheeler MT et al Clinical assess-ment incorporating a personal genome Lancet20103751525ndash35

24 Chen R Mias GI Li-Pook-Than J et al Personal omics profil-ing reveals dynamic molecular and medical phenotypesCell 20121481293ndash307

25 Lupski JR Reid JG Gonzaga-Jauregui C et al Whole-genomesequencing in a patient with Charcot-Marie-Tooth neur-opathy N Engl J Med 20103621181ndash91

26 Li Y Vinckenbosch N Tian G et al Resequencing of200 human exomes identifies an excess of low-fre-quency non-synonymous coding variants Nat Genet201042969ndash72

27 Clarke L Zheng-Bradley X Smith R et al The 1000 GenomesProject data management and community access NatMethods 20129459ndash462

28 Schunkert H Konig IR Kathiresan S et al Large-scale associ-ation analysis identifies 13 new susceptibility loci for coron-ary artery disease Nat Genet 201143333ndash8

29 Helgadottir A Thorleifsson G Manolescu A et al A commonvariant on chromosome 9p21 affects the risk of myocardialinfarction Science 20073161491ndash3

30 Visel A Zhu Y May D et al Targeted deletion of the 9p21non-coding coronary artery disease risk interval in miceNature 2010464409ndash12

31 Soranzo N Rendon A Gieger C et al A novel variant onchromosome 7q223 associated with mean platelet volumecounts and function Blood 20091133831ndash7

32 Paul DS Nisbet JP Yang TP et al Maps of open chromatinguide the functional follow-up of genome-wide associationsignals application to hematological traits PLoS Genet20117e1002139

33 Hull J Campino S Rowlands K et al Identification of com-mon genetic variation that modulates alternative splicingPLoS Genet 20073e99

34 Albert FW Kruglyak L The role of regulatory vari-ation in complex traits and disease Nat Rev Genet201516197ndash212

35 Battle A Khan Z Wang SH et al Genomic variation Impactof regulatory variation from RNA to protein Science2015347664ndash7

36 den Dunnen JT Antonarakis SE Nomenclature for the de-scription of human sequence variations Hum Genet2001109121ndash4

18 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 19: challenges in proteome-scale prediction, annotation and assessment

37 Eilbeck K Lewis SE Mungall CJ et al The sequence ontologya tool for the unification of genome annotations Genome Biol20056R44

38 Landrum MJ Lee JM Riley GR et al ClinVar public archive ofrelationships among sequence variation and human pheno-type Nucleic Acids Res 201442D980ndash5

39 Kircher M Witten DM Jain P et al A general framework forestimating the relative pathogenicity of human genetic vari-ants Nat Genet 201446310ndash15

40 Kumar P Henikoff S Ng PC Predicting the effects of codingnon-synonymous variants on protein function using theSIFT algorithm Nat Protoc 200941073ndash81

41 Adzhubei I Schmidt S Peshkin L et al A method and serverfor predicting damaging missense mutations Nat Meth20107248ndash9

42 Gonzalez-Perez A Lopez-Bigas N Improving the assessmentof the outcome of nonsynonymous SNVs with a consensusdeleteriousness score Condel Am J Hum Genet 201188440ndash9

43 Grantham R Amino acid difference formula to help explainprotein evolution Science 1974185862ndash4

44 Cooper GM Stone EA Asimenos G et al Distribution and in-tensity of constraint in mammalian genomic sequenceGenome Res 200515901ndash13

45 Pollard KS Hubisz MJ Rosenbloom KR et al Detection ofnonneutral substitution rates on mammalian phylogeniesGenome Res 201020110ndash21

46 Siepel A Bejerano G Pedersen JS et al Evolutionarily con-served elements in vertebrate insect worm and yeast gen-omes Genome Res 2005151034ndash50

47 Dayhoff MO Eck RV Park CM Model of evolutionary changein proteins In MO Dayhoff (ed) Atlas of Protein Sequence andStructure Washington DC National Biomedical ResearchFoundation 1972 89ndash99

48 Henikoff S Henikoff JG Amino acid substitution matricesfrom protein blocks Proc Natl Acad Sci USA 19928910915ndash19

49 Altschul SF Gish W Miller W et al Basic local alignmentsearch tool J Mol Biol 1990215403ndash10

50 Altschul SF Madden TL Schaffer AA et al Gapped BLASTand PSI-BLAST a new generation of protein database searchprograms Nucleic Acids Res 1997253389ndash402

51 Sunyaev S Ramensky V Koch I et al Prediction of deleteri-ous human alleles Hum Mol Genet 200110591ndash7

52 Yang Z Bielawski JP Statistical methods for detecting mo-lecular adaptation Trends Ecol Evol 200015496ndash503

53 Ng PC Henikoff S Predicting deleterious amino acid substi-tutions Genome Res 200111863ndash74

54 Wang Z Moult J SNPs protein structure and disease HumMutat 200117263ndash70

55 Thomas PD Kejariwal A Coding single-nucleotide poly-morphisms associated with complex vs Mendelian diseaseevolutionary evidence for differences in molecular effectsProc Natl Acad Sci USA 200410115398ndash403

56 Stone EA Sidow A Physicochemical constraint violation bymissense substitutions mediates impairment of proteinfunction and disease severity Genome Res 200515978ndash86

57 Miller MP Kumar S Understanding human disease muta-tions through the use of interspecific genetic variation HumMol Genet 2001102319ndash28

58 Johnson AD Single-nucleotide polymorphism bioinfor-matics a comprehensive review of resources Circ CardiovascGenet 20092530ndash6

59 Marjoram P Tavare S Modern computational approachesfor analysing molecular genetic variation data Nat Rev Genet20067759ndash70

60 Excoffier L Heckel G Computer programs for populationgenetics data analysis a survival guide Nat Rev Genet20067745ndash58

61 Peterson TA Nehrt NL Park D et al Incorporating molecularand functional context into the analysis and prioritizationof human variants associated with cancer J Am Med InformAssoc 201219275ndash83

62 Kann MG Advances in translational bioinformatics compu-tational approaches for the hunting of disease genes BriefBioinform 20101196ndash110

63 Cooper GM Shendure J Needles in stacks of needles findingdisease-causal variants in a wealth of genomic data Nat RevGenet 201112628ndash40

64 Stitziel NO Kiezun A Sunyaev S Computational and statis-tical approaches to analyzing variants identified by exomesequencing Genome Biol 201112227

65 Alexander RP Fang G Rozowsky J et al Annotating non-coding regions of the genome Nat Rev Genet 201011559ndash71

66 Mort M Evani US Krishnan VG et al In silico func-tional profiling of human disease-associated and poly-morphic amino acid substitutions Hum Mutat 201031335ndash46

67 Hamosh A Scott AF Amberger JS et al Online MendelianInheritance in Man (OMIM) a knowledgebase of humangenes and genetic disorders Nucleic Acids Res200533D514ndash17

68 Stenson PD Mort M Ball EV et al The human gene mutationdatabase 2008 update Genome Med 2009113

69 Forbes SA Bindal N Bamford S et al COSMIC mining com-plete cancer genomes in the Catalogue of SomaticMutations in Cancer Nucleic Acids Res 201139D945ndash50

70 Chen YA Tripathi LP Mizuguchi K TargetMine an inte-grated data warehouse for candidate gene prioritisation andtarget discovery PLoS One 20116e17844

71 Wang K Li M Hakonarson H ANNOVAR functional annota-tion of genetic variants from high-throughput sequencingdata Nucleic Acids Res 201038e164

72 Danecek P Auton A Abecasis G et al The variant call formatand VCFtools Bioinformatics 2011272156ndash8

73 Macintyre G Jimeno Yepes A Ong CS et al Associating dis-ease-related genetic variants in intergenic regions to thegenes they impact Peer J 20142e639

74 Javed A Agrawal S Ng PC Phen-Gen combining phenotypeand genotype to analyze rare disorders Nat Methods201411935ndash7

75 Dai HJ Wu JC Tsai RT et al T-HOD a literature-based candi-date gene database for hypertension obesity and diabetesDatabase (Oxford) 20132013bas061

76 Cheng D Knox C Young N et al PolySearch a web-basedtext mining system for extracting relationships betweenhuman diseases genes mutations drugs and metabolitesNucleic Acids Res 200836W399ndash405

77 Do R Kathiresan S Abecasis GR Exome sequencing andcomplex disease practical aspects of rare variant associ-ation studies Hum Mol Genet 201221R1ndash9

78 Wagner MJ Rare-variant genome-wide association studiesa new frontier in genetic analysis of complex traitsPharmacogenomics 201314413ndash24

79 Genomes Project C Abecasis GR Auton A et al An inte-grated map of genetic variation from 1 092 human genomesNature 201249156ndash65

80 Anfinsen CB The formation and stabilization of proteinstructure Biochem J 1972128737ndash49

81 Anfinsen CB Principles that govern the folding of proteinchains Science 1973181223ndash30

Interpreting functional effects of coding variants | 19

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 20: challenges in proteome-scale prediction, annotation and assessment

82 Fay J Wu C Sequence divergence functional constraintand selection in protein evolution Annu Rev Genomics HumGenet 20034213ndash35

83 Ponting CP Russell RR The natural history of protein do-mains Annu Rev Biophys Biomol Struct 20023145ndash71

84 Ekman D Bjorklund AK Frey-Skott J et al Multi-domain pro-teins in the three kingdoms of life orphan domains andother unassigned regions J Mol Biol 2005348231ndash43

85 Sigrist CJ Cerutti L Hulo N et al PROSITE a documenteddatabase using patterns and profiles as motif descriptorsBrief Bioinform 20023265ndash74

86 Dinkel H Michael S Weatheritt RJ et al ELMndashthe database ofeukaryotic linear motifs Nucleic Acids Res 201240D242ndash51

87 Haberichter SL Budde U Obser T et al The mutation N528Sin the von Willebrand factor (VWF) propeptide causes de-fective multimerization and storage of VWF Blood20101154580ndash7

88 Kutz WE Wang LW Dagoneau N et al Functional analysis ofan ADAMTS10 signal peptide mutation in Weill-Marchesanisyndrome demonstrates a long-range effect on secretion ofthe full-length enzyme Hum Mutat 2008291425ndash34

89 Pugalenthi G Suganthan PN Sowdhamini R et al SMotif aserver for structural motifs in proteins Bioinformatics200723637ndash8

90 Sundaramurthy P Shameer K Sreenivasan R et al HORI aweb server to compute Higher Order Residue Interactions inprotein structures BMC Bioinformatics 201011(Suppl 1)S24

91 Porter CT Bartlett GJ Thornton JM The Catalytic Site Atlas aresource of catalytic sites and residues identified in enzymesusing structural data Nucleic Acids Res 200432D129ndash33

92 Stuart AC Ilyin VA Sali A LigBase a database of families ofaligned ligand binding sites in known protein sequencesand structures Bioinformatics 200218200ndash01

93 Swapna LS Mahajan S de Brevern A et al Comparison oftertiary structures of proteins in protein-protein complexeswith unbound forms suggests prevalence of allostery in sig-nalling proteins BMC Struct Biol 2012126

94 Mathivanan S Ahmed M Ahn NG et al Human Proteinpediaenables sharing of human protein data Nat Biotechnol200826164ndash7

95 Bhaskara RM Srinivasan N Stability of domain structures inmulti-domain proteins Sci Rep 2011140

96 Furnham N de Beer TA Thornton JM Current challenges ingenome annotation through structural biology and bioinfor-matics Curr Opin Struct Biol 201222594ndash601

97 Wang Y Cheng H Pan Z et al Reconfiguring phosphoryl-ation signaling by genetic polymorphisms affects cancersusceptibility J Mol Cell Biol 20157187ndash202

98 Gnad F Gunawardena J Mann M PHOSIDA 2011 the post-translational modification database Nucleic Acids Res201139D253ndash60

99 Nicolaou N Margadant C Kevelam SH et al Gain of glycosy-lation in integrin alpha3 causes lung disease and nephroticsyndrome J Clin Invest 20121224375ndash87

100 Radivojac P Baenziger PH Kann MG et al Gain and loss ofphosphorylation sites in human cancer Bioinformatics200824i241ndash7

101 Pan Y Karagiannis K Zhang H et al Human germline andpan-cancer variomes and their distinct functional profilesNucleic Acids Res 20144211570ndash88

102 Mazumder R Morampudi KS Motwani M et al Proteome-wide analysis of single-nucleotide variations in the N-glyco-sylation sequon of human genes PLoS One 20127e36212

103 Reddy CC Shameer K Offmann BO et al PURE a webserverfor the prediction of domains in unassigned regions in pro-teins BMC Bioinformatics 20089281

104 Coletta A Pinney JW Solis DY et al Low-complexity regionswithin protein sequences have position-dependent rolesBMC Syst Biol 2010443

105 Jones DT Swindells MB Getting the most from PSI-BLASTTrends Biochem Sci 200227161ndash4

106 Chograni M Rejeb I Jemaa LB et al The first missense muta-tion of NHS gene in a Tunisian family with clinical featuresof NHS syndrome including cardiac anomaly Eur J HumGenet 201119851ndash6

107 Jones S Thornton JM Principles of protein-protein inter-actions Proc Natl Acad Sci USA 19969313ndash20

108 Babu MM van der Lee R de Groot NS et al Intrinsically dis-ordered proteins regulation and disease Curr Opin Struct Biol201121432ndash40

109 Dunker AK Silman I Uversky VN et al Function and struc-ture of inherently disordered proteins Curr Opin Struct Biol200818756ndash64

110 Wright PE Dyson HJ Intrinsically unstructured proteins re-assessing the protein structure-function paradigm J Mol Biol1999293321ndash31

111 Dyson HJ Wright PE Coupling of folding and binding for un-structured proteins Curr Opin Struct Biol 20021254ndash60

112 Nguyen Ba AN Yeh BJ van Dyk D et al Proteome-wide dis-covery of evolutionary conserved sequences in disorderedregions Sci Signal 20125rs1

113 Das RK Mao AH Pappu RV Unmasking functionalmotifs within disordered regions of proteins Sci Signal20125pe17

114 Hu Y Liu Y Jung J et al Changes in predicted protein dis-order tendency may contribute to disease risk BMCGenomics 201112(Suppl 5)S2

115 Vacic V Iakoucheva LM Disease mutations in disordered re-gionsndashexception to the rule Mol BioSyst 2012827ndash32

116 Srivastava SK Gayathri S Manjasetty BA et al Analysis ofconformational variation in macromolecular structuralmodels PLoS One 20127e39993

117 Ward JJ Sodhi JS McGuffin LJ et al Prediction and functionalanalysis of native disorder in proteins from the three king-doms of life J Mol Biol 2004337635ndash45

118 Thomas PJ Qu BH Pedersen PL Defective protein folding asa basis of human disease Trends Biochem Sci 199520456ndash9

119 Welch W Role of quality control pathways in human dis-eases involving protein misfolding Semin Cell Dev Biol20041531ndash8

120 Fink AL Protein aggregation folding aggregates inclusionbodies and amyloid Fold Des 19983R9ndash23

121 Thirumalai D Klimov DK Dima RI Emerging ideas on themolecular basis of protein and peptide aggregation CurrOpin Struct Biol 200313146ndash59

122 Khare SD Dokholyan NV Molecular mechanisms of poly-peptide aggregation in human diseases Curr Protein Pept Sci20078573ndash9

123 Soto C Estrada L Castilla J Amyloids prions and the inher-ent infectious nature of misfolded protein aggregatesTrends Biochem Sci 200631150ndash5

124 Prusiner SB Molecular biology and pathogenesis of priondiseases Trends Biochem Sci 199621482ndash7

125 Shameer K Shingate PN Manjunath SC et al 3DSwap cura-ted knowledgebase of proteins involved in 3D domain swap-ping Database (Oxford) 20112011bar042

20 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 21: challenges in proteome-scale prediction, annotation and assessment

126 Bennett MJ Choe S Eisenberg D Domain swapping entan-gling alliances between proteins Proc Natl Acad Sci USA1994913127ndash31

127 Bennett MJ Sawaya MR Eisenberg D Deposition diseasesand 3D domain swapping Structure 200614811ndash24

128 Shameer K Sowdhamini R Functional repertoire molecularpathways and diseases associated with 3D domain swap-ping in the human proteome J Clin Bioinform 201228

129 Waters PJ Parniak MA Hewson AS et al Alterations in pro-tein aggregation and degradation due to mild and severemissense mutations (A104D R157N) in the human phenyl-alanine hydroxylase gene (PAH) Hum Mutat 199812344ndash54

130 OrsquoNeill JW Kim DE Johnsen K et al Single-site mutations in-duce 3D domain swapping in the B1 domain of protein Lfrom Peptostreptococcus magnus Structure 200191017ndash27

131 Seeliger MA Spichty M Kelly SE et al Role of conformationalheterogeneity in domain swapping and adapter function ofthe Cks proteins J Biol Chem 200528030448ndash59

132 Kirsten Frank M Dyda F Dobrodumov A et al Core muta-tions switch monomeric protein GB1 into an intertwinedtetramer Nat Struct Biol 20029877ndash85

133 Huang Y Cao H Liu Z Three-dimensional domain swappingin the protein structure space Proteins 2012801610ndash19

134 Gidalevitz T Wang N Deravaj T et al Natural genetic vari-ation determines susceptibility to aggregation or toxicity ina C elegans model for polyglutamine disease BMC Biol201311100

135 Ross CA Poirier MA Protein aggregation and neurodegener-ative disease Nat Med 200410(Suppl)S10ndash17

136 Murzin A Metamorphic proteins Science 20083201725ndash6137 Shortle D One sequence plus one mutation equals two

folds Proc Natl Acad Sci USA 200910621011ndash12138 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-

teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

139 Yadid I Kirshenbaum N Sharon M et al Metamorphic pro-teins mediate evolutionary transitions of structure Proc NatlAcad Sci USA 20101077287ndash92

140 Pickrell J Marioni J Pai A et al Understanding mechanismsunderlying human gene expression variation with RNAsequencing Nature 2010464768ndash72

141 Matlin AJ Clark F Smith CW Understanding alternative splic-ing towards a cellular code Nat Rev Mol Cell Biol 20056386ndash98

142 Rickman DS Pflueger D Moss B et al SLC45A3-ELK4 is anovel and frequent erythroblast transformation-specific fu-sion transcript in prostate cancer Cancer Res 2009692734ndash8

143 Flomen R Makoff A Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients involve-ment of a cryptic polyadenylation site Neurosci Lett2011497139ndash43

144 Asmann YW Hossain A Necela BM et al A novel bioinfor-matics pipeline for identification and characterization of fu-sion transcripts in breast cancer and normal cell linesNucleic Acids Res 201139e100

145 Bundschuh R Computational prediction of RNA editingsites Bioinformatics 2004203214ndash20

146 Eisenberg E Li JB Levanon EY Sequence based identificationof RNA editing sites RNA Biol 20107248ndash52

147 Levanon EY Eisenberg E Algorithmic approaches for identi-fication of RNA editing sites Brief Funct Genomics Proteomic2006543ndash5

148 McCarthy DJ Humburg P Kanapin A et al Choice of tran-scripts and software has a large effect on variant annota-tion Genome Med 2014626

149 Tang X Baheti S Shameer K et al The eSNV-detect a com-putational system to identify expressed single nucleotidevariants from transcriptome sequencing data Nucleic AcidsRes 201442e172

150 Chorev M Carmel L The function of introns Front Genet2012355

151 Chen R Davydov EV Sirota M et al Non-synonymous andsynonymous coding SNPs show similar likelihood and effectsize of human disease association PLoS One 20105e13574

152 Rose AB Intron-mediated regulation of gene expressionCurr Top Microbiol Immunol 2008326277ndash90

153 Birnbaum RY Clowney EJ Agamy O et al Coding exonsfunction as tissue-specific enhancers of nearby genesGenome Res 2012221059ndash68

154 Nakaya HI Amaral PP Louro R et al Genome mapping andexpression analyses of human intronic noncoding RNAsreveal tissue-specific patterns and enrichment in genesrelated to regulation of transcription Genome Biol 20078R43

155 Rohs R Jin X West SM et al Origins of specificity in protein-DNA recognition Ann Rev Biochem 201079233ndash69

156 Karczewski KJ Dudley JT Kukurba KR et al Systematic func-tional regulatory assessment of disease-associated variantsProc Natl Acad Sci USA 20131109607ndash12

157 Khalil AM Rinn JL RNA-protein interactions in humanhealth and disease Semin Cell Dev Biol 201122359ndash65

158 Jones S van Heyningen P Berman HM et al Protein-DNAinteractions a structural analysis J Mol Biol 1999287877ndash96

159 Jones S Daley DT Luscombe NM et al Protein-RNAinteractions a structural analysis Nucleic Acids Res200129943ndash54

160 Barabasi AL Oltvai ZN Network biology understanding thecellrsquos functional organization Nat Rev Genet 20045101ndash13

161 Nooren IM Thornton JM Diversity of protein-protein inter-actions EMBO J 2003223486ndash92

162 Caspi R Altman T Dreher K et al The MetaCyc database ofmetabolic pathways and enzymes and the BioCyc collectionof pathwaygenome databases Nucleic Acids Res201240D742ndash53

163 Yamada T Bork P Evolution of biomolecular networks les-sons from metabolic and protein interactions Nat Rev MolCell Biol 200910791ndash803

164 Stark C Breitkreutz BJ Chatr-Aryamontri A et al TheBioGRID interaction database 2011 update Nucleic Acids Res201139D698ndash704

165 Szklarczyk D Franceschini A Kuhn M et al The STRINGdatabase in 2011 functional interaction networks of pro-teins globally integrated and scored Nucleic Acids Res201139D561ndash8

166 Bader GD Hogue CW An automated method for finding mo-lecular complexes in large protein interaction networksBMC Bioinformatics 200342

167 DallrsquoOlio GM Vahdati AR Bertranpetit J et alVCF2Networks applying Genotype Networks to SingleNucleotide Variants data Bioinformatics 201531438ndash9

168 Ding L Getz G Wheeler DA et al Somatic mutations affectkey pathways in lung adenocarcinoma Nature20084551069ndash75

169 Case M Matheson E Minto L et al Mutation of genes affect-ing the RAS pathway is common in childhood acutelymphoblastic leukemia Cancer Res 2008686803ndash9

170 Thorstensen L Lind GE Lovig T et al Genetic and epigeneticchanges of components affecting the WNT pathway in colo-rectal carcinomas stratified by microsatellite instabilityNeoplasia 2005799ndash108

Interpreting functional effects of coding variants | 21

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1
Page 22: challenges in proteome-scale prediction, annotation and assessment

171 Weng L Macciardi F Subramanian A et al SNP-based path-way enrichment analysis for genome-wide associationstudies BMC Bioinformatics 20111299

172 Wang K Li M Bucan M Pathway-based approaches for ana-lysis of genomewide association studies Am J Hum Genet2007811278ndash83

173 OrsquoDushlaine C Kenny E Heron EA et al The SNP ratio testpathway analysis of genome-wide association datasetsBioinformatics 2009252762ndash3

174 Tarca AL Draghici S Khatri P et al A novel signaling path-way impact analysis Bioinformatics 20092575ndash82

175 Ramanan VK Shen L Moore JH et al Pathway analysis ofgenomic data concepts methods and prospects for futuredevelopment Trends Genet 201228323ndash2

176 Li Y Agarwal P Rajagopalan D A global pathway crosstalknetwork Bioinformatics 2008241442ndash7

177 Rivals I Personnaz L Taing L et al Enrichment or depletionof a GO category within a class of genes which testBioinformatics 200723401ndash7

178 Rhee SY Wood V Dolinski K et al Use and misuse of thegene ontology annotations Nat Rev Genet 20089509ndash15

179 Chandonia JM Brenner SE The impact of structural gen-omics expectations and outcomes Science 2006311347ndash51

180 Johnson MS Srinivasan N Sowdhamini R et al Knowledge-based protein modeling Crit Rev Biochem Mol Biol 1994291ndash68

181 Sulkowska JI Morcos F Weigt M et al Genomics-aided struc-ture prediction Proc Natl Acad Sci USA 201210910340ndash5

182 Lee D Redfern O Orengo C Predicting protein function from se-quence and structure Nat Rev Mol Cell Biol 20078995ndash1005

183 Worth CL Gong S Blundell TL Structural and functionalconstraints in the evolution of protein families Nat Rev MolCell Biol 200910709ndash20

184 Karczewski KJ Fernald GH Martin AR et al STORMSeq anopen-source user-friendly pipeline for processing personalgenomics data in the cloud PLoS One 20149e84860

185 Kalari KR Rossell D Necela BM et al Deep sequence ana-lysis of non-small cell lung cancer integrated analysis ofgene expression alternative splicing and single nucleotidevariations in lung adenocarcinomas with and without onco-genic KRAS mutations Front Oncol 2012212

186 Patil S Upadhayay A Arya D et al A high throughput exomesequencing approach to analyse events within a good re-sponder CML patient under imatinib at diagnosis and underremission Blood 20131225161

187 Kullo IJ Shameer K Jouni H et al The ATXN2-SH2B3 locus isassociated with peripheral arterial disease an electronicmedical record-based genome-wide association study FrontGenet 20145166

188 Shameer K Denny JC Ding K et al A genome- and phe-nome-wide association study to identify genetic variantsinfluencing platelet count and volume and their pleiotropiceffects Hum Genet 201413395ndash109

189 Sabarinathan R Wenzel A Novotny P et al Transcriptome-wide analysis of UTRs in non-small cell lung cancer revealscancer-related genes with SNV-induced changes on RNAsecondary structure and miRNA target sites PLoS One20149e82699

22 | Shameer et al

at New

York U

niversity on February 16 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv084-TF1