
From genomes to post-processing of Bayesian inference of phylogeny

RAJA HASHIM ALI

Doctoral Thesis
Stockholm, Sweden 2016

TRITA-CSC-A-2016:01
ISSN-1653-5723
ISRN-KTH/CSC/A-16/01-SE
ISBN: 978-91-7595-849-1

KTH
SE-100 44 Stockholm

SWEDEN

Academic dissertation which, with the permission of Kungl Tekniska högskolan (KTH Royal Institute of Technology), is presented for public examination as an academic dissertation on 25 February 2016 in Fire, SciLifeLab.

© Raja Hashim Ali, January 2016

Printed by: Universitetsservice US AB


“Do not let your difficulties fill you with anxiety, after all it is only in the darkest nights that stars shine more brightly.”

— Hazrat Ali Ibn Abu-Talib A.S. (7th century A.D.)

— Dr. Muhammad Iqbal in Bal-e-Jibril (Gabriel’s Wing) 1935

Beyond the stars, other galaxies also exist.
As of now, other tests of love also exist.
You are an eagle, flight is your vocation:
Further skies stretching out before you also exist.

To my parents


Abstract

Life is extremely complex and amazingly diverse; it has taken billions of years of evolution to attain the level of complexity we observe in nature today, which ranges from single-celled prokaryotes to multicellular human beings. With the availability of molecular sequence data, algorithms for inferring homology and gene families have emerged, and similarity in gene content between two genes has been the major signal used for homology inference. Recently, there has been a significant rise in the number of species with a fully sequenced genome, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogenetic analysis explains the relationship between member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. The Markov chain Monte Carlo (MCMC) algorithm, in particular with the Metropolis-Hastings sampling scheme, is the most commonly employed algorithm for determining the evolutionary history of genes. Many software packages are available that process results from each MCMC run and explore the parameter posterior, but there is a need for interactive software that can analyse both discrete and real-valued parameters, and that has convergence assessment and burnin estimation diagnostics specifically designed for Bayesian phylogenetic inference.

In this thesis, a synteny-aware approach for gene homology inference, called GenFamClust (GFC), is proposed that uses gene content and gene order conservation to infer homology. The feature that distinguishes GFC from earlier homology inference methods is that local synteny is combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were computed by applying clustering algorithms to homologs inferred by GFC, and were compared for accuracy, dependence and similarity with gene families inferred by other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from the Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species are computed using this approach.

Another topic of this thesis is the processing of MCMC traces for Bayesian phylogenetic inference. We introduce a new software package, VMCMC, which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines and can handle both real-valued and discrete parameters observed in an MCMC trace. We propose and implement joint burnin estimators that are specifically applicable to Bayesian phylogenetic inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case: the evolutionary history of the FERM domain.


Sammanfattning (Summary in Swedish)

Life is extremely complex and incredibly varied; it has taken evolution billions of years to reach the level of complexity that we see in nature today, which ranges from single-celled prokaryotes to multicellular humans. With the availability of molecular sequence data, the development of algorithms for determining homology and gene families has been rapid, and the similarity between two genes has been the main signal used to determine homology. Recently, there has been a significant increase in the number of species with a fully sequenced genome, which gives an opportunity to investigate and determine homology with greater accuracy and in a more informed way. Phylogenetic analysis describes the relationship between genes in a gene family in a simple, graphical and plausible way using a tree. Bayesian phylogenetic inference is a probabilistic method used to determine gene phylogenies and the posterior distribution of evolutionary parameters by applying the Markov chain Monte Carlo (MCMC) method, in particular Metropolis-Hastings sampling, which is the most widely used algorithm for determining the evolutionary history of a set of genes. Many programs are available for processing the results of an MCMC run and exploring the posterior distribution of the parameters, but there is a need for interactive software that can analyse both tree and continuous parameters, offers convergence assessment and burnin estimation diagnostics, and is specifically designed for Bayesian inference of phylogenies.

This thesis introduces a synteny-aware approach for determining gene homologies, called GenFamClust (GFC). The approach uses gene content and gene order to determine homology. What distinguishes GFC from earlier homology inference methods is that local synteny has been combined with gene similarity to determine homologs without determining homologous regions. GFC was validated for accuracy on simulated data. Gene families were estimated by applying clustering algorithms to homologs determined by GFC and were compared, with respect to accuracy, dependence and similarity, with gene families determined by other popular gene family inference methods on data from eukaryotes. Gene families in fungi determined by GFC were compared against the similar concept of “pillars” in the Yeast Gene Order Browser. Whole gene families for eukaryotic species with fully sequenced genomes are computed using this method, which demonstrates the importance of taking gene order conservation into account in homology inference.

Another topic addressed in this thesis is the processing of MCMC traces for Bayesian inference of phylogenies. We introduce a new software package, VMCMC, which simplifies the processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines and can handle both continuous and discrete parameters from an MCMC trace. We propose and implement joint burnin estimators that are tailored to Bayesian inference of phylogenies. These methods have been compared with other popular methods for convergence diagnostics. We show that Bayesian inference of phylogenies and VMCMC can be used to discover valuable evolutionary information in a biological application: the evolutionary history of the FERM domain.

Contents

Acknowledgements

List of Publications

1 Introduction
    1.1 Thesis overview

2 Genomes: Biology and background to project
    2.1 Genes, proteins and genomes
    2.2 Evolution and evolutionary events
    2.3 Biologically interesting proteins – FERM domain containing proteins

3 Homology and gene family inference
    3.1 Fundamentals
    3.2 Homology and gene family inference methods

4 Phylogenetic Inference
    4.1 Fundamentals of Phylogeny
    4.2 Computing Phylogenetic Trees – Traditional Approaches
    4.3 Bayesian phylogenetic inference

5 Post-processing of traces from MCMC runs
    5.1 MCMC Characteristics
    5.2 Post-processing MCMC traces
    5.3 Software packages for post-processing of MCMC traces

6 Present Investigations

7 Discussion & Conclusion
    7.1 Discussion
    7.2 Future perspectives

Bibliography

Acknowledgements

It is my pleasure to express my gratitude to my advisor, Lars Arvestad, for his continuous support, motivation, encouragement, and help during my doctoral studies and research. Without his help, this thesis would not have been possible. With his guidance, I started to learn how to do research and how to grow academically with a collaborative spirit. With his motivation and encouragement, I started to explore the strength of innovation. I also appreciate the support of his wife, whom we freely involved in reading through our manuscripts, and I am grateful to Emma Arvestad for delicious apple pies and to Alexander Arvestad for his toys and stories! In short, thank you, Arvestad clan, for your part in my PhD and in my stay in Sweden.

I would like to thank Ammad Aslam Khan, with whom I have collaborated on the FERM paper among other works. The collaboration showed me the joint strength of theoretical and problem-driven research, taught me to believe in myself and gave me the opportunity to do independent research. I badly miss our discussions on science and our Saturday dinners with work. I want to express my gratitude to my friends and collaborators, Sayyed Auwn Muhammad, Muhammad Mushtaq and Dr. Dara Mohammad. It has been an educational and interesting experience to collaborate with you all and to learn from you about your fields. Best of luck to you, Auwn and Mushtaq, for your degrees and publications, and to Dara for his goals. I am also grateful to all my other collaborators for their help.

I owe my sincere gratitude to Bengt Sennblad and Jens Lagergren, my co-supervisors, who have given me generous help and encouragement. Bengt has always been at an arm’s distance (literally!) from my desk and has been extremely helpful with all my little personal and professional queries. Thank you for your time, help and sincere, useful and important advice on all matters. Jens has been overseeing my study and research, and has given invaluable guidance and support in critical situations. Thank you, and I wish you all the best.

I was fortunate to have colleagues at SciLifeLab with whom I have spent some memorable time. Auwn Muhammad: thank you for hosting me in one of my tough times, for constant support in projects and for great motivational discussions. Owais Mahmudi and Ikramullah: I fondly remember the discussions, food and time spent with both of you. Kristoffer Sahlin and Mattias Franberg, statisticians stuck in the bodies of computer scientists: your fights over frequentist versus Bayesian approaches are legendary! Joel Sjöstrand: thank you for all the advice and interest in VMCMC. Erik Sjölund: thank you for discussions on language, regional interests, music and the quality time spent with you. Matthew The: best of luck for your PhD, and to Roger Federer to win that last elusive slam some day. Hossein Farahani: I wish you all the best in Canada. I should especially mention my other comrades with whom I have had a great time: Annu, Lumi, Victor, Linus, Yrin and Mehmood Khan.

I would like to thank my Pakistani gang (Asrar Mehdi, Faheem Mughal, Qamar Toheed, Adeel Yasin, Nabeel Shehzad, Iram Bilal, Alamdar Hussain, Shahid Hussain, Rashid Mehmood, little Rameen Rashid, Muhammad Mushtaq, Sharif Hasni, Aamer Riaz, Imran Jamali, Naeem Anwar, little Ibraheem, Irfan Khan and Farasat Zaman) for making these five years in Sweden memorable. Without your company, discussions and your parties with delicious baryanis, karahis and other desi food, I would have been bored to death in this grey, cloudy, gloomy weather. Amit Gahoi, I miss your late night discussions, mushroom cooking and pav bhaji. Best wishes for your work in Germany! Finally, to my roommates Muhammad Irfan and Syed Muhammad Zubair: thank you for late night discussions, Game of Thrones talks and wonderful companionship.

I want to thank my family, especially my parents, Zafar Abbas and Saria Zafar, for giving me my genome, for nurturing and educating me, for having faith in me, and for their constant care and support throughout my life. My grandfather, Lt. Col. (Retd) Sadaqat Ali, has been a strict mentor to me since I was young and I am extremely grateful to him. My siblings, Usra Hussain and Raja Manzar, have been supportive during this time. I would like to give special thanks to my wife, Umme Rabab Syed, who has been my life support for the past 7 years and has my gratitude and acknowledgement for contributing her part in what I am today. Last but not least, I believe I would have finished my degree a long while ago, had it not been for the mischief, love, naughtiness and roguery of the three musketeers, Saifullah Bhatti, Mustafa Ali and Ibraheem Ali. Life would have been much more boring without them!

List of Publications

I: Raja H. Ali, Sayyed A. Muhammad, Mehmood A. Khan, & Lars Arvestad.
Quantitative synteny scoring improves homology inference and partitioning of gene families.
BMC Bioinformatics 2013; 14(Suppl 15):S12.
RHA helped in designing the algorithm, implemented the algorithm, prepared the biological datasets, performed comparative analysis and drafted the manuscript.

II: Raja H. Ali, Sayyed A. Muhammad, & Lars Arvestad.
GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm.
Manuscript.
RHA conceived and designed the study, prepared the biological datasets, performed comparative analysis with most software and drafted the manuscript.

III: Raja H. Ali, Mikael Bark, Jorge Miró, Sayyed A. Muhammad, Joel Sjöstrand, Syed M. Zubair, Raja M. Abbas, & Lars Arvestad.
VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces.
Manuscript.
RHA led and coordinated the implementation, implemented most of the functionality and drafted the manuscript.

IV: Raja H. Ali, & Lars Arvestad.
Burnin estimation and convergence assessment in Bayesian phylogenetic inference.
Manuscript.
RHA prepared the datasets, performed experiments and drafted the manuscript.

V: Raja H. Ali∗, & Ammad A. Khan∗.
Tracing the evolution of FERM domain of Kindlins.
Molecular Phylogenetics and Evolution 2014; 80:193-204.
RHA shared equal responsibility in data extraction, analysis of results and drafting of the manuscript.

∗ Contributed equally to this manuscript.


Chapter 1

Introduction

Officially, this thesis is presented in the School of Computer Science (CSC), but the reader will note that the content is a fusion of biology and computer science, in short computational biology. The work has been carried out at SciLifeLab, Solna campus, and at the Department of Computational Biology in CSC. One can view this work as a pipeline, where the input is a set of annotated genomes from different species, followed by identification of homologous gene pairs and inference of gene families using clustering algorithms. These gene families are then analysed using probabilistic Bayesian inference of phylogeny, which is known to be more accurate and informative than traditional methods of phylogeny inference. Finally, Markov chain Monte Carlo runs obtained from Bayesian phylogenetic inference are post-processed and analysed to extract interesting results about the evolutionary parameters underlying the gene data. The focus of the research in this thesis is on two subtopics: homology inference and post-processing of Bayesian inference of phylogeny.

1.1 Thesis overview

The thesis is organized according to the topics in the project pipeline. Chapter 2 deals with the input of the pipeline and explains the origin and definition of the biological concepts and terminology used in later chapters. In Chapter 3, homology inference and gene family inference are discussed in detail and existing gene family inference algorithms are briefly presented. In Chapter 4, a brief overview of advances in phylogenetic inference is given, with a focus on methods that employ Bayesian phylogenetic inference. Chapter 5 discusses some characteristics of MCMC runs resulting from Bayesian phylogenetic analysis, sheds light on existing software packages that explore these characteristics and presents a case study in which Bayesian phylogenetic analysis and post-processing are applied to infer the phylogenetic tree for the FERM domain. Chapter 6 provides a brief summary of the papers presented in this thesis and Chapter 7 provides conclusions, pros and cons, and the future outlook of this work.


Chapter 2

Genomes: Biology and background to project

There are countless fine introductions to molecular biology and genomics (see, e.g., “Learn.Genetics” from The University of Utah [148], “DNA from the Beginning” from Cold Spring Harbor Laboratory [109] and standard reference textbooks for molecular biology like [2] and [119]). This chapter, however, touches only on those concepts in molecular biology that are the essential minimum knowledge for the following chapters.

This chapter starts with a primer on genomics and continues with a discussion of important components of genomes. Further, the composition and appearance of genomes and of genes, the building blocks of the genome, are discussed. Because the focus of this work is mainly on algorithms that infer evolution, knowledge of the biological concepts related to gene and protein evolution is essential for understanding the contents of this thesis. We conclude this chapter with a discussion of the effects of these evolutionary events on gene order and gene content similarity.

2.1 Genes, proteins and genomes

Genomes and their composition

A cell is the smallest structural, functional and biological unit in an organism, and each cell has a genome, a collection of chemical molecules that contain the complete set of instructions for all cellular activities. Thus, genomes are essential components of life and the primary genetic material in living beings. Each species has a characteristic genome, different from that of other species, which explains differences in morphology, behaviour and other characteristics between species. Genomic traits (e.g., chromosome number and genome size) are usually unique to a species [173, 16]. Additionally, the genome of a particular individual is, as a whole, also unique; genomes of other organisms of the same species are not exactly the same. Small, subtle differences between these genomes confer individual characteristics, e.g., eye color, skin color, height, obesity and other morphological features that vary within a species [10, 71].

Figure 2.1: The structure of the DNA double helix [206].

A typical eukaryotic genome can be seen as a group of twisted thread-like structures called chromosomes, which are strands of chemical molecules called nucleotide bases and are composed of deoxyribonucleic acid (DNA). DNA consists of a sequence of four different nucleotide bases: adenine (A), thymine (T), cytosine (C) and guanine (G). Friedrich Miescher isolated DNA in 1869 and identified it as a major chemical in the nucleus [34]. Subsequent experiments performed by Oswald Avery [122] and later replicated by Hershey et al. [90] revealed that genomes are composed of DNA and that DNA is the chemical responsible for genetic inheritance. Later, James Watson and Francis Crick discovered the three-dimensional structure of DNA (shown in Figure 2.1), which we now know is a twisted-ladder, double-stranded helix [206]. They showed that the chromosome is chemically composed of two long chains, or strands, of nucleotide bases (now known as the Watson strand and the Crick strand) such that for each base on one strand, the opposite strand contains a corresponding paired base. Each species generally has a characteristic karyotype, and all genetic material is stored in these strands of DNA. Most higher eukaryotes have a multiploid genome, i.e., two or more copies of each chromosome can be found in the genome, which plays an essential role in inheriting genomic content during cell division. Other eukaryotes and prokaryotes have haploid genomes, i.e., they contain one copy of each chromosome. In eukaryotes, all chromosomes are located in the cell nucleus, whereas in prokaryotes, chromosomes are found in the cytoplasm.
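The pairing between the two strands can be illustrated with a minimal Python sketch. It assumes the standard Watson-Crick pairing (A with T, C with G) and that the two strands are read in opposite directions; the function names and the example sequence are hypothetical and serve only as illustration.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the paired base for every base, position by position."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    watson = "ATGCA"  # hypothetical example sequence
    print(complement(watson))
    print(reverse_complement(watson))
```

Applying `reverse_complement` twice returns the original strand, which reflects that either strand fully determines the other.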

Genes – the building blocks of life

While the phenomenon of inheritance from parents to offspring was long known, Gregor Mendel, in the mid 19th century, was one of the first scientists to formalize this concept and to introduce the principles of heredity [129, 130]. He introduced factors (known to us now as genes) as the hereditary unit and also gave a law of independent assortment, which states that the gene for each trait is inherited by the offspring irrespective of the genes for other traits. These genes are responsible for various cell functions and are the building blocks of life. Thomas Morgan won the Nobel Prize in 1933 for discovering that genes are physically located on chromosomes [140].

In this view, a chromosome (shown in Figure 2.2) is an ordered collection of genes separated from each other by intergenic regions, a subset of noncoding DNA.

Figure 2.2: Representation of a chromosome as genes and intergenic regions [208].

From gene to protein – a central dogma of molecular biology

Genes are not the only chemical molecules behind cellular activities. Ribonucleic acid (RNA) and proteins are also functionally important molecules in the cell and are necessary for almost all cellular functions.

Jöns Jacob Berzelius, a Swedish chemist, coined the term protein during his work with Gerardus Johannes Mulder on elemental analyses of organic compounds in 1838 [201]. Urease was the first enzyme to be crystallized, and probably the first protein with a known function [193]. The first protein known by sequence was insulin, which Frederick Sanger sequenced in 1953 using his protein sequencing method [178, 179, 177, 176].

Friedrich Miescher discovered nucleic acids in 1868 [35], but it was only at the end of the nineteenth century that experiments on the sensitivity of RNA and DNA to alkali highlighted the chemical differences between them; subsequently, RNA was identified as a chemical molecule in the cell distinct from DNA. The importance and contribution of RNA in synthesizing protein was discovered in 1939 [20].

The relationship between gene, RNA and protein (now termed the central dogma of molecular biology) became clear only in the last seventy years. In 1959, Severo Ochoa discovered the process of RNA synthesis from DNA (called transcription) and was awarded the Nobel Prize in Medicine for the discovery [147]. Nirenberg and Matthaei showed in 1961 that each amino acid in a protein is encoded by three nucleotides in DNA and that proteins are formed from RNA by a process called translation [144]. Francis Crick is regarded as the pioneer in formulating the central dogma of molecular biology in 1956 [31], but the term was formally coined in 1970 in a Nature publication [32]. The summary of Crick’s work is that genetic information is stored in DNA, is transferred to the cytoplasm in the form of ribonucleic acid (RNA) and is finally translated into proteins, the workhorses of the cell. The cellular machinery transcribes instructions encoded in a gene sequence to produce RNA of various types [111, 220], of which messenger RNAs (mRNA) are further translated into proteins (Figure 2.3). Details of both the transcription and translation processes can be found in many textbooks, e.g., in Chapters 4 and 5 of [2].

Figure 2.3: The central dogma of molecular biology – from DNA to RNA (transcription) to protein (translation).
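The two steps of the central dogma can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand of a gene without introns, and only a handful of codons from the standard genetic code are included in the table. The function names and the example sequence are hypothetical.

```python
# Partial codon table: only a few entries of the standard genetic code are
# listed to keep the sketch short. "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(coding_strand: str) -> str:
    """Write the coding DNA strand as mRNA by replacing thymine with uracil."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" stands for a codon that is missing from the partial table.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGTTTGGCTAA"  # hypothetical coding sequence
    mrna = transcribe(gene)
    print(mrna, translate(mrna))
```

A complete implementation would use the full 64-codon table; the partial table here is only meant to show the triplet reading frame.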

Eukaryotic genes can be divided into exons and introns [72]. Exons are DNA sequences that can be found in at least one mature RNA product of the gene. Introns, on the other hand, are DNA sequences that are not present in the final RNA product but that control the presence or absence of a particular exon and decide the order of exons in the RNA during RNA splicing [27]. RNA splicing is a cellular process in which a copy of a gene is made to produce primary RNA, introns are removed from the primary RNA and the remaining sequences are stitched together in the order necessary to produce mature RNA. Hence, one gene can be translated into many diverse protein products (termed gene isoforms) through the presence or absence of certain exons and through shuffling of exons due to RNA splicing.
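The effect of splicing on the number of possible products can be sketched as follows. The sketch is a simplification that models only exon skipping (exons are kept in their genomic order, and shuffling is not modelled); the exon coordinates, the toy sequence and the function names are hypothetical.

```python
from itertools import combinations


def splice(primary_rna: str, exons: list[tuple[int, int]]) -> str:
    """Join the given exons (half-open [start, end) intervals) and drop everything else."""
    return "".join(primary_rna[start:end] for start, end in exons)


def isoforms_by_exon_skipping(primary_rna: str, exons: list[tuple[int, int]]) -> set[str]:
    """Enumerate mature RNAs that keep at least one exon, in genomic order."""
    products = set()
    for size in range(1, len(exons) + 1):
        for kept in combinations(exons, size):
            products.add(splice(primary_rna, list(kept)))
    return products


if __name__ == "__main__":
    # Upper case marks exons and lower case marks introns in this toy sequence.
    primary = "AAAgtcagCCCgtgagGGG"
    exon_coordinates = [(0, 3), (8, 11), (16, 19)]
    print(splice(primary, exon_coordinates))
    print(sorted(isoforms_by_exon_skipping(primary, exon_coordinates)))
```

With three exons the sketch yields seven distinct products, which illustrates how a single gene can give rise to several isoforms.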


2.2 Evolution and evolutionary events

Gene level evolution

Genes are not static entities; they evolve over time and across generations. This evolution is responsible for creating variance in the gene pool and even for developing new gene functions (see [120]). Sequence mutations [26] and indel events (sequence insertions and deletions) are important evolutionary events that change the molecular sequence, are known to disturb similarity among genes and play a vital role in sequence divergence among species [143]. Copying errors during DNA replication, unrepaired DNA damage and insertions and deletions of DNA caused by mobile genetic elements are known to cause permanent sequence mutations visible in the next generation [26]. Since molecular sequence reflects protein structure and function [53], these events play a defining role in functional and structural evolution at the gene level.

Gene duplication and gene loss are significant gene level evolutionary events that, when combined with sequence mutation and indel events, are nature’s way of developing new cellular functions. Fisher is credited with introducing the concept of gene duplication in 1928 [60], and Ohno developed it into a coherent concept in his work in 1970 [150]. Phenomena like neofunctionalization [150] and subfunctionalization [191, 66] can take place via gene duplication followed by sequence mutation in one or both genes. Neofunctionalization is a process in which one of two duplicated genes mutates to gain a novel function not attributed to the parent gene [164]. For example, in Antarctic zoarcid fish the sialic acid synthase (SAS) gene duplicated and one of the copies evolved to become the extant antifreeze protein gene. This functional evolution is hypothesized to be a result of neofunctionalization: the ancestral and extant SAS genes have sialic acid synthase and rudimentary ice-binding functionalities, whereas the antifreeze protein gene specializes in noncolligative freezing point depression [43]. Subfunctionalization, on the other hand, is an evolutionary process in which each duplicate of the original ancestral gene retains a subset of its original function [191, 66]. The hemoglobin protein in Homo sapiens is an example of subfunctionalization, where a duplicate copy of the hemoglobin β-chain evolved to form the gene for the hemoglobin α-chain, but neither of the two chains can function without the other to form a functional hemoglobin molecule [28].

Spontaneous formation of genes from DNA (or de novo origination of genes) is another way for new genes to form and has received increasing attention in recent times [44, 196]. Though such events are uncommon, there is compelling evidence for de novo origination of genes in viruses, prokaryotes and eukaryotes (see, e.g., in Homo sapiens [106], in mammals [132], in Drosophila melanogaster [223], in yeast [19, 52], in Plasmodium vivax [214], in Escherichia coli [41] and in viruses [172]). The function and evolution of de novo originated genes has been discussed in detail by Wu and Zhang [211].


Chromosome level evolution

Evolution occurs both at the gene level and at the chromosome level, on groups of genes. The term synteny is currently defined as conservation of order between two groups of genes (or other genomic content) present in the two chromosomes, contigs or genomic regions being analysed; it reflects gene order conservation between two chromosomal regions from the same or different species. When applied to chromosomes from two different species, this concept is referred to as shared synteny, e.g., many Homo sapiens genes are syntenic with those of other mammals. Figure 2.4 shows an example of shared syntenic regions between Homo sapiens and Mus musculus [24]. Shared synteny between regions that is stronger than expected between the species can indicate shared regulatory mechanisms as well as support functional relationships between syntenic genes [138]. Synteny can be used to make a rough estimate of the evolutionary divergence between two chromosomal segments [76] because, in general, relatively recently diverged organisms share similar blocks of genes in their genomes, and divergence times have been found to be inversely proportional to synteny conservation [36, 209].

Figure 2.4: Mapping of syntenic regions of Homo sapiens chromosomes onto the chromosomes of Mus musculus. The figure shows the chromosomes of Mus musculus; each Homo sapiens chromosome is assigned a unique colour, defined in the legend, and chromosomal regions of Mus musculus that share high similarity with regions of a Homo sapiens chromosome are filled with the colour assigned to that chromosome [24].
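One simple way to quantify gene order conservation around a pair of genes is to compare their neighbourhoods. The following Python sketch is a hypothetical illustration of that idea and is not the scoring used by GenFamClust or any other published method: chromosomes are represented as ordered lists of gene family labels, the window size `k` is an assumed parameter, and the score ignores gene orientation.

```python
def neighbourhood(chromosome: list[str], index: int, k: int) -> list[str]:
    """Return up to k genes on each side of the gene at the given index."""
    start = max(0, index - k)
    return chromosome[start:index] + chromosome[index + 1:index + 1 + k]


def local_synteny(chrom_a: list[str], i: int, chrom_b: list[str], j: int, k: int = 2) -> float:
    """Fraction of neighbouring gene families shared by the two neighbourhoods."""
    around_a = set(neighbourhood(chrom_a, i, k))
    around_b = set(neighbourhood(chrom_b, j, k))
    if not around_a or not around_b:
        return 0.0
    return len(around_a & around_b) / min(len(around_a), len(around_b))


if __name__ == "__main__":
    # Hypothetical gene family labels along two chromosomal regions.
    region_1 = ["f1", "f2", "f3", "f4", "f5"]
    region_2 = ["f9", "f2", "f3", "f4", "f8"]
    print(local_synteny(region_1, 2, region_2, 2))
```

A score near 1 means the two genes sit in largely the same genomic context, while a score of 0 means their neighbourhoods share no gene families.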

Gene translocations and chromosome translocations are important evolutionary events that rearrange the genome by separating two loci or by joining two previously separate pieces of a chromosome, thereby changing syntenic conservation between regions [159]. Whole genome duplication (WGD) followed by massive gene loss in one or both chromosomal regions is a special event that has attracted great interest in recent times, e.g., in fungi [49], fishes [95], flowering plants [33] and vertebrates [40].

A common molecular mechanism for translocation involves mobile genetic elements, commonly known as transposons, which are stretches of DNA capable of changing their position within the genome through either a cut-and-paste mechanism (DNA-only transposons) or a copy-and-paste mechanism (retrotransposons) [152]. Retrotransposons are common in eukaryotic cells, in particular in plant cells, and are responsible for substantial parts of vertebrate genomes, e.g., long interspersed nuclear elements (LINEs) form up to 17% of the human genome and short interspersed nuclear elements (SINEs) make up to 11% of the human genome [30]. Retrotransposons usually use an RNA intermediate that is copied back into the genome. Regions of the original gene that are absent from this RNA intermediate, such as introns removed by splicing, are therefore missing in the copied gene, which plays an important role in syntenic and sequence divergence of a eukaryotic genome [152].

In short, evolution occurs at both the gene level and the genome level, and an understanding of basic biological events such as sequence mutation, sequence indels, gene loss, gene duplication and gene translocation together may explain the differences in content similarity and order similarity between two or more genes on two or more chromosomes.

2.3 Biologically interesting proteins – FERM domain containing proteins

FERM domain containing proteins (FDCPs) are a specific class of proteins that contain a FERM domain near the N-terminus of the protein. The name FERM is derived from the presence of this domain in band 4.1 (F), ezrin (E), radixin (R) and moesin (M) proteins [25]. FDCPs are characterized as signalling molecules that are used for in-out and out-in signalling between the plasma membrane and cytoskeletal structures. FDCPs can be observed in a diverse group of organisms: in vertebrates, invertebrates and even plants. Interactions between two proteins or between a protein and a lipid are regulated through the FERM domain, which in turn regulates many important protein functions (e.g., the activity, sub-cellular localization and recruitment of proteins and/or lipids into macromolecular complexes). FDCP members with known function are kindlin, myosin, band 4.1, ezrin, radixin, moesin, kinase-like calmodulin-binding protein, talin, Krev interaction trapped (KRIT), Focal adhesion kinase (FAK), Janus kinase (JAK) and Guanine nucleotide exchange factors (GEFs). FDCPs are known to be involved in many diseases (e.g., kindlin homologs in Kindler syndrome [102, 85], leukocyte adhesion deficiency III (LAD-III) [194, 203] and abnormal expression in different types of cancers [110, 221, 67]). Embryonic knockout of Kindlin-2 in Mus musculus is lethal for the embryo [137]. The FERM domain can be divided into three subdomains, F1, F2 and F3, which are arranged in a clover-like shape [84]. The FERM domain is known to interact with and bind many molecules, e.g., integrin, Ca2+ ions, actin, IP3, PIP2 and PIP3 [82, 50, 195]. FDCPs are therefore a biologically important protein family, and the evolution of the FERM domain in FDCPs is an interesting subject.

Chapter 3

Homology and gene family inference

This chapter contains an overview of gene homology and gene families, and of the computational methods to infer homologs and gene families. The chapter starts with a primer on the fundamental concepts and historical background of homology and gene family inference. It concludes with a discussion of current computational methods and important heuristics, in particular sequence similarity and synteny, for inferring homology and gene families.

3.1 Fundamentals

Homology

Homology is a widely applied concept in many fields of biology, e.g., in comparative evolutionary biology, phylogeny reconstruction [59], and developmental biology [15]. The term homology was introduced in 1843 by Owen [153] and was later adapted to an evolutionary framework [94] after Darwin’s famous concept of evolution by descent under natural selection [37]. In short, it was loosely defined as “two parts having a common evolutionary origin”.

At present, under the most commonly accepted definition, homologous genes are a group of genes that share a last common ancestor (LCA) either through speciation or through gene duplication [61]. It is important to note that under this definition, homology is a binary concept: two genes are either homologous or not, and homology cannot be expressed as, for example, a percentage [166, 62, 207]. Zuckerkandl and Pauling were pioneers in the field of molecular evolution [139, 225]. They highlighted the subtle difference between homologs resulting from speciation and those resulting from duplication in the 1960s, but the terminology and formal classification of homologs into orthologs (homologs related through speciation at their LCA) and paralogs (homologs related through gene duplication at their LCA) is credited to Walter M. Fitch [61]. This distinction is important mainly because of the debated hypothesis that orthologous genes are functionally more similar than paralogous genes [107, 142, 5, 23, 168]. However, it is widely accepted that homology is positively correlated with common structure and function of genes.

Molecular Sequence Similarity

Calculating molecular sequence similarity between all genes is the foundation of homology inference and gene family inference. All homology inference algorithms are primarily based on molecular sequence similarity. Computational techniques for sequence comparison are usually classified into alignment-based and alignment-free methods. Alignment-based algorithms are accurate and robust, but harder to apply to large datasets than alignment-free algorithms (see, e.g., the computation time for performing BLAST and for computing homologs using afree for the complete genomes of Homo sapiens and Mus musculus [124]). Alignment-free methods (see, e.g., afree [124] and Universal Sequence Mapping [3]) are an efficient alternative based on word count statistics and the k-tuple content of both sequences, but tend to be dominated by single-sequence noise and are not as accurate as alignment-based sequence comparison methods. For further details on alignment-free sequence comparison methods, see [202]. I restrict myself to alignment-based methods in this work.

Pairwise sequence alignment

Pairwise sequence alignment measures sequence similarity quantitatively: two sequences are aligned such that common residues are placed in the same column and, depending on the aim of the alignment, an objective function represented by a score is maximized. The objective function is quantified by a scoring scheme, which penalizes indels and scores matches and mismatches using substitution scores from a substitution matrix (usually PAM matrices, initially developed in 1970 and recalculated in 1978 by Margaret Dayhoff [38], or BLOSUM matrices by Henikoff and Henikoff [89]). A sample pairwise sequence alignment is shown in Figure 3.1, where the characters in each row represent the amino acid or nucleotide sequence of a particular protein or gene, ‘-’ represents an indel in one sequence, matching characters in a column represent a match and mismatching characters in a column represent a mismatch. When two sequences are aligned to determine their optimally matching regions or subsequences, the alignment is termed a local alignment. A global alignment, on the other hand, is the optimal alignment over the complete sequences. Variants of these alignment methods are also known (see, e.g., semiglobal alignment, also known as ends-free alignment).

Figure 3.1: A sample pairwise sequence alignment between two proteins. Conservation is shown in the third row under the two sequences: ‘*’ denotes an identical amino acid aligned in both proteins, while ‘:’ and ‘.’ denote aligned non-identical amino acids with highly similar and weakly similar properties, respectively.

The exact optimal alignment under a given scoring scheme can be computed with a dynamic programming algorithm, in particular the Needleman-Wunsch algorithm [141] for global alignment and the Smith-Waterman algorithm [184] for local alignment of two protein sequences. While these algorithms guarantee the optimal alignment under a scoring function, the right scoring function to reflect the alignment goals is usually determined empirically from data. Applying a minimum threshold to the optimal alignment score then determines whether two sequences are homologs or not.
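The dynamic programming recurrences behind both algorithms can be illustrated with a minimal sketch. The match, mismatch and linear gap scores used here are illustrative assumptions that stand in for a substitution matrix such as PAM or BLOSUM, and the function name is hypothetical; only the optimal score is returned, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    The scoring values are illustrative assumptions, not tuned parameters.
    """
    # First row of the table: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)      # a local alignment may restart anywhere
                best = max(best, score)    # and may end anywhere
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                                    # global
print(alignment_score("AAAGATTACATTT", "CCCGATTACAGGG", local=True))     # local
```

Both variants fill a table of size len(a) x len(b), which is the quadratic cost per sequence pair. A homology call would then compare the returned score against an empirically chosen threshold.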

The quadratic computation time required for each pair of sequences makes these methods impractical for large datasets. Therefore, heuristic algorithms based on k-tuples (word methods) became popular; they remain applicable because of their linear time complexity, although they do not explore all solutions. Common software suites in this class are BLAST [6] and FASTA [118]. As heuristic methods, BLAST and FASTA do not guarantee an optimal solution. It has been shown that heuristic-based similarity search programs can be erratic (see, e.g., extension of a local alignment beyond the actual homologous domain by BLAST [77]). Despite these limitations, BLAST is a well-known and widely used similarity measurement tool due to its applicability to large datasets.
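The word-based idea can be sketched as follows. This is not the BLAST or FASTA implementation; it is a simplified illustration, with hypothetical function names, of how indexing k-tuples of one sequence lets shared words be found in a single pass over the other, and of how hits on the same diagonal suggest a region worth aligning in detail.

```python
from collections import Counter, defaultdict


def shared_words(query, target, k=3):
    """Return (query_pos, target_pos) pairs at which both sequences share a k-tuple."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)      # word -> start positions in target
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (diagonal, number_of_hits) for the diagonal i - j with the most hits."""
    counts = Counter(i - j for i, j in hits)
    return counts.most_common(1)[0] if counts else None


hits = shared_words("ACGTAC", "TTACGT", k=3)
print(hits, best_diagonal(hits))
```

Only the neighbourhood of well-supported diagonals would be passed on to a more expensive alignment step, which is why such heuristics can miss the optimal alignment.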

Gene family

A gene family is defined as a group of homologous genes that evolved through tree-like vertical evolution, in which one ancestral gene evolves over time, undergoes several gene duplication and gene loss events, and results in a group of extant genes pooled as a single gene family [149]. Such models of gene family inference follow a strict tree-like evolution, where one gene belongs to exactly one family and all genes in the gene family are homologous to each other. Under this model of evolution, homology is transitive, i.e., if A and B are homologs and B and C are homologs, then A and C are also homologs, and A, B and C belong to the same gene family. The transitive property of homologs in a gene family has been employed in guiding structural studies [167].

The term gene family has many definitions and is often used interchangeably with protein family. In this thesis, gene family refers to all genes that as a whole have evolved from a common ancestral gene. Other definitions of gene family also exist in the literature, e.g., arising from structural similarity [29] or from functional grouping based on working in the same metabolic pathway. Domain family reflects the sharing between proteins of one or more common domains, where a domain is a conserved part of a given protein sequence and structure that usually evolves, functions, and exists irrespective of the evolution of the rest of the protein chain [17]. The term superfamily has been used in the context of domains to reflect structural homology, i.e., common secondary and tertiary structure of two proteins [155]. From this point on, I will use the term gene family only for genes related through vertical inheritance from a common ancestor; all other definitions pertaining to domain, structural and functional grouping will be ignored.

There are many advantages and applications of homology inference and gene family classification. Family classification of individual genes helps in describing relationships between genes and in predicting the function, structure and expression patterns of newly identified genes through their shared similarity with known genes [210, 42]. Gene families can also aid in the identification of genes that are active in particular diseases [104].

The strict tree-like definition of gene family and homology is a rather simplistic view of homology. Phylogenetic network thinking (PNT) provides an alternative definition of homologs and gene families, in which legitimate recombination events are allowed along with vertical evolution [105]. These events turn a phylogenetic tree into a phylogenetic web that relates closely related sequences without affecting homology relationships. The major applications of PNT are in analyzing legitimate recombination and understanding contradictions in gene or genome histories. Another mode of visualizing gene family evolution is Goods Thinking (GT), which allows evolution by horizontal dissemination of genes (see, e.g., recombination events, fusion, fission, etc.) along with duplications and losses [128, 8]. Refer to [81] for further discussion of non-tree-like definitions of homology.

There are many known practical difficulties in identifying homologs and defining gene families with the strictly tree-like definition of a gene family [81]. Sometimes more information about a gene is required to classify it as a member of an established family: events like convergent evolution (causing similarity between two genes that do not have a common evolutionary origin [154]) are impossible to detect from sequence similarity alone, and more information (e.g., gene order) is needed to identify such gene pairs [61]. In other cases, genes may belong to multiple families: conflicting homology assignments for a gene (e.g., in the case of genes related by fission or fusion events) make it necessary to assign the gene to two families, which violates the definition of gene family and the transitive property of homology [222]. Strict tree-like evolutionary models lack the machinery to handle these complications [81].

These complexities can, however, either be handled or ignored in some cases. Sequence convergence remains a rare phenomenon, and few examples of genes related by sequence convergence have been found in recent times [45, 21]. Convergent evolution of domain architectures is also rarely observed [79]. Therefore, sequence similarity arising from convergent evolution can, in general, be ignored in homology inference. In other cases, simplicity in gene family inference is preferred over accuracy, e.g., some problems strictly demand the assignment of one gene to one gene family.

Homology and gene family inference for multi-domain proteins is especially challenging (see, e.g., [55]), in particular for gene families containing promiscuous domain(s) [9] and diverse domain architecture(s) [187]. Domain content inserted into two non-homologous proteins (e.g., through convergent evolution) should not make the pair suddenly homologous and should be discounted in homology inference. However, it is difficult to differentiate between multi-domain homologs (that follow vertical inheritance) and non-homologous proteins with shared domains based on sequence similarity alone. Also, shared domains are common in many species (see, e.g., in Homo sapiens [116]) and can link two non-homologous proteins through a strong local similarity, and thus prove problematic in homology and gene family inference.

Cluster analysis algorithms

Cluster analysis, or clustering, attempts to group objects such that objects in the same group (cluster) are more similar to each other than to objects in other clusters with respect to an objective evaluation criterion [96]. In gene family inference, the set of genes acts as the data: given gene pairs with quantified similarity scores, the objective is to group all homologs of a gene (direct or transitive, depending on the objective evaluation criterion) together in the same cluster and all non-homologs of the gene in other cluster(s). Homologous genes are thus grouped into gene families by applying a clustering algorithm.

Clustering is an optimization problem with one or more goals, which depend on the underlying data and involve optimization of clustering parameters, usually by trial and error, to obtain clusters with the desired properties. Clustering algorithms differ notably from each other in how they define clusters and how efficiently they find them [212, 213, 136]. The most popular notion of a cluster is a densely connected region of a similarity graph, where each vertex represents a data point and each edge represents a connection (or similarity) between two data points. In the simplest case, a cluster is composed of all connected nodes, and members not reachable from a node belong to a different cluster. Clustering algorithms try to optimize an objective function, which can be based on obtaining groups with minimal difference between the members of a cluster, on clustering together densely populated areas, on intervals in the data, or on particular statistical distributions of the data [135]. Data preprocessing and modification of model parameters are usually repeated until the result achieves the desired properties [11].

Clustering algorithms are the backbone for inferring gene families. Hierarchical clustering [205, 48] is the most commonly used technique in bioinformatics for inferring gene families. Other useful clustering algorithms in data mining include centroid-based clustering (e.g., k-means clustering [123]), distribution-based clustering and density-based clustering [135, 189, 117, 56]. These methods require some prior information about gene families, e.g., the underlying distribution or the number of gene families, which is usually not known before the analysis. Therefore, these methods are generally not applicable to gene family inference.

Hierarchical clustering (connectivity-based clustering) brings related objects closer, increases the distance to unrelated objects [205] and provides an extensive hierarchy of clusters, which are joined or divided according to the distances between them, instead of a single partitioning of the data set. Hence, a threshold on distance defines the limit up to which two or more clusters can be combined (agglomerative clustering) or a cluster can be broken into two or more smaller clusters (divisive clustering).

Distance computation and the linkage criterion play an important role in hierarchical clustering [96]. The distance function (see, e.g., Euclidean distance, Hamming distance [83] and Manhattan distance) and the linkage criterion (the amount of evidence/edges relative to cluster size required for merging two clusters) determine the type of clusters obtained from hierarchical clustering. Single linkage (minimum of object distances) [80, 181], complete linkage (maximum of object distances) [39] and average linkage (average of object distances, as in UPGMA, the Unweighted Pair Group Method with Arithmetic Mean [185]) clustering are popular choices.
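The effect of the linkage criterion can be illustrated with a small, naive sketch of agglomerative clustering with a distance threshold. The function names, the one-dimensional toy data and the threshold are assumptions made for illustration only; practical implementations use far more efficient update schemes.

```python
def cluster_distance(c1, c2, dist, linkage="single"):
    """Distance between two clusters under the chosen linkage criterion."""
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(pairwise)                  # closest pair of objects
    if linkage == "complete":
        return max(pairwise)                  # farthest pair of objects
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(items, dist, threshold, linkage="single"):
    """Merge the two closest clusters until their distance exceeds the threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break                             # no pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [0, 1, 2, 10, 11]
distance = lambda a, b: abs(a - b)
print(agglomerate(points, distance, 1.5, "single"))
print(agglomerate(points, distance, 1.5, "complete"))
```

On this toy input, single linkage chains 0, 1 and 2 into one cluster because each neighbouring pair is close, whereas complete linkage keeps 2 apart because its distance to 0 exceeds the threshold.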

Algorithms for measuring synteny for homologous regions

Algorithms that measure conservation in synteny of two genomic regions can be broadly divided into two types [91].

Global synteny conservation algorithms measure synteny conservation based on the complete chromosomes, without any restriction on the number of hits in both regions. As long as such an algorithm is able to find a homolog within a specific predefined distance, or even on the whole length of the chromosome, it extends the homologous region on both genetic segments and looks for the next hit using this new pair as anchor genes. Typically, these algorithms employ the concept of gene teams [121], where a gene team consists of two chromosomal regions with closely placed homologs. An example of an algorithm based on global synteny conservation is the max-gap algorithm, which employs the gene-team concept and outputs regions containing anchors [92]. It uses a maximum length parameter that gives the maximum number of genes to search on the left or right of the current anchor gene, and iterates this process until no more homologs can be found for any anchor in both chromosomes.
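A simplified sketch of the gene-team idea is given below. It assumes one-to-one homolog anchors given as hypothetical (position on chromosome A, position on chromosome B) pairs, and it repeatedly splits a group of anchors wherever neighbouring positions on either chromosome are more than max_gap apart. This illustrates the max-gap criterion under these assumptions and is not the published algorithm.

```python
def split_on_gaps(anchors, axis, max_gap):
    """Split anchors into runs whose neighbouring positions on `axis` differ by at most max_gap."""
    ordered = sorted(anchors, key=lambda pair: pair[axis])
    runs, current = [], [ordered[0]]
    for previous, following in zip(ordered, ordered[1:]):
        if following[axis] - previous[axis] > max_gap:
            runs.append(current)              # gap too large: close the current run
            current = []
        current.append(following)
    runs.append(current)
    return runs


def gene_teams(anchors, max_gap):
    """Return groups of anchors that satisfy the max-gap criterion on both chromosomes."""
    teams = []
    stack = [list(anchors)] if anchors else []
    while stack:
        group = stack.pop()
        for axis in (0, 1):                   # 0: chromosome A, 1: chromosome B
            runs = split_on_gaps(group, axis, max_gap)
            if len(runs) > 1:
                stack.extend(runs)            # re-examine each part on both axes
                break
        else:
            teams.append(sorted(group))       # no split possible: a gene team
    return sorted(teams)


anchors = [(1, 10), (2, 11), (4, 12), (20, 13), (21, 30)]
print(gene_teams(anchors, max_gap=3))
```

Note that a team is not bounded in size: as long as every neighbouring pair of anchors stays within the gap on both chromosomes, the team can extend over the whole chromosome.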

Local synteny conservation algorithms measure synteny conservation within a fixed window around an anchor gene; all homologous hits within this region count towards the local measurement of synteny conservation for the region. An example algorithm that employs this approach is the r-window algorithm, which uses a window size parameter to count the number of homologous pairs within the window. For local synteny, the maximum number of hits is bounded by the size of the window regardless of the size of the chromosome, while for global synteny there is no limit on the size of the gene team, which can be as large as the chromosomes. Algorithms that measure synteny conservation to differentiate between orthologs and paralogs employ local synteny (see, e.g., for fungal and mammalian data [100, 180, 18]).
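The window-based measure can be sketched in a few lines. The representation of genes by their index along a chromosome and the function name are assumptions for illustration; the sketch simply counts the homologous pairs that fall within r genes of an anchor pair on both chromosomes.

```python
def window_synteny(anchor, homolog_pairs, r):
    """Count homologous pairs within r genes of the anchor on both chromosomes."""
    a0, b0 = anchor
    return sum(
        1
        for a, b in homolog_pairs
        if (a, b) != anchor and abs(a - a0) <= r and abs(b - b0) <= r
    )


pairs = [(10, 50), (9, 51), (12, 48), (14, 50), (11, 70)]
print(window_synteny((10, 50), pairs, r=3))
```

Because only genes inside the window are considered, the count is bounded by the window size and not by the length of the chromosome.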

3.2 Homology and gene family inference methods

Homologs and gene families can be inferred from each other. Given a gene family, all pairs of members of this family are homologs by transitivity. Given all homolog pairs, a specified clustering algorithm can be applied to infer gene families. Some methods first infer homologs (see, e.g., Neighborhood Correlation [186]) and then apply a clustering algorithm to infer gene families (see, e.g., [99]). Other methods infer gene families directly from similarity data, typically employing all-vs-all BLAST scores (see, e.g., GeneRage [54], TribeMCL [55] and SiLiX [134]); homologs can then be determined from each gene family using the transitivity property. The following sections review some methods used for gene family inference.

BlastClust, SiLiX and other similar approaches

Some gene family inference applications apply a threshold to BLAST bitscores, E-values, alignment lengths, or a combination of these parameters, and then apply a clustering algorithm on top of the homologs inferred from all-versus-all BLAST. Typical examples include BlastClust (a single linkage algorithm applied directly to BLAST results with a specific threshold) [65], SiLiX (a memory- and time-efficient implementation of BlastClust) [134] and ProtoMap (restrictive single linkage clustering using different levels of thresholds to yield an ordered grouping of all proteins) [219]. The main shortcomings of these approaches are the difficulty of estimating a universal threshold value, e.g., on alignment length, bitscore or E-value, for inferring homology, and the fact that they do not model domains.
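The general scheme of thresholding followed by single linkage clustering can be sketched as below. This is not the BlastClust or SiLiX implementation; the input tuple layout, the function name and the default thresholds are hypothetical and chosen only to show how the families depend on the threshold.

```python
def single_linkage_families(hits, max_evalue=1e-5, min_bitscore=50.0):
    """Group genes into families: connected components of the thresholded hit graph.

    hits: iterable of (gene_a, gene_b, evalue, bitscore) tuples (assumed layout).
    """
    graph = {}
    for a, b, evalue, bitscore in hits:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and evalue <= max_evalue and bitscore >= min_bitscore:
            graph[a].add(b)                   # keep only hits passing the thresholds
            graph[b].add(a)
    families, seen = [], set()
    for gene in sorted(graph):
        if gene in seen:
            continue
        seen.add(gene)
        stack, family = [gene], []
        while stack:                          # depth-first walk over one component
            current = stack.pop()
            family.append(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        families.append(sorted(family))
    return families


hits = [
    ("g1", "g2", 1e-30, 200.0),
    ("g2", "g3", 1e-10, 80.0),
    ("g3", "g4", 0.5, 20.0),
    ("g4", "g5", 1e-8, 60.0),
]
print(single_linkage_families(hits))
print(single_linkage_families(hits, max_evalue=1e-20))
```

Tightening the E-value threshold in the second call breaks the families apart, which illustrates why a single universal threshold is hard to choose.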

GeneRage

Multidomain gene families with diverse architectures are problematic to cluster using linkage algorithms directly on BLAST scores. A simple way to resolve this issue is to detect and correct missing or incorrect links in the graph-theoretical approach. Enright et al. [54] developed an algorithm, called GeneRage, with the ability to detect and correct such erroneous links. Hence, problems caused by the presence of multidomain proteins are minimized and more precise gene families are obtained.

Similarity relationships are stored in a binary matrix. The Smith-Waterman dynamic programming alignment algorithm is then applied to symmetrify the matrix and remove false positives. The authors used the simple homology transitivity criterion explained in Section 3.1 to detect multidomain proteins. Smith-Waterman is then performed in successive rounds, which detects protein families consisting of multiple domains and removes some of the incorrectly recognized similarity relationships within the symmetrical matrix. Single-linkage clustering is performed on the corrected matrix, and the initial larger clusters, containing multidomain families, are split using domain architecture information. Hence, this algorithm clusters large protein datasets into families and is particularly useful for detecting and eliminating fusion genes, i.e., genes that are the result of a fusion of two other genes. Figure 3.2 displays the flowchart of this algorithm.

Figure 3.2: Schematic representation of the GeneRage algorithm (adapted from Enright et al. [54]).

However, this algorithm is unable to deal with promiscuous domains¹, peptide fragments and proteins with complex domain structure, which are generally not present in smaller prokaryotic datasets but are widely present in eukaryotic datasets. A typical example is the ‘response regulator’ domain from two-component systems [190], which causes functionally different proteins (e.g., heat shock factors and phytochromes) [22] to be incorrectly grouped into the same family [218, 55].

¹ A protein domain that is found with many distinct domains in multiple functionally unrelated proteins [126].


Markov Clustering and TribeMCL

Enright et al. [55] applied another graph-theoretic approach in an algorithm called TribeMCL for clustering protein sequences into families. To overcome the aforementioned problems with multidomain proteins, TribeMCL follows the same graph-theoretic idea as GeneRage but replaces Smith-Waterman alignment, domain detection and single linkage clustering with a more elegant mathematical, probability-based approach. In simple terms, TribeMCL preprocesses all-versus-all BLAST results and then applies the Markov Clustering (MCL) algorithm [199]. The source code and executable program of TribeMCL are no longer available, but the MCL implementation, available from [200], can be used as an alternative to TribeMCL. The flowchart for TribeMCL is given in Figure 3.3.

Figure 3.3: A flowchart of the TribeMCL algorithm (adapted from Enright et al. [55]). Flowchart steps: input set of protein sequences; all-vs-all BLAST; parse results and symmetrify similarity scores (similarity matrix); normalize similarity scores (-log[E-value]) to generate transition probabilities (Markov matrix); core MCL algorithm, i.e., matrix squaring (expansion) and matrix inflation, terminated when no further change is observed in the matrix; interpret final matrix as a clustering; post-processing and domain correction; protein clusters (families).

MCL is an unsupervised clustering algorithm for graphs based on simulating flow in weighted graphs [199]. The main idea is to further strengthen the stronger links (based on the concept that stronger links are used more often during a random walk than weaker links) and to weaken and finally remove the weaker links. The data (E-values from all-versus-all BLAST) are represented as a Markovian matrix, i.e., the sum of all elements of a row is exactly one and each cell can only have non-negative values. The inflation parameter I determines the speed and granularity of clustering. In each iteration the matrix is first expanded, i.e., squared by ordinary matrix multiplication, and then inflated, i.e., each entry is raised to the power I and the Markovian matrix property is restored by normalization. The process of expansion and inflation continues until there are no more changes in the matrix, or the changes fall below a tolerance, at which point MCL is said to have converged.
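The core loop can be sketched in plain Python, following the order shown in Figure 3.3 (expansion by matrix squaring, then inflation). This is a didactic sketch and not the MCL program available from [200]: the row-stochastic convention, the added self-loops, the tolerance, the function names and the way clusters are read off the final matrix are all assumptions made for illustration. The input is assumed to be a non-empty symmetric matrix of non-negative similarity weights, for example -log10 of E-values.

```python
def normalize_rows(matrix):
    """Scale every row so that it sums to one (row-stochastic matrix)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total for value in row])
    return result


def expand(matrix):
    """Expansion: square the matrix by ordinary matrix multiplication."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(matrix, inflation):
    """Inflation: raise every entry to the power `inflation`, then renormalize."""
    return normalize_rows([[value ** inflation for value in row] for row in matrix])


def mcl(similarity, inflation=2.0, tol=1e-9, max_iter=100):
    """Cluster the nodes of a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Self-loops are an assumption here; they keep every row non-zero.
    matrix = normalize_rows(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        updated = inflate(expand(matrix), inflation)
        change = max(abs(updated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = updated
        if change < tol:
            break                             # converged within the tolerance
    # Rows whose remaining flow goes to the same columns form one cluster.
    clusters = {}
    for i, row in enumerate(matrix):
        support = tuple(j for j, value in enumerate(row) if value > 1e-6)
        clusters.setdefault(support, []).append(i)
    return sorted(clusters.values())


similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(similarity))
```

A larger inflation value sharpens the contrast between strong and weak links faster and generally yields more, smaller clusters.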

HiFiX

Miele et al. [133] developed HiFiX, an approach that aims to significantly decrease false positives and false negatives. HiFiX, like other gene family inference software, treats the gene family inference problem as a graph-theoretical problem in which sequences are represented by vertices and similarities by edges. The input to HiFiX consists of pre-families of sequences with good sensitivity, precompiled by SiLiX with relaxed threshold settings. HiFiX then takes advantage of the community¹ structure of this similarity network and maximizes the modularity² of these communities to divide each family into independent smaller families at weak links. HiFiX uses multiple sequence alignment, a community determination algorithm (Louvain [13]), a hierarchical algorithm (that merges sequences iteratively into meta-communities³) and alignment likelihood using profile-HMM models [51] for the evaluation of each meta-community.
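To make the notion of modularity concrete, the sketch below computes it for an unweighted graph, assuming the widely used Newman-Girvan form in which, for every community, the observed fraction of edges inside the community is compared with the fraction expected from the vertex degrees. Whether HiFiX uses exactly this form is not stated here, and the function name and toy graph are hypothetical. The graph is assumed to contain at least one edge.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping vertex -> community label.
    """
    m = len(edges)
    inside = {}       # number of edges with both ends in the community
    degree = {}       # sum of vertex degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two triangles joined by a single weak link (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {vertex: 0 for vertex in split}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the toy graph at the weak link gives a higher modularity than keeping all vertices in one community, which is the kind of division a modularity-maximizing method favours.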

Clustering algorithms applied on Neighborhood Correlation scores

While the aforementioned gene family inference methods use all-versus-all BLAST results as the similarity measure and apply a clustering algorithm to infer gene families, Joseph et al. [99] infer homologs first and then apply a clustering algorithm to infer gene families. BLAST scores are first transformed into Neighborhood Correlation (NC) scores, and the NC score is then used as the similarity measure between a gene pair [186]. A threshold is applied to NC scores to infer homologous gene pairs, and a clustering algorithm is applied to these pairs to infer gene families.

NC distinguishes between sequence pairs that have evolved from the same last common ancestor and those that share an inserted domain but are otherwise not related. In some ways, NC is similar to MCL. Other applications use BLAST results directly, whereas MCL and NC transform BLAST results into a standard score within a fixed range. The intended datasets for both methods are multidomain protein families. Both NC and MCL are empirical and depend on the sequences in the sequence database on which the all-versus-all BLAST is performed. Also, neither method uses or detects the underlying domains explicitly. However, MCL transforms E-values of an all-versus-all BLAST into probability values between 0 and 1, whereas NC transforms bitscores of an all-versus-all BLAST into correlation scores, also ranging between 0 and 1.

¹ A grouping of vertices into clusters such that vertices of the same cluster have many edges between them and relatively few edges to vertices belonging to other clusters [133].

² Modularity compares the number of existing edges within clusters with the expected number of edges for a random graph with the same degree distribution, where the degree of a vertex is the number of edges connected to the vertex [133].

³ A cluster formed by merging distinct communities containing homologous sequences belonging to the same protein family [133].

Figure 3.4: Distribution of unique and common hits in the neighborhood for two homologous genes (top) and two non-homologous genes (bottom). Differences in neighborhood structure in the depicted graph (denser in the middle for homologs and at the two edges for non-homologs) point to differences in the paths taken by the proteins during evolution [186].

NC calculates a correlation score for each pair of genes. Two ordered lists of BLAST bitscores are computed using the common and unique BLAST hits of both genes. Each list consists of the neighboring hits of a gene, where a neighboring hit is defined as a gene that has a BLAST score with this gene. A correlation score is then computed between the pair of genes using these lists as data; it reflects the difference in density between common and unique hits of both genes, as shown graphically in Figure 3.4. In a graph, a homolog pair ideally shows a dense common neighborhood and a comparatively sparse unique neighborhood for both genes, while a non-homolog pair tends to show a sparsely populated common neighborhood and a dense unique neighborhood.

A threshold can now be applied to NC scores to infer homologs. The authors [99] recommend a threshold of 0.5. Clustering analysis can also be performed using NC scores as input instead of BLAST scores, which has been shown to perform better on families with diverse domain architectures.
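The idea of correlating the bitscore neighborhoods of two genes and thresholding the result can be sketched as follows. The sketch assumes a Pearson-style correlation over bitscore profiles and a hypothetical floor score for genes without a BLAST hit; the exact normalization used in [186] is not reproduced here, and all names and toy values are illustrative. With this form, scores for pairs with disjoint neighborhoods can become negative.

```python
from math import sqrt


def nc_score(x, y, bitscores, reference, floor=0.0):
    """Correlation of the bitscore profiles of genes x and y over a reference set.

    bitscores: dict mapping (gene, reference_gene) -> bitscore;
    pairs without a BLAST hit receive `floor` (an assumption of this sketch).
    """
    sx = [bitscores.get((x, r), floor) for r in reference]
    sy = [bitscores.get((y, r), floor) for r in reference]
    n = len(reference)
    mean_x, mean_y = sum(sx) / n, sum(sy) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(sx, sy))
    var_x = sum((a - mean_x) ** 2 for a in sx)
    var_y = sum((b - mean_y) ** 2 for b in sy)
    if var_x == 0 or var_y == 0:
        return 0.0                            # flat profile: no evidence either way
    return cov / sqrt(var_x * var_y)


def infer_homologs(pairs, bitscores, reference, threshold=0.5):
    """Keep the gene pairs whose score reaches the threshold."""
    return [(x, y) for x, y in pairs
            if nc_score(x, y, bitscores, reference) >= threshold]


reference = ["r1", "r2", "r3", "r4"]
bitscores = {
    ("x", "r1"): 100.0, ("x", "r2"): 90.0,    # x and y hit the same neighbors
    ("y", "r1"): 95.0, ("y", "r2"): 85.0,
    ("z", "r3"): 100.0, ("z", "r4"): 90.0,    # z hits a different neighborhood
}
print(infer_homologs([("x", "y"), ("x", "z")], bitscores, reference))
```

The choice of reference set directly changes the profiles being correlated, so the same gene pair can score differently against different reference data.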

It is important to discuss the treatment of data in NC. Data are partitioned into two datasets, which are not necessarily disjoint. The query dataset Q contains genes for which we want to infer homology relationships and which we want to classify into gene families. The reference dataset R contains genes that provide evidence for (non-)homology of genes in the query dataset but for which homology inference is not required. The reference data R plays an important role in inferring homologs and gene families. If the reference data is, for example, rich in one particular promiscuous domain present in both non-homologous genes but lacks or has few cases of the second domain present in one of the two non-homologous genes, high NC scores will be observed. On the other hand, if the reference data does not contain cases of the promiscuous domain, then homology inference will be different.

ProClust, PhyRn, Profile HMM and other distant homology inference methods

Sometimes, the goal of a gene family inference algorithm is to infer gene families containing remote homologs, i.e., homologs that are members of highly divergent gene families. It is difficult to infer remote homologs with similarity-based techniques because the molecular sequences have diverged as far as the twilight zone (≤25% amino acid identity), which is known to be problematic for similarity-based techniques in general. Therefore, specific algorithms based on, e.g., iterative or transitive search, profile Hidden Markov Models and Position Specific Scoring Matrices (PSSM) [156, 12, 51, 88] have been developed for inferring gene families containing remote homologs.

Homology inference using sequence similarity and synteny

Researchers have used additional information along with sequence similarity to aid in homology inference because similarity alone is not completely synonymous with homology; examples of disagreement are distant homologs and genes related through convergent evolution.

As discussed before, traces of evolution are visible at both the gene and the genome level, which result in divergence in gene content and gene order. Homologous genes that are a result of regional duplications have a higher chance of retaining their neighbourhood conservation [145]. However, gene translocation, tandem duplication, de novo origination and gene loss events are responsible for divergence in gene order conservation, while the sequence similarity is not disturbed. It is, therefore, natural to use gene order conservation in conjunction with sequence similarity as a measure of homology. In particular, gene order conservation has been used to differentiate between orthologs and paralogs [100], although paralogs related by regional duplication events, e.g., whole genome duplication, cannot be differentiated by this approach. Divergence time between species, represented by a species tree, is another important heuristic that can aid in inferring homologous genes.

It is important to note the difference between homology inference and homologous or syntenic region inference. The main aim of homology inference is to infer pairs of genes that are homologous, while the main aim of inferring syntenic regions is to infer two regions in which some pairs of genes between the regions are homologous. For inferring homologous regions, homologs are either provided from the start between the two or more regions, or are inferred as an intermediate step.

SYNERGY

Wapinski et al. [204] developed software (called SYNERGY) that can optionally use synteny information along with sequence similarity in inferring homologs, gene families and phylogenetic trees to determine the origin and evolution of all genes in a collection of species. This results in a better classification of orthologs and paralogs than most similarity-only methods.

The input to SYNERGY is a collection of species, the protein-coding genes in each species and a phylogenetic species tree. Synteny information is a bonus that, if available, can also be used. The aim of SYNERGY is to divide the genes into unique sets that do not have a gene in common, where each set consists of exactly those genes that can be traced back to a common hypothetical single gene present at the root of the species tree (known as the last common ancestor of all species). By doing so, SYNERGY also resolves the evolutionary history of a gene family and produces a gene tree for each gene family.

SYNERGY traverses the species tree using post-order traversal. Orthogroups¹ are determined for the current node using the orthogroups and similarity relationships determined in the previous steps for the children of the current node. For leaves (i.e., extant species), similar genes are grouped to form the initial set of orthogroups (in this case, an orthogroup consists of paralogs only). For internal nodes including the root (i.e., for ancestral species), orthogroups present in the two children of the current node are grouped together if they share more similarity than a specified limit. The process ends when the root of the species tree (or the last common ancestor of all species) is reached. Similarity can be calculated from sequence similarity alone or by combining it with other available information, e.g., synteny. The sequence similarity and synteny scores are scaled, weighted and then combined to calculate a single rooting score between the two proteins. The sequence similarity is measured by first globally aligning the two proteins and then searching for the most likely distance, which explains the substitutions in each aligned position. The syntenic conservation is quantified by a syntenic similarity score, which is defined as the ratio of orthologous neighbors of both proteins (like the R-window method above).

¹ Group of genes related by a duplication or a speciation at or below the selected internal node representing the last common ancestor of both sub-trees [204].
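The order in which a post-order traversal visits the species tree can be sketched as follows. This is only an illustration of the traversal order, assuming a binary species tree written as nested tuples with hypothetical species names; it does not reproduce SYNERGY's orthogroup construction or its scoring.

```python
def post_order(node):
    """Yield the nodes of a binary species tree, children before parents."""
    if isinstance(node, str):          # a leaf, i.e. an extant species
        yield node
        return
    left, right = node
    yield from post_order(left)
    yield from post_order(right)
    yield node                         # an ancestral species

# Hypothetical species tree: ((Spec1, Spec2), (Spec3, Spec4)).
species_tree = (("Spec1", "Spec2"), ("Spec3", "Spec4"))
order = list(post_order(species_tree))
for node in order:
    print(node)
```

Every ancestral species is visited only after both of its children, so the orthogroups of the children are always available when an internal node is processed, and the root is reached last.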

SYNERGY does not require any homology or gene family information as input, unlike many other phylogenetic tree inference algorithms. It computes homology and the gene tree simultaneously using the species tree and orthogroups. However, the similarity criterion used by SYNERGY is ad hoc: appropriate weights for similarity, synteny and likelihood are impossible to assign for gene families with different divergence rates for genes and genomes [1].

Chapter 4

Phylogenetic Inference

This chapter provides an introduction to phylogenetic inference methods. The chapter starts with the biological motivation behind phylogenetic trees and briefly describes the computational techniques for various phylogenetic tree construction methods, with a special focus on Bayesian phylogenetic inference.

4.1 Fundamentals of Phylogeny

Phylogenetics infers and describes the evolutionary relationships among groups of organisms based on molecular data and/or morphological traits. Molecular phylogenetics emerged in the early 1960s and was pioneered by Zuckerkandl, Pauling, Fitch, and Margoliash [226, 63]. Following their lead, Kimura [101], Ohta [151], and King and Jukes [103] made landmark contributions to molecular biology, establishing the practice of using sequences for investigating evolution.

Gene trees and species tree evolution

Evolutionary relationships are often described as a tree. In a typical model of species evolution, a process of speciation and extinction applied to an initial ancestral species determines a species tree describing the evolution of present-day descendant species. A gene tree represents evolution for a gene family, where typically an ancestral gene evolves through mutations, insertions, deletions, duplications and losses. However, a tree is not always applicable for a species tree due to reticulate events such as hybridization of species. In the presence of these events, a network-like structure is a better representation of the species tree. Since we do not account for these events, we will consider trees and not networks for the remainder of this thesis.

The information available for phylogenetic inference based on sequences is typically DNA sequences (or RNA or protein derivatives) of gene families for the leaves of the tree. The underlying assumption is that evolutionarily closely related species have more similar molecular sequences than distantly related species, according to some model of evolution. A standard reference that details most phylogeny methodology is Joseph Felsenstein’s “Inferring Phylogenies” [59].

Reconciling gene tree and species tree evolution

Assuming that gene family evolution and species evolution are both tree-like, the two trees will typically show a high degree of concordance, and by overlaying them, a gene-species tree reconciliation is obtained [46]. Most parsimonious reconciliation [78] is the most well-known technique to reconcile a gene tree with a species tree under the model of duplication and loss. One example of gene-species tree reconciliation can be seen in Figure 4.1. There are many applications of gene tree and species tree reconciliation, but in this thesis we are more concerned with finding the evolutionary history (i.e., events and topology) of a gene tree, given the species tree and molecular data of a gene family. Some recent approaches to how gene trees are computed with and without using the species tree are shown in the next section.

[Figure 4.1 shows a gene tree with extant genes G11, G12, G21, G31, G41, G42 and G43 evolving inside a species tree with species Spec1 to Spec4. Legend: speciation event or species; gene duplication event; gene loss event; species tree; gene tree; extant gene.]

Figure 4.1: An illustration of a gene family that has evolved inside a species tree. The image also shows different evolutionary events that influence a gene tree.
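The idea behind a most parsimonious reconciliation can be sketched with the usual last-common-ancestor (LCA) mapping: each gene tree vertex is mapped to the LCA of the species of its children, and it is labelled a duplication when it maps to the same species node as one of its children. The sketch below assumes a binary gene tree written as nested tuples and a species tree given as a child-to-parent dictionary; all names are hypothetical and gene losses are not counted.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root, starting with the node."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(a, b, parent):
    """Last common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def reconcile(gene_tree, gene_species, parent, events):
    """Map a gene-tree vertex to a species node and record its event type."""
    if isinstance(gene_tree, str):                 # extant gene
        return gene_species[gene_tree]
    left, right = gene_tree
    map_left = reconcile(left, gene_species, parent, events)
    map_right = reconcile(right, gene_species, parent, events)
    here = lca(map_left, map_right, parent)
    kind = "duplication" if here in (map_left, map_right) else "speciation"
    events.append((gene_tree, here, kind))
    return here

# Hypothetical species tree ((A, B)AB, C)ABC and gene tree ((a1, b1), (a2, c1)).
species_parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
gene_species = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
gene_tree = (("a1", "b1"), ("a2", "c1"))

events = []
root_species = reconcile(gene_tree, gene_species, species_parent, events)
for vertex, species, kind in events:
    print(vertex, "->", species, kind)
```

In this toy example the root of the gene tree is a duplication, because it maps to the same species node as its right child.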

4.2 Computing Phylogenetic Trees – Traditional Approaches

The input to a phylogenetic analysis program is usually the molecular sequences aligned into a multiple sequence alignment (MSA) [157]. Each column of the MSA represents an evolutionarily homologous site and can have insertions, deletions and gaps. A substitution model states the stationary frequencies of nucleotides, amino acids or codons and their interchange rates, and is used to measure evolutionary distances between pairs of sequences. Each site in the MSA is treated independently of the others. Hence, the substitution model along with penalties for gaps is used to score each site (column), and the sites taken together represent the evolution of one sequence to another. Parameters for substitution models are either estimated during inference (e.g., the GTR model [197]) or taken from empirical data (e.g., the PAM [38], BLOSUM [89] and JTT [98, 75, 97] matrices).
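How a substitution model converts observed differences into an evolutionary distance can be illustrated with the Jukes-Cantor model, which is much simpler than the models named above (equal base frequencies and equal interchange rates). The sketch below is a swapped-in minimal example, not the GTR, PAM, BLOSUM or JTT models; columns containing a gap are simply skipped rather than penalized, and the sequences are made up.

```python
import math

def jukes_cantor_distance(seq_a, seq_b, gap="-"):
    """Expected substitutions per site between two aligned DNA sequences
    under the Jukes-Cantor model: d = -3/4 * ln(1 - 4p/3)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    p = sum(a != b for a, b in pairs) / len(pairs)   # proportion of differing sites
    if p >= 0.75:
        return math.inf                              # saturated: distance undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two rows of a hypothetical alignment: 10 ungapped columns, 1 difference.
d = jukes_cantor_distance("ACGTACGTAC-", "ACGTACGTAAG")
print(round(d, 4))
```

The corrected distance is slightly larger than the raw proportion of differences (0.1), because the model accounts for multiple substitutions at the same site.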

The branch lengths of a phylogenetic tree represent the distance between sequences, calculated by applying the substitution model. The topology of a phylogenetic tree reflects the relative closeness and order of divergence between sequences in the MSA.

There are many approaches designed to infer phylogeny, which are, in order from simple to complex (and in order of appearance), parsimony-based methods (e.g., Maximum Parsimony [63, 188]), distance-based methods (e.g., Neighbor Joining [175]), maximum likelihood methods (e.g., Felsenstein’s pruning algorithm [57] and Quartet Puzzling [192]), and Bayesian methods (e.g., Markov chain Monte Carlo [112]). The former approaches require less computation and time and are applicable to very large datasets. The latter approaches are more accurate and enable more realistic modeling but are computationally expensive. More details on these methodologies can be found elsewhere [59, 169, 216].

4.3 Bayesian phylogenetic inference

Bayesian inference relies on Bayes' theorem for obtaining the posterior probability Pr(θ|D) of underlying parameters θ, e.g., duplication rate, loss rate, etc., given data D. Parameters θ are treated as random variables, and the entire posterior landscape is investigated with respect to θ in Bayesian inference. The posterior probability can be calculated using Bayes' formula

\[ \Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \tag{4.1} \]

where Pr(D|θ) represents the likelihood, Pr(θ) is the prior and Pr(D) is a normalization constant.
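The normalization constant can be written out explicitly, which may help to see why it is rarely computed directly. For discrete parameters such as the tree topology, the integral is understood as a sum. The symbol θ′ for a second parameter value is introduced here only for illustration.

```latex
\[
\Pr(D) = \int \Pr(D \mid \theta)\,\Pr(\theta)\,d\theta ,
\qquad
\frac{\Pr(\theta' \mid D)}{\Pr(\theta \mid D)}
  = \frac{\Pr(D \mid \theta')\,\Pr(\theta')}{\Pr(D \mid \theta)\,\Pr(\theta)} .
\]
```

Because Pr(D) cancels in the ratio, two parameter values can be compared without evaluating the integral, which is the property that sampling-based methods rely on.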


Note that in Bayesian phylogenetic inference, the underlying parameters θ can be classified into real-valued parameters (also called continuous parameters in the literature) and discrete parameters. The real-valued parameters consist of all parameters with numeric values. The discrete parameters in Bayesian phylogenetic inference consist of the parameters associated with the topology of the tree. The branch length parameters can be real-valued (if numeric, e.g., in MrBayes [170]) or discrete (if discretized).

The most popular way of performing Bayesian phylogenetic inference is by applying the Markov chain Monte Carlo (MCMC) algorithm [215, 127, 115], which is a random walk algorithm that estimates the posterior distribution by sampling (e.g., by the Metropolis-Hastings sampling scheme [86] or Gibbs sampling [69]) from the parameter distribution [7]. The backbone of MCMC is to sample a value for each parameter (including real-valued and discrete parameters) from the parameter distribution using a sampling algorithm (typically Metropolis-Hastings), compute the likelihood of the observed data (MSA sequences and optionally other information such as a species tree) with these parameter settings, accept or reject the new state based on a comparison of its likelihood with that of the previous state, and continue this process until a specified number of iterations is reached or a certain stopping condition is met. Some characteristics of MCMC (e.g., mixing, convergence assessment and burnin estimation), here considered only for the Metropolis-Hastings sampling scheme, are important and are discussed in detail in Chapter 5.
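The propose, evaluate, accept-or-reject loop described above can be sketched for a single real-valued parameter. This is a minimal sketch under stated assumptions: the target is a stand-in (an unnormalised standard normal density rather than a phylogenetic posterior), the proposal is symmetric, and the function names, step size and thinning factor are hypothetical.

```python
import math
import random

def log_posterior(x):
    """Unnormalised log density of a standard normal; a stand-in target."""
    return -0.5 * x * x

def metropolis_hastings(log_post, start, n_iter, step=1.0, thin=10, seed=1):
    """Random-walk Metropolis-Hastings; returns the thinned trace and the
    acceptance ratio (accepted proposals / total proposals)."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    trace, accepted = [], 0
    for iteration in range(1, n_iter + 1):
        proposal = current + rng.uniform(-step, step)   # symmetric proposal
        proposal_lp = log_post(proposal)
        log_ratio = proposal_lp - current_lp
        # Accept with probability min(1, ratio of the two densities).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        if iteration % thin == 0:                       # thinning
            trace.append(current)
    return trace, accepted / n_iter

trace, acceptance_ratio = metropolis_hastings(log_posterior, start=5.0, n_iter=20000)
kept = trace[len(trace) // 10:]                         # discard 10% as burnin
posterior_mean = sum(kept) / len(kept)
print(len(trace), round(acceptance_ratio, 2), round(posterior_mean, 2))
```

The chain is deliberately started far from the high-density region, so the first retained samples belong to the burnin period and are discarded before the mean is estimated.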

Bayesian analysis is used to estimate the posterior distribution of real-valued parameters and of discrete parameters. The ability to search the real-valued parameter space and the discrete parameter space simultaneously enables the biologically realistic multivariate models needed to solve more complex problems. Dependence between successive samples is reduced by thinning the MCMC chain (sampling every nth iteration), and the sampled iterations are called an MCMC trace. The MCMC trace represents an estimate of the posterior distribution of each parameter. For the trace of a real-valued parameter, the mean, mode or confidence interval of the marginal distribution can be examined. For the trace of a discrete parameter, the posterior frequencies of tree topologies, the tree at the state with maximum likelihood, the maximum-a-posteriori tree, the majority rule consensus tree, or the split frequencies of terminal vertices can represent the result of the whole analysis.

MrBayes [170], BEAST [47], PrIME [1], JPrIME [183], BAMM [160] and PhyloBayes [113] are popular Bayesian software packages that are based on MCMC, estimate the posterior distribution using the Metropolis-Hastings sampling scheme and are widely applied for phylogenetic inference. The input to these programs is typically an MSA of the molecular sequences of each gene family along with a few software-specific requirements (e.g., a species tree in the case of PrIME and JPrIME). The output is the MCMC trace of the real-valued parameters and of the discrete parameters.

Chapter 5

Post-processing of traces from MCMC runs

The aim of this chapter is to discuss different tasks associated with post-processing of Markov chain Monte Carlo (MCMC) runs. This chapter briefly introduces important characteristics of MCMC, followed by possible uses of MCMC in Bayesian phylogeny inference, and concludes with a discussion of the currently available software for analysing MCMC traces.

5.1 MCMC Characteristics

MCMC trace

Basic assumptions for Bayesian inference using the Metropolis-Hastings sampling scheme are that samples can be obtained from the MCMC simulation with relative frequencies that agree with the sample frequencies obtained from the true distribution, and that each sample depends only on the previous state. A state in MCMC is a set of parameter values. MCMC initializes from a given state, samples new values for all parameters simultaneously in the neighbourhood of the current state, computes the likelihood of the new state, accepts or rejects the new state based upon the ratio of the likelihoods of the new state and the old state, and iterates until a specific condition, such as a number of iterations, has been met. Acceptance of a state means that the current parameter values are replaced by the proposed parameter values. Rejection of a state means that the proposed parameter values are discarded and the current state is retained in this iteration.

Rapid mixing in the chain is essential for exploration of the parameter posterior distribution and for its ability to converge quickly to the stationary distribution. Acceptance ratios for each parameter can be calculated from the accepted proposals and the total proposals made for that parameter during a complete run. High acceptance ratios of MCMC proposals are indicative of rapid mixing and that the posterior space has been well explored. On the other hand, low acceptance ratios are the first indication of poor mixing and a cause of non-convergence. The samples are thinned by a thinning factor so that the retained samples are approximately independent of each other and the subsequent statistics remain computationally feasible. The sampled iterations are called an MCMC trace. Table 5.1 shows what a typical MCMC trace for five parameters with a thinning factor of 100 looks like, and Figure 5.1 represents the graphical interpretation of an MCMC trace.

Iter       Post Dens     Subs Dens     DLR Dens    Dup Rate   Loss Rate
0          -2213.8539    -2169.0250    -44.8289    1.6578     1.6578
100        -2063.5709    -2014.2883    -49.2826    2.8984     1.1652
200        -1974.9265    -1930.6208    -44.3058    0.9655     1.5388
300        -1969.5395    -1925.7214    -43.8181    2.2248     1.6864
400        -1955.1124    -1910.5606    -44.5517    1.4999     2.3079
500        -1949.8377    -1904.5363    -45.3014    1.2243     1.7212
600        -1943.3235    -1902.6028    -40.7207    1.3488     1.7029
700        -1942.5775    -1903.1583    -39.4192    1.9042     2.7428
...        ...           ...           ...         ...        ...
999900     -1895.3154    -1938.9227    43.6073     1.7294     0.7909
1000000    -1901.6213    -1938.0307    36.4094     3.2748     3.1813

Table 5.1: A typical MCMC trace with five real-valued parameters and a thinning factor of 100 with 1000000 iterations. The tree parameter has been omitted due to lack of space.

Figure 5.1: Graphical illustration of the trace of an MCMC run for three parameters, which can be extended to more parameters with further dimensions in the graph [182]. The run intends to find the optimal state (the parameter values) that maximizes the objective function.

Convergence assessment and burnin estimation

Due to lack of prior knowledge, it is common to start an MCMC simulation at a random point (or at a point chosen by some heuristic) in parameter space. Usually, the initial choice is far from the high-density regions of the posterior distribution. If the chain is mixing well and has been allowed to run long enough, convergence is guaranteed for MCMC algorithms, i.e., the chain is then sampling from a region representing the stationary distribution; unfortunately, whether convergence has been reached is not decidable. The initial iterations are the burnin period of the chain and the remaining iterations are the stationary part of the chain.

Figure 5.2: A typical MCMC trace shown as a plot, where the sample number is shown on the X-axis and the parameter is plotted on the Y-axis. In this case the parameter is the log-likelihood value of the sample. The green bar marks the end of the burnin period in the chain, which was estimated visually; the trace appears to have converged for this parameter after this point.

One is often interested in estimating the mean and variance of the posterior distribution of some parameter. The chain is assumed to have converged if the mean and variance can be measured as close to specified values for the true distribution. Such convergence is termed “convergence of an ergodic estimate”. The burnin period of a trace slows down convergence of an ergodic estimate because it interferes with the mean and variance estimates, and it is recommended to remove burnin samples.


The simplest way to assess whether a chain has converged is through visual analysis of the trace, typically using one or more of trace plots, density plots and running mean plots alongside basic statistics. Trace plots also display the mixing of the chain. If a chain is mixing well and appears to be sampling from the same region for the better part of the chain, then it appears to have converged. Such assessments are simple and based on expert opinion [73]. However, they cannot be automated and lack objectivity. A typical MCMC trace is shown in Figure 5.2, where the green line marks the burnin estimate that was decided visually.

Another commonly employed heuristic for burnin estimation is to remove a fixed percentage (none, 10% or 25%) of samples from the chain and either take the rest as converged or assess it for convergence using other convergence diagnostics. Such heuristics are easy to perform and are automated, but they do not take the trace characteristics into account and miss non-convergent runs altogether.

An alternative way to assess convergence and estimate burnin is to apply statistical convergence diagnostics to the real-valued parameters, e.g., Geweke [70], autocorrelation plots, Gelman-Rubin [68], Raftery-Lewis [161], the Heidelberger and Welch stationarity test [87] and effective sample size (ESS) [114, 198], as shown by Sahlin [174] and Höhna [93]; these are objective, consistent and can be automated. A drawback of these convergence diagnostics is that they are based only on real-valued parameters and cannot be applied to discrete parameters (e.g., the tree parameter).
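As an example of one such diagnostic, the effective sample size can be estimated from the autocorrelation of a trace as N / (1 + 2 Σ ρ_k). The sketch below truncates the sum at the first non-positive autocorrelation, which is a simple heuristic assumed here for illustration; the packages cited above may use different estimators. The two synthetic traces are made up.

```python
import random

def effective_sample_size(samples):
    """ESS = N / (1 + 2 * sum of autocorrelations), with the sum truncated
    at the first non-positive autocorrelation (a simple heuristic)."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]
    var = sum(c * c for c in centred) / n
    if var == 0.0:
        return float(n)        # degenerate constant trace: nothing to estimate
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

rng = random.Random(42)
# Independent draws versus a strongly autocorrelated AR(1) series.
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]
correlated = [0.0]
for _ in range(1999):
    correlated.append(0.9 * correlated[-1] + rng.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
print(round(ess_independent), round(ess_correlated))
```

Both traces contain 2000 samples, but the autocorrelated one carries far less information about the mean, which is what a low ESS signals.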

Assessing convergence for Bayesian inference in phylogeny remains a difficult and non-standardized task [93]. Frequencies of splits [171] have been used to assess convergence for the tree parameter, and are implemented in MrBayes [170] and Are We There Yet (AWTY) [146]. MrBayes periodically checks the difference in variance for splits of trees on parallel chains. AWTY explores the splits and plots split-based statistics and a trace plot of the log-likelihood of trees. However, parallel chains are required for split frequency analysis, and comparing only the split frequencies and not the real-valued parameters can be misleading [93].
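A split-frequency comparison between two parallel chains can be sketched as an average standard deviation of split frequencies. The sketch assumes that each sampled tree has already been reduced to a set of split labels (the labels and runs below are made up), uses the sample standard deviation, and applies no minimum-frequency filter; the exact normalisation and filtering used by a given program may differ.

```python
from collections import Counter
from statistics import stdev

def split_frequencies(trees):
    """Fraction of sampled trees in which each split occurs."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: n / len(trees) for split, n in counts.items()}

def average_sd_of_split_frequencies(run_a, run_b):
    """Mean, over all splits seen in either run, of the standard deviation
    of that split's frequency across the two runs."""
    freq_a, freq_b = split_frequencies(run_a), split_frequencies(run_b)
    all_splits = set(freq_a) | set(freq_b)
    total = sum(stdev([freq_a.get(s, 0.0), freq_b.get(s, 0.0)]) for s in all_splits)
    return total / len(all_splits)

# Hypothetical runs on five taxa; each tree is its set of non-trivial splits.
run_1 = [{"AB|CDE", "ABC|DE"}, {"AB|CDE", "ABC|DE"},
         {"AB|CDE", "ABD|CE"}, {"AC|BDE", "ABC|DE"}]
run_2 = [{"AB|CDE", "ABC|DE"}] * 4

print(average_sd_of_split_frequencies(run_1, run_1))
print(average_sd_of_split_frequencies(run_1, run_2))
```

Values close to zero suggest that both chains support the same splits with similar frequencies; as the text notes, this says nothing about the real-valued parameters.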

5.2 Post-processing MCMC traces

Statistical analysis

MCMC runs are analysed statistically to infer important information such as the mean parameter values and the standard deviation observed during a run. If the true distribution from which the Markov chain is sampling is known, then the sufficient statistics of this distribution can be compared with the parameters of the estimated distribution to assess the accuracy of phylogenetic inference.

Trace analysis

As discussed in Section 5.1, visual inspection of a trace is a way of assessing convergence and can be used to view trends that might suggest problems with convergence. Apart from convergence assessment, it is also important in assessing the mixing of the chain. Ideally, if the Markov chain mixes properly, it should avoid getting stuck in local optima.

Joint marginal distribution analysis

Another significant feature recommended for analysis of Markov chains is to check dependence between variables. A joint-marginal distribution plot displays the dependency trends between two parameters and can be used to check the independence of both parameters.

Tree split histogram and analysis

The tree parameter is a special parameter, significantly different from real-valued parameters, and requires special attention and methods to analyse. An unrooted phylogenetic tree can be represented by a set of splits – a bipartition of the tree representing the smallest unit of information, where each unique non-trivial bipartition of the set of leaves of the phylogenetic tree represents one split. The set of splits obtained from all trees present in the posterior can be plotted as a histogram, where each bin corresponds to the fraction of all samples in which a split is encountered over the whole run. This histogram can also indicate the mixing of the tree parameter.
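The split representation and the split histogram can be sketched as follows. The sketch assumes that trees are given as nested tuples of taxon names (a stand-in for a parsed Newick string) and stores each split as the side of the bipartition that does not contain the alphabetically first taxon, so that differently rooted drawings of the same unrooted tree give the same set. The example trees are hypothetical.

```python
from collections import Counter

def splits(tree):
    """Return the non-trivial splits of a tree given as nested tuples."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(leaves(child) for child in node))
        clades.add(clade)
        return clade

    all_leaves = leaves(tree)
    reference = min(all_leaves)
    result = set()
    for clade in clades:
        # Canonical side: the one without the reference taxon.
        side = all_leaves - clade if reference in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:      # non-trivial only
            result.add(side)
    return result

def split_histogram(trees):
    """Fraction of sampled trees in which each split is encountered."""
    counts = Counter(s for tree in trees for s in splits(tree))
    return {s: n / len(trees) for s, n in counts.items()}

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = (("A", "B"), ("C", ("D", "E")))    # same unrooted tree, rooted elsewhere
t3 = ((("A", "C"), "B"), ("D", "E"))
histogram = split_histogram([t1, t2, t3])
for side, freq in sorted(histogram.items(), key=lambda kv: -kv[1]):
    print(sorted(side), round(freq, 2))
```

Trees `t1` and `t2` yield identical split sets, which is why splits rather than rooted clades are counted.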

Maximum-a-Posteriori tree topology

The “Maximum-a-Posteriori” tree topology (also called the MAP tree topology) describes the tree associated with the sampled state with the highest posterior probability in the MCMC chain or, simply, the most frequently sampled tree. The MAP tree topology provides a simple and informative measure to select a single topology from the whole set of trees. However, in some cases with large trees, the topology of every sampled tree may be unique, which makes it impossible to differentiate and choose between trees based on frequency. Also, if the number of converged samples is small, then the true MAP tree is hard to estimate.

Majority rule consensus tree and simple majority tree

The majority rule consensus tree is defined as a tree that contains all clades occurring in at least half of the trees in the posterior distribution [58]. A slightly more lenient rule is the simple majority tree (also called a fully resolved consensus tree), where the remaining clades are ordered and selected according to decreasing posterior probability. There is a constraint, however: each newly selected clade must be compatible with all previously selected clades. This method yields a single tree whose parts are agreed upon by the majority of trees in the posterior distribution. However, a perceived drawback of majority trees is that the method can lead to a tree with a topology that was not sampled during the run, as well as to a tree topology with relatively low probability, despite the fact that many features with very high probability will also be present.
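The selection step of the majority rule consensus can be sketched on trees that have already been reduced to sets of splits. The sketch assumes each split is stored as the set of taxa on the side containing taxon A (an arbitrary convention chosen here), and it keeps splits found in strictly more than half of the trees, since a strict majority guarantees that the kept splits are mutually compatible. The sampled trees are made up.

```python
from collections import Counter

def majority_rule_splits(trees, cutoff=0.5):
    """Return {split: frequency} for splits present in more than `cutoff`
    of the sampled trees."""
    counts = Counter(split for tree in trees for split in tree)
    n = len(trees)
    return {s: c / n for s, c in counts.items() if c / n > cutoff}

# Hypothetical posterior sample of four trees on taxa A-E.
AB, ABC, ABD, AC = (frozenset(x) for x in ("AB", "ABC", "ABD", "AC"))
sample = [{AB, ABC}, {AB, ABC}, {AB, ABD}, {AC, ABC}]

consensus = majority_rule_splits(sample)
for split, freq in consensus.items():
    print(sorted(split), freq)
```

Building the simple majority tree would additionally add the remaining splits in order of decreasing frequency, skipping any that conflict with those already chosen; that compatibility check is not shown here.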

Tree and tree space visualization

Gene trees are usually stored in the Newick (also called New Hampshire) or Nexus formats. While these formats are easy to store and systematic for a computer to parse, they are very hard for humans to read, in particular because one must keep track of the brackets to see where a subtree ends. On the other hand, a visual illustration of a tree, as in Figure 4.1, is very easy for humans to interpret. Tree visualization software (see, e.g., FigTree [162] or Forester [224]) converts the computer-readable formats into tree illustrations.

It is also interesting to see how the chain progresses in tree space over time. Some algorithms, for example the multidimensional scaling technique [108], can draw distances between trees in a two-dimensional space with minimum error. One can convert the tree topologies into a distance matrix using a good tree distance metric (e.g., the Robinson-Foulds metric and the Nodal distance metric [14]) and use this distance matrix as input to multidimensional scaling to view the distribution of trees in tree space. This information can aid in assessing whether the tree parameter has converged. It can also be employed to gain more knowledge about mixing for the tree parameter.
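The first step, turning sampled topologies into a distance matrix, can be sketched with the Robinson-Foulds metric, which counts the splits present in one tree but not in the other. The sketch assumes each tree is given as a set of splits, with each split stored as the set of taxa on the side containing taxon A; the three trees are hypothetical, and the multidimensional scaling step itself is not shown.

```python
def robinson_foulds(splits_a, splits_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits_a ^ splits_b)

def distance_matrix(trees):
    """Pairwise Robinson-Foulds distances between all sampled trees."""
    return [[robinson_foulds(a, b) for b in trees] for a in trees]

# Hypothetical trees on taxa A-E.
t1 = {frozenset("AB"), frozenset("ABC")}   # ((A,B),C,(D,E))
t2 = {frozenset("AC"), frozenset("ABC")}   # ((A,C),B,(D,E))
t3 = {frozenset("AD"), frozenset("ADE")}   # ((A,D),E,(B,C))

matrix = distance_matrix([t1, t2, t3])
for row in matrix:
    print(row)
```

The resulting symmetric matrix is the kind of input a multidimensional scaling routine expects.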

Post-processing of parallel MCMC traces

Some software (e.g., MrBayes) can run two chains in parallel and output both chains. Such software analyzes both chains simultaneously during the run to assess convergence, or uses some statistical measure (e.g., the difference in split frequencies) on the tree parameter of the parallel chains as a stopping criterion. One can also perform two separate runs on the same data, using the same or different starting points in parameter space, to minimize the possibility of getting stuck in local maxima. Both of these cases are based on the assumption that, on convergence, both chains must be sampling from the same distribution. To assess whether both chains have converged to the same posterior distribution, hypothesis testing can be used, where the null hypothesis is that both samples have been drawn from the same distribution, implying convergence. Chi-square tests for two samples [217] for the tree parameter and the Mann-Whitney U test [125] can be used for hypothesis testing. Other techniques such as Metropolis-coupled MCMC [131] can be used with multiple chains to improve mixing and exploration of tree space, and have been used in several Bayesian phylogenetic software packages (e.g., in MrBayes [4] and in BAMM [160]). Figure 5.3 displays the traces of two parallel runs simultaneously.
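The Mann-Whitney U test for one real-valued parameter sampled by two chains can be sketched as follows. This is a minimal version under stated assumptions: the two-sided p-value uses the normal approximation without tie or continuity corrections, which is only reasonable for large samples, and the tiny input lists are made up to keep the arithmetic easy to follow.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value from the normal
    approximation) for two samples of a real-valued parameter."""
    pooled = sorted((value, index) for index, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    k = 0
    while k < len(pooled):
        j = k
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[k][0]:
            j += 1
        average_rank = (k + j) / 2 + 1          # tied values share a rank
        for t in range(k, j + 1):
            ranks[pooled[t][1]] = average_rank
        k = j + 1
    n1, n2 = len(x), len(y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p

u_apart, p_apart = mann_whitney_u([1, 2, 3], [4, 5, 6])
u_mixed, p_mixed = mann_whitney_u([1, 3, 5], [2, 4, 6])
print(u_apart, round(p_apart, 3))
print(u_mixed, round(p_mixed, 3))
```

A small p-value argues against the null hypothesis that both chains sample the same distribution; in practice the test would be applied to the post-burnin samples of each chain.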


Figure 5.3: Traces of two parallel chains superimposed on each other, with a red line for the first run and a blue line for the second run. The green bar indicates the visually selected burnin value, after which both chains appear to sample from the same space.

5.3 Software packages for post-processing of MCMC traces

The following is a brief introduction to the major software packages that analyse MCMC output and are commonly used to deduce results from MCMC runs.

Coda

Coda [158] is a standard MCMC analysis and convergence diagnostic package written for the R language by Martyn Plummer, which is popular mainly due to its multiple and diverse convergence diagnostics, automatic estimation of burnin and the ability to run from the command line. It includes most standard convergence diagnostics, and can compute the effective sample size and perform the Raftery-Lewis diagnostic [161], the Geweke diagnostic [70], the Gelman-Rubin diagnostic [68] and the Heidelberger-Welch stationarity test [87]. It can remove burnin samples using one of these tests, and draw the trace of a run using GNU plot commands. It can also run an MCMC chain with a given statistical distribution, thinning factor and other MCMC parameters. However, Coda does not handle the tree parameter at all, and can only analyse and generate MCMC runs for real-valued parameters.

Are We There Yet

Are We There Yet (or AWTY) [146] is a Perl-based program that analyzes MCMC runs and is used mainly to predict whether a chain has converged by observing and performing convergence tests on the tree parameter. The input to AWTY consists of trace files containing phylogenetic trees that have been generated as output by other phylogenetic MCMC programs, and AWTY uses tree splits to assess convergence. It uses Coda [158] for some of the convergence diagnostics. The main contribution of AWTY is to help a user in performing the graphical exploration of tree-specific parameters. The split histogram, trace plots of log-likelihoods, parallel chain analysis, split frequency analysis and plotting of the symmetric tree distance are some of the functionality available in AWTY. It is a useful tool to analyse the tree parameter of MCMC traces. The developers and authors of AWTY recommend that it be used together with Coda to cover both real-valued and tree parameters.

Tracer

Tracer [163] is a software program used for analysis of the trace files generated by Bayesian MCMC runs. It can estimate selected statistics (e.g., mean, standard deviation and confidence intervals) for the selected parameter. A frequency distribution and a density plot for the Bayesian posterior of the selected parameter can also be drawn. It can generate trace plots and joint marginal plots for two parameters. Tracer can handle real-valued parameters only and is not intended for discrete parameters. The default burnin for Tracer is set at 10%, and it estimates ESS values for each parameter at the specified burnin. It uses a heuristic for assessing convergence in which it flags ESS values less than 100, but cautions that values less than 200 should be further analysed by, e.g., tree shape. However, the thresholds used by Tracer seem to be arbitrarily defined [93].

Chapter 6

Present Investigations

All papers included in this thesis target two key areas of computational biology: gene family inference and post-processing of Bayesian inference of phylogeny. Papers I and II involve method development of synteny-aware homology inference and gene family inference software and a comparison of this method with other popular gene family inference methods. Paper III focuses on the development of GUI-based software that is useful in Bayesian inference of phylogeny and displays the important characteristics of MCMC that are particularly helpful in assessing the behaviour of the MCMC chain. Paper IV uses the software designed in Paper III and explores the behaviour of the tree parameter in the chain. It also applies some popular convergence diagnostics and discusses a new convergence assessment approach that takes all parameters of the chain jointly. Paper V traces the evolutionary and phylogenetic history of the FERM domain of Kindlin with respect to the FERM domain of other FERM domain-containing proteins (FDCPs) and applies Bayesian phylogenetic inference, among other bioinformatics tools, to infer interesting observations about the evolution of the FERM domain.

Paper I: Quantitative synteny scoring improves homology inference and partitioning of gene families.

In this study, a novel homology inference method is proposed that takes into account local synteny and merges local synteny estimates with similarity scores to obtain evaluation scores that better reflect homology. The algorithm uses neighborhood correlation scores applied to all-versus-all BLAST bitscores as similarity scores, and computes neighborhood correlation scores applied to local synteny scores, calculated from the neighborhoods of two genes, to quantify synteny conservation between the two genes. The synteny and similarity scores are merged to form an evaluation score that expresses the confidence in a homologous pair. Using this merged score, gene families are computed, and the cluster quality of gene families obtained from the synteny-aware approach is shown to be better than that of families obtained only from Neighborhood Correlation scores, on simulated data and on a biological dataset with a known gold standard for gene families. A software implementation is available with the paper and is called GenFamClust.

Paper II: GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm.

This paper presents GenFamClust as an informed and reliable homology inference algorithm. Gene families obtained from GenFamClust are compared for accuracy, similarity, dependence and other characteristics with those obtained from other popular gene family inference software (hcluster_sg, Markov Clustering (MCL), HiFiX, SiLiX and clustering approaches) applied to BLAST and Neighborhood Correlation scores. Simulated datasets and a dataset consisting of many eukaryotic species are used for this comparison. Gene families obtained from GenFamClust for a Fungi dataset are then evaluated for accuracy against pillars from the Yeast Gene Order Browser (YGOB), which have been semi-manually curated using gene similarity and synteny. Agreement and disagreement between the pillars and the clusters obtained from GenFamClust are analysed. A few novel predictions for pillars, based on a better phylogenetic fit and equally supportive synteny, are made to show that GenFamClust is not only automated but can also predict gene families in a more informed and accurate manner than YGOB.

Paper III: VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces.

This study presents software, named Visual Markov chain Monte Carlo (VMCMC), that simplifies post-processing of MCMC traces by supporting the most commonly needed tasks: computing the maximum-a-posteriori or majority-rule consensus tree, assessing parameter mixing (for both tree topology and real-valued parameters), computing the tree posterior, estimating burnin automatically for individual parameters as well as for the complete chain, and visualizing traces of real-valued and tree topology parameters. VMCMC also computes statistical properties such as the mean and variance of each parameter, and can compare the same parameter across two parallel chains using the Mann-Whitney U test for real-valued parameters and the chi-square test for splits in the tree topology parameter. The burnin estimation methods employed in VMCMC are Gelman-Rubin (potential scale reduction factor), Geweke and effective sample size (ESS). VMCMC is specifically tailored for Bayesian inference of phylogeny and can be used to analyse MCMC chains generated by MrBayes, PrIMe, JPrIMe-DLRS, BEAST and any other software that produces tab-separated chain output. VMCMC can be used both as a GUI-based application, supporting interactive exploration, and as a convenient command-line tool suitable for automated pipelines. A software implementation of VMCMC is available online.
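To make the potential scale reduction factor concrete, the following minimal sketch computes the classic Gelman-Rubin statistic for one real-valued parameter sampled in several parallel chains. It is an illustration under stated assumptions and not VMCMC's implementation: the function name `psrf` and the toy chains are hypothetical, and all chains are assumed to have equal length with burnin already removed.

```python
from math import sqrt
from statistics import mean, variance


def psrf(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    m = len(chains)      # number of parallel chains
    n = len(chains[0])   # samples per chain
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    w = mean(variance(c) for c in chains)   # within-chain variance W
    if w == 0:
        raise ValueError("chains are constant; PSRF is undefined")
    b = n * variance(chain_means)           # between-chain variance B
    var_hat = (n - 1) / n * w + b / n       # pooled estimate of the posterior variance
    return sqrt(var_hat / w)


# Hypothetical toy chains: the first pair overlaps, the second pair does not.
mixed = psrf([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]])
stuck = psrf([[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]])
```

Values close to 1 suggest that the chains sample the same distribution, whereas values well above 1 (as for `stuck`) indicate that the chains have not yet converged to a common distribution.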

Paper IV: Burnin estimation and convergence assessment in Bayesian phylogenetic inference.

In this work, we present novel burnin estimation methods specifically applicable to MCMC traces from Bayesian phylogenetic inference, and make a case for the usefulness of the features implemented in VMCMC, presented in the previous paper. We analysed MCMC runs generated by JPrIMe using VMCMC. We use the last burnin estimation method for assessing convergence and estimating burnin, and apply it to effective sample size (ESS) and to convergence diagnostics such as Geweke and Gelman-Rubin. We also propose two new convergence diagnostics that estimate burnin using ESS from multiple parameters jointly. We quantify the effect of burnin on the posterior of the tree parameter, explore the correlation between burnin estimates from various convergence diagnostics, and investigate the relationship between gene family size and the burnin estimates from different convergence diagnostics. Using parallel chains to ascertain convergence of MCMC runs and to estimate burnin, we conclude that it is always advisable to use convergence diagnostics and that the burnin estimates of last burnin applied to ESS are the closest to those obtained from parallel chain analysis.
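As a rough illustration of how ESS can drive burnin estimation, the sketch below computes the ESS of a real-valued trace from its autocorrelations and then picks, among candidate burnin points, the one that leaves the largest ESS. This ESS-maximising rule is a generic heuristic chosen for illustration; it is not claimed to be the last burnin method of Paper IV, and the function names `ess` and `burnin_by_max_ess` as well as the toy trace are hypothetical.

```python
from statistics import mean


def autocorr(x, lag):
    """Sample autocorrelation of trace x at the given lag."""
    mu = mean(x)
    denom = sum((v - mu) ** 2 for v in x)
    if denom == 0:
        return 0.0
    num = sum((x[i] - mu) * (x[i + lag] - mu) for i in range(len(x) - lag))
    return num / denom


def ess(x):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag."""
    n = len(x)
    total = 0.0
    for lag in range(1, n):
        rho = autocorr(x, lag)
        if rho <= 0:
            break
        total += rho
    return n / (1 + 2 * total)


def burnin_by_max_ess(x, candidates):
    """Return the candidate burnin whose remaining samples have the largest ESS."""
    return max(candidates, key=lambda b: ess(x[b:]))


# Hypothetical trace: a short transient followed by well-mixed samples.
trace = [100.0, 90.0, 80.0, 70.0, 60.0] + [0.0, 1.0] * 20
chosen = burnin_by_max_ess(trace, [0, 5])
```

On this toy trace the transient inflates the autocorrelation of the full chain, so discarding the first five samples yields the larger ESS.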

Paper V: Tracing the evolution of FERM domain of Kindlins.

In this study, we wanted to infer a phylogeny using Bayesian phylogenetic inference on a biological dataset, use VMCMC to post-process the trace, and infer the evolutionary history of the FERM domain for a medium-sized biological dataset (consisting of FERM domains from 185 FDCP sequences from 14 proteins in 15 different species). This work represents the application of these and other Bioinformatics tools to study and explore the evolutionary history of the FERM domain of FDCPs. Phylogenetic analyses were performed using MEGA5, TimeTree, DLRS in JPrIMe and VMCMC. Other tools such as Evolutionary Trace Analysis (ETA), BLAST, the NCBI Conserved Domain Database (CDD), Hidden Markov Models (HMM) and Chimera were also deployed to identify functionally and structurally important and conserved residues in the FERM domain of Kindlins.

Chapter 7

Discussion & Conclusion

In 1995 the genome of the bacterium Haemophilus influenzae was sequenced, the first time the complete genome of a free-living organism had become known [64]. This was followed by the sequencing of the complete genome of Saccharomyces cerevisiae in 1996 [74]. However, the monumental moment that opened the floodgates for sequencing complete genomes was the sequencing of the complete genome of Homo sapiens in 2001 [111]. Since then, the number of fully sequenced genomes has been increasing at an amazing rate [165]. The availability of genome-scale data has brought new challenges to the Bioinformatics community, and in particular to the Phylogenetics community. One challenge is to use this information to infer more informed and accurate gene families as input to phylogeny inference algorithms in general. Second, while methods like Bayesian inference of phylogeny are increasingly used in the literature for inferring the tree posterior, the need for interactive and automated tools for analysing the output of such software is growing, particularly in anticipation of applying these methods at genome scale. In the papers of this thesis, we tackle both of these issues.

7.1 Discussion

Synteny-aware homology inference

In the first paper, we introduce a synteny-aware homology inference algorithm that uses local synteny information in conjunction with gene similarity information to infer homologs. Clustering algorithms such as single-linkage, average-linkage and complete-linkage can then be applied to these homologs to infer gene families. The idea is novel; at this point, to my knowledge, no other methods compute synteny-aware homologs except SYNERGY, whose implementation is not publicly available at the time of writing. The method is informed; it uses gene order conservation alongside gene content conservation to infer homologous gene pairs. The method is built for genome-scale data; it requires complete information about genes and their positions on chromosomes. The similarity component, represented by neighborhood correlation scores, is more robust than all-versus-all BLAST bitscores [186]. The synteny scoring scheme is simple yet robust and relevant to biological datasets; we verified this experimentally by choosing different boundaries and schemes, as shown in the supplementary data and as applied to a biological dataset. Homology inference and partitioning of gene families are significantly improved by using synteny quantitatively; the results provide strong evidence for this when compared with results from Neighborhood Correlation, the similarity component of GenFamClust. The major issue we faced was how to choose the value of each parameter for each module. While default values for all-versus-all BLAST and neighborhood correlation scores had already been determined by Joseph et al. [99], we used empirical data and tested all parameter values to determine the best ones, which are given in the supplementary data. A drawback of the GenFamClust algorithm is its empirical, data-dependent nature; it requires a good reference dataset that truly reflects the homologous and syntenic relationships between the queried gene pair. Another drawback is the computation time needed to infer synteny scores and synteny correlation scores, which hinders scaling to larger datasets. The software is available online and is called GenFamClust.
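The step from scored homolog pairs to gene families can be sketched with single-linkage clustering, which amounts to taking connected components of the graph whose edges are the pairs scoring at or above a threshold. The sketch below is a generic union-find illustration, not the GenFamClust code; the function name `single_linkage`, the gene identifiers, the scores and the threshold of 0.5 are all hypothetical.

```python
def single_linkage(pairs, threshold):
    """Link every gene pair scoring >= threshold and return the connected components."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps lookups short
            g = parent[g]
        return g

    for a, b, score in pairs:
        ra, rb = find(a), find(b)  # registers both genes even if they are not linked
        if score >= threshold and ra != rb:
            parent[ra] = rb

    families = {}
    for g in parent:
        families.setdefault(find(g), set()).add(g)
    return sorted(sorted(f) for f in families.values())


# Hypothetical evaluation scores for five genes.
scored_pairs = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2), ("d", "e", 0.7)]
families = single_linkage(scored_pairs, threshold=0.5)
```

Single linkage joins two families as soon as one pair between them passes the threshold, which is why the choice of threshold, and of average or complete linkage as alternatives, affects how readily families merge.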

In the second paper, we wanted to assess the properties of gene families inferred by applying clustering algorithms to homologs inferred by GenFamClust, and to motivate the use of GenFamClust for inferring homologs and gene families despite its higher computational time and its need for more information than most commonly available homology inference software. The summary of this work is that gene families inferred by GenFamClust are more informed and accurate than gene families inferred by most other popular software.

Gene family inference precedes phylogeny inference and is of utmost importance in determining the evolutionary history inferred from a phylogenetic tree. The availability and annotation of complete genomes should help improve the quality of gene families by providing syntenic context. I believe that GenFamClust provides the best opportunity, at this point, to incorporate gene similarity and gene synteny supported by empirical evidence. Annotation and sequencing of more genomes will help improve the accuracy of gene family inference in the future.

Analysis of Markov chain Monte Carlo runs

In the third paper, we wanted to analyse specifically the MCMC chains produced during Bayesian phylogenetic analysis, with software that can be used for post-processing MCMC traces on a genome-wide scale. Given a trace, we wanted to view interactively the effect of burnin on the posterior of both real-valued and tree topology parameters without switching between multiple programs for each task and for each value of burnin. The software currently used for analysing MCMC traces in Bayesian inference of phylogeny (namely Tracer, AWTY and CODA) performs these tasks for one type of parameter only, is not interactive and cannot be applied on a genome-wide scale. Besides, the goal of many phylogenetic studies is to find a single representative tree for the posterior after removing burnin. VMCMC is intended to fill these gaps. Another feature of VMCMC is a command-line version with complete functionality, so that thousands of runs can be analysed with a small script. I personally found VMCMC very useful for analysing MCMC traces and for calculating the maximum-a-posteriori tree and the consensus tree after removing burnin when tracing the evolutionary history of FERM domains.
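The effect of burnin on the tree posterior, and the maximum-a-posteriori tree derived from it, can be illustrated with a few lines of code. This is a minimal sketch and not VMCMC's implementation: it assumes each sampled topology has already been written as a canonical string (so that identical topologies compare equal), and the function names `tree_posterior` and `map_tree` and the toy samples are hypothetical.

```python
from collections import Counter


def tree_posterior(topologies, burnin):
    """Relative frequency of each topology after discarding the first `burnin` samples."""
    kept = topologies[burnin:]
    if not kept:
        raise ValueError("burnin discards every sample")
    counts = Counter(kept)
    return {tree: count / len(kept) for tree, count in counts.items()}


def map_tree(topologies, burnin):
    """Return (topology, posterior probability) of the most frequently sampled topology."""
    posterior = tree_posterior(topologies, burnin)
    return max(posterior.items(), key=lambda item: item[1])


# Hypothetical trace of canonical Newick strings.
samples = ["((A,B),C);"] * 2 + ["((A,C),B);"] * 6 + ["((A,B),C);"] * 2
without_burnin = map_tree(samples, burnin=0)
with_burnin = map_tree(samples, burnin=2)
```

In this toy trace the same topology is the MAP tree with and without burnin, but its estimated posterior probability changes, which is the kind of effect one would want to inspect for each candidate burnin value.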

In the fourth paper, we proposed and evaluated convergence diagnostics and burnin estimators that are specifically applicable to MCMC traces with multiple parameters (e.g., those used in Bayesian phylogenetic inference), and quantified the effect of different burnin estimators and convergence diagnostics on the tree topology parameter. We noted that burnin estimates from the different convergence diagnostics did not depend on the size of the gene family on which the chain was run, and that there was no correlation between burnin estimates from different convergence diagnostics. Chain convergence was verified using parallel chains and hypothesis tests such as the chi-square test and the Mann-Whitney U test, and we estimated possible burnin points for converged chains as well. We motivated the use of joint burnin estimators in Bayesian phylogenetic inference and in other cases where some parameters depend on one or more other parameters, so that a perturbation in one parameter causes the dependent parameters to “reset”, which causes problems in convergence to the stationary state. The differences between the tree posterior obtained after removing the burnin estimated by a convergence diagnostic and that obtained after removing the burnin estimated by parallel chain analysis were summarized quantitatively using the Kullback-Leibler divergence. We concluded that it was better to use a convergence diagnostic than a fixed 25% burnin estimate, and that burnin estimates from last-burnin estimation on ESS were the closest to the estimates from parallel chain analysis among the convergence diagnostics employed in this study.
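The comparison of two tree posteriors by Kullback-Leibler divergence can be sketched as follows. The code is an illustration only: each posterior is assumed to be a dictionary mapping a topology identifier to its probability, and the small constant `eps` used to smooth topologies missing from the second posterior is an assumption made here to keep the divergence finite, not a detail taken from Paper IV. The function name `kl_divergence` and the toy posteriors are hypothetical.

```python
from math import log


def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) over tree topologies; eps (an assumption) smooths topologies absent from Q."""
    support = set(p) | set(q)
    q_smoothed = {t: q.get(t, 0.0) + eps for t in support}
    z = sum(q_smoothed.values())  # renormalise Q after smoothing
    return sum(pt * log(pt / (q_smoothed[t] / z)) for t, pt in p.items() if pt > 0)


# Hypothetical posteriors: one from a reference burnin, one from a diagnostic's burnin.
reference = {"t1": 0.5, "t2": 0.5}
estimate = {"t1": 0.9, "t2": 0.1}
same = kl_divergence(reference, reference)
different = kl_divergence(reference, estimate)
```

The divergence is zero when the two posteriors agree and grows as they diverge; note that it is asymmetric, so the choice of which posterior acts as the reference matters.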

7.2 Future perspectives

This thesis is dedicated to the topics of homology inference and analysis of chains from Markov chain Monte Carlo in Bayesian phylogenetic inference. I believe that there is room for improvement in both of these topics and in the works presented here.

Starting with synteny-aware homology inference, the role of reference or empirical data has not been explored completely in GenFamClust. This could be explored in future experiments with GenFamClust. Incorporating the species tree intelligently into BLAST scores, Neighborhood Correlation scores, synteny scores and/or synteny correlation scores is another avenue that looks interesting to explore.

VMCMC can be improved with a few interesting features. There are some obvious features from other software such as Tracer [163] and AWTY [146] that could be duplicated in VMCMC to make it more useful. In particular, the ability to plot marginal distributions and correlations among different parameters could be interesting additions to VMCMC. Direct extraction of samples with a particular tree topology could be another desired feature. Plots of split frequencies (as produced by AWTY) could also be added to VMCMC. More interactivity could be added to VMCMC, particularly when analysing the distances between tree topologies.

Looking back at the FERM domain analysis now, I would have identified FDCP homologs with syntenic support using GenFamClust, which would allow more than one homolog per protein per species (e.g., two or more representatives for recently duplicated paralogs) and would make the analysis more informed and complete. It would then represent a perfect biological use case for applying the whole pipeline described in this thesis.

Bibliography

[1] Örjan Åkerborg, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences, 106(14):5714–5719, 2009.

[2] Bruce Alberts. Molecular Biology of the Cell: Reference Edition. Garland Science, 2008. ISBN 9780815341116.

[3] Jonas S. Almeida and Susana Vinga. Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics, 3(1):6, 2002.

[4] Gautam Altekar, Sandhya Dwarkadas, John P. Huelsenbeck, and Fredrik Ronquist. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415, 2004.

[5] Adrian M. Altenhoff, Romain A. Studer, Marc Robinson-Rechavi, and Christophe Dessimoz. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Computational Biology, 8(5):e1002514–e1002514, 2012.

[6] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.

[7] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

[8] Eric Bapteste, Philippe Lopez, Frédéric Bouchard, et al. Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proceedings of the National Academy of Sciences, 109(45):18266–18272, 2012.

[9] Malay K. Basu, Liran Carmel, Igor B. Rogozin, and Eugene V. Koonin. Evolution of protein domain promiscuity in eukaryotes. Genome Research, 18(3):449–461, 2008.

[10] Philip N. Benfey, Philip Benfey, and Alex D. Protopapas. Essentials of Genomics. Prentice-Hall, 2005. ISBN 9780130470188.

[11] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data. Springer, 2006.

[12] Gaurav Bhardwaj, Kyung D. Ko, Yoojin Hong, et al. PHYRN: a robust method for phylogenetic analysis of highly divergent sequences. PloS One, 7(4):e34261, 2012.

[13] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[14] John Bluis and Dong-Guk Shin. Nodal distance algorithm: calculating a phylogenetic tree comparison metric. In Bioinformatics and Bioengineering, 2003. Proceedings. Third IEEE Symposium on, pages 87–94. IEEE, 2003.

[15] Ingo Brigandt. Homology in comparative, molecular, and evolutionary developmental biology: the radiation of a concept. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution, 299(1):9–17, 2003.

[16] Terence A. Brown. Genomes 3. Garland Science, 2007. ISBN 9780815341383.

[17] Marija Buljan and Alex Bateman. The evolution of protein domain families. Biochemical Society Transactions, 37(4):751, 2009.

[18] Kevin P. Byrne and Kenneth H. Wolfe. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research, 15(10):1456–1461, 2005.

[19] Jing Cai, Ruoping Zhao, Huifeng Jiang, and Wen Wang. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics, 179(1):487–496, 2008.

[20] Torbjorn Caspersson and Jack Schultz. Pentose nucleotides in the cytoplasm of growing tissues. Nature, 143(3623):602–3, 1939.

[21] Todd A. Castoe, A.P. Jason de Koning, Hyun-Min Kim, et al. Evidence for an ancient adaptive episode of convergent molecular evolution. Proceedings of the National Academy of Sciences, 106(22):8986–8991, 2009.

[22] C. Chang and E. M. Meyerowitz. Eukaryotes have “two-component” signal transducers. Research in Microbiology, 145(5):481–486, 1994.

[23] Xiaoshu Chen and Jianzhi Zhang. Correction: The ortholog conjecture is untestable by the current gene ontology but is supported by RNA sequencing data. PLoS Computational Biology, 9(1), 2013.

[24] Asif T. Chinwalla, Lisa L. Cook, Kimberly D. Delehaunty, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 2002.

[25] Athar H. Chishti, A. C. Kim, S. M. Marfatia, et al. The FERM domain: a unique module involved in the linkage of cytoplasmic proteins to the membrane. Trends in Biochemical Sciences, 23(8):281–282, Aug 1998.

[26] Suzanne Clancy. Genetic mutation. Nature Education, 1(1):187, 2008.

[27] Suzanne Clancy. RNA splicing: introns, exons and spliceosome. Nature Education, 1(1):31, 2008.

[28] Bernard Conrad and Stylianos E. Antonarakis. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annual Reviews in Genomics and Human Genetics, 8:17–35, 2007.

[29] Loredana Lo Conte, Bart Ailey, Tim J.P. Hubbard, et al. SCOP: a structural classification of proteins database. Nucleic Acids Research, 28(1):257–259, 2000.

[30] Richard Cordaux and Mark A. Batzer. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics, 10(10):691–703, 2009.

[31] Francis H. Crick. On protein synthesis. Symposia of the Society for Experimental Biology, 12:139–163, 1956.

[32] Francis H. Crick. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970.

[33] Liying Cui, P Kerr Wall, James H. Leebens-Mack, et al. Widespread genome duplications throughout the history of flowering plants. Genome Research, 16(6):738–749, 2006.

[34] Ralf Dahm. Friedrich Miescher and the discovery of DNA. Developmental Biology, 278(2):274–288, 2005.

[35] Ralf Dahm. Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human Genetics, 122(6):565–581, Jan 2008.

[36] Thomas Dandekar, Berend Snel, Martijn Huynen, and Peer Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324–328, 1998.

[37] Charles Darwin. On the Origin of Species by Means of Natural Selection Or the Preservation of Favored Races in the Struggle for Life. General Books LLC, 2009. ISBN 9781150016707.

[38] Margaret O. Dayhoff and Robert M. Schwartz. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, 1978.

[39] Daniel Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977.

[40] Paramvir Dehal and Jeffrey L. Boore. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biology, 3(10):1700, 2005.

[41] Luis Delaye, Alexander DeLuna, Antonio Lazcano, and Arturo Becerra. The origin of a novel gene through overprinting in Escherichia coli. BMC Evolutionary Biology, 8(1):31, 2008.

[42] Jeffery P. Demuth, Tijl De Bie, Jason E. Stajich, et al. The evolution of mammalian gene families. PloS One, 1(1):e85, 2006.

[43] Cheng Deng, C-H Christina Cheng, Hua Ye, et al. Evolution of an antifreeze protein by neofunctionalization under escape from adaptive conflict. Proceedings of the National Academy of Sciences, 107(50):21593–21598, 2010.

[44] Yun Ding, Qi Zhou, and Wen Wang. Origins of new genes and evolution of their novel functions. Annual Review of Ecology, Evolution, and Systematics, 43:345–363, 2012.

[45] Russell F. Doolittle. Convergent evolution: the need to be explicit. Trends in Biochemical Sciences, 19(1):15–18, 1994.

[46] Jean-Philippe Doyon, Vincent Ranwez, Vincent Daubin, and Vincent Berry. Models, algorithms and programs for phylogeny reconciliation. Briefings in Bioinformatics, 12(5):392–400, 2011.

[47] Alexei J. Drummond and Andrew Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007.

[48] Richard C. Dubes and Anil K. Jain. Clustering methodologies in exploratory data analysis. Advances in Computers, 19(11), 1980.

[49] Bernard Dujon, David Sherman, Gilles Fischer, et al. Genome evolution in yeasts. Nature, 430(6995):35–44, 2004.

[50] Jill M. Dunty, Veronica Gabarra-Niecko, Michelle L. King, et al. FERM domain interaction promotes FAK signaling. Molecular and Cellular Biology, 24(12):5353–5368, 2004.

[51] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

[52] Diana Ekman and Arne Elofsson. Identifying and quantifying orphan protein sequences in fungi. Journal of Molecular Biology, 396(2):396–405, 2010.

[53] Sabrina Ellenberger, Stefan Schuster, and Johannes Wöstemeyer. Correlation between sequence, structure and function for trisporoid processing proteins in the model zygomycete Mucor mucedo. Journal of Theoretical Biology, 320:66–75, 2013.

[54] Anton J. Enright and Christos A. Ouzounis. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5):451–457, 2000.

[55] Anton J. Enright, Stijn van Dongen, and Christos A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584, 2002.

[56] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231. AAAI Press, 1996.

[57] Joseph Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981.

[58] Joseph Felsenstein. Confidence limits on phylogenies: an approach using the bootstrap. Evolution, pages 783–791, 1985.

[59] Joseph Felsenstein. Inferring Phylogenies. Macmillan Education, 2004. ISBN 9780878931774.

[60] Ronald A. Fisher. The possible modification of the response of the wild type to recurrent mutations. American Naturalist, pages 115–126, 1928.

[61] Walter M. Fitch. Distinguishing homologous from analogous proteins. Systematic Biology, 19(2):99–113, 1970.

[62] Walter M. Fitch. Homology: a personal view on some of the problems. Trends in Genetics, 16(5):227–231, May 2000.

[63] Walter M. Fitch and Emanuel Margoliash. Construction of phylogenetic trees. Science, 155(3760):279–284, 1967.

[64] Robert D. Fleischmann, Mark D. Adams, Owen White, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5223):496–512, 1995.

[65] National Center for Biotechnology Information. Blastclust, 2015. URL ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html.

[66] Allan Force, Michael Lynch, F. Bryan Pickett, et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–1545, 1999.

[67] Jianchao Gao, Ammad A. Khan, Takashi Shimokawa, et al. A feedback regulation between Kindlin-2 and GLI1 in prostate cancer cells. FEBS Letters, 587(6):631–638, Mar 2013.

[68] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, pages 457–472, 1992.

[69] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(6):721–741, 1984.

[70] John Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, volume 196. Federal Reserve Bank of Minneapolis, Research Department Minneapolis, MN, USA, 1991.

[71] Greg Gibson and Spencer V. Muse. A Primer of Genome Science. Sinauer Associates, 2009. ISBN 9780878932368.

[72] Walter Gilbert. Why genes in pieces? Nature, 271(5645):501, Feb 1978.

[73] Walter R. Gilks, Sylvia Richardson, and David J. Spiegelhalter. Introducing Markov chain Monte Carlo. In Markov chain Monte Carlo in practice. London: Chapman and Hall, 1996.

[74] André Goffeau, Bart G. Barrell, Howard Bussey, et al. Life with 6000 genes. Science, 274(5287):546–567, 1996.

[75] Gaston H. Gonnet, Mark A. Cohen, and Steven A. Benner. Exhaustive matching of the entire protein sequence database. Science, 256(5062):1443–1445, 1992.

[76] Josefa González, José María Ranz, and Alfredo Ruiz. Chromosomal elements evolve at different rates in the Drosophila genome. Genetics, 161(3):1137–1154, 2002.

[77] Mileidy W. Gonzalez and William R. Pearson. Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Research, 38(7):2177–2189, 2010.

[78] Morris Goodman, John Czelusniak, G. William Moore, A.E. Romero-Herrera, and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, pages 132–163, 1979.

[79] Julian Gough. Convergent evolution of domain architectures (is rare). Bioinformatics, 21(8):1464–1471, 2005.

[80] John C. Gower and G.J.S. Ross. Minimum spanning trees and single linkage cluster analysis. Applied Statistics, pages 54–64, 1969.

[81] Leanne S. Haggerty, Pierre-Alain Jachiet, William P. Hanage, et al. A pluralistic account of homology: adapting the models to the data. Molecular Biology and Evolution, 31(3):501–516, 2013.

[82] Keisuke Hamada, Toshiyuki Shimizu, Takeshi Matsui, et al. Structural basis of the membrane-targeting and unmasking mechanisms of the radixin FERM domain. The EMBO Journal, 19(17):4449–4462, 2000.

[83] Richard W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.

[84] Bong-Gyoon Han, Wataru Nunomura, Yuichi Takakuwa, et al. Protein 4.1R core domain structure and insights into regulation of cytoskeletal organization. Nature Structural & Molecular Biology, 7(10):871–875, 2000.

[85] Cristina Has, Daniele Castiglia, Marcela del Rio, et al. Kindler syndrome: extension of FERMT1 mutational spectrum and natural history. Human Mutation, 32(11):1204–1212, 2011.

[86] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[87] Philip Heidelberger and Peter D. Welch. A spectral method for confidence interval generation and run length control in simulations. Communications of the ACM, 24(4):233–245, 1981.

[88] Jorja G. Henikoff and Steven Henikoff. Using substitution probabilities to improve position-specific scoring matrices. Computer Applications in the Biosciences, 12(2):135–143, 1996.

[89] Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992.

[90] Alfred D. Hershey and Martha Chase. Independent functions of viral protein and nucleic acid in growth of bacteriophage. The Journal of General Physiology, 36(1):39–56, 1952.

[91] Rose Hoberman and Dannie Durand. The incompatible desiderata of gene cluster properties. In Comparative Genomics. Springer, 2005.

[92] Rose Hoberman, David Sankoff, and Dannie Durand. The statistical significance of max-gap clusters. In Comparative Genomics. Springer, 2005. ISBN 9783540322900.

[93] Sebastian Höhna. Burnin estimation and convergence assessment. Chapter in Licentiate thesis Bayesian Phylogenetic Inference, 2011.

[94] Thomas H. Huxley. The origin of species. CreateSpace Independent Publishing Platform, 2015. ISBN 9781512324358.

[95] Olivier Jaillon, Jean-Marc Aury, Frédéric Brunet, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature, 431(7011):946–957, 2004.

[96] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

[97] D. T. Jones, W. R. Taylor, and J. M. Thornton. A mutation data matrix for transmembrane proteins. FEBS Letters, 339(3):269–275, 1994.

[98] David T. Jones, William R. Taylor, and Janet M. Thornton. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences, 8(3):275–282, 1992.

[99] Jacob M. Joseph and Dannie Durand. Family classification without domain chaining. Bioinformatics, 25(12):i45–i53, 2009.

[100] Jin Jun, Ion I. Mandoiu, and Craig E. Nelson. Identification of mammalian orthologs using local synteny. BMC Genomics, 10(1):630, 2009.

[101] Motoo Kimura. Evolutionary rate at the molecular level. Nature, 217(5129):624–626, 1968.

[102] Theresa Kindler. Congenital poikiloderma with traumatic bulla formation and progressive cutaneous atrophy. British Journal of Dermatology, 66(3):104–111, 1954.

[103] Jack L. King and Thomas H. Jukes. Non-Darwinian evolution. Science, 164(3881):788–798, 1969.

[104] Christopher M.O.O. Kleifeld. Validating matrix metalloproteinases as drug targets and anti-targets for cancer therapy. Nature Reviews Cancer, 6:227, 2006.

[105] Thorsten Kloesges, Ovidiu Popa, William Martin, and Tal Dagan. Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths. Molecular Biology and Evolution, 28(2):1057–1074, 2011.

[106] David G. Knowles and Aoife McLysaght. Recent de novo origin of human protein-coding genes. Genome Research, 19(10):1752–1759, 2009.

[107] Eugene V. Koonin. Orthologs, paralogs, and evolutionary genomics 1. Annual Review of Genetics, 39:309–338, 2005.

[108] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[109] Cold Spring Harbor Laboratory. DNA from the beginning, 2015. URL http://www.dnaftb.org/.

[110] J. E. Lai-Cheong, M. Parsons, and John A. McGrath. The role of kindlins in cell biology and relevance to human disease. The International Journal of Biochemistry & Cell Biology, 42(5):595–603, 2010.

[111] Eric S. Lander, Lauren M. Linton, Bruce Birren, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

[112] Bret Larget and Donald L. Simon. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16:750–759, 1999.

[113] Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25(17):2286–2288, 2009.

[114] John A. Laurmann and W. Lawrence Gates. Statistical considerations in the evaluation of climatic experiments with atmospheric general circulation models. Journal of the Atmospheric Sciences, 34(8):1187–1199, 1977.

[115] Shuying Li, Dennis K. Pearl, and Hani Doss. Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association, 95(450):493–508, 2000.

[116] Wen-Hsiung Li, Zhenglong Gu, Haidong Wang, and Anton Nekrutenko. Evolutionary analyses of the human genome. Nature, 409(6822):847–849, 2001.

[117] Yoseph Linde, Andres Buzo, and Robert M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, 1980.

[118] David J. Lipman and William R. Pearson. Rapid and sensitive protein similarity searches. Science, 227(4693):1435–1441, 1985.

[119] Harvey Lodish, Arnold Berk, Chris A. Kaiser, et al. Molecular Cell Biology: Seventh edition. W. H. Freeman and Company, New York, 2013. ISBN 9781429234139.

[120] Manyuan Long, Esther Betrán, Kevin Thornton, and Wen Wang. The origin of new genes: glimpses from the young and old. Nature Reviews Genetics, 4(11):865–875, 2003.

[121] Nicolas Luc, Jean-Loup Risler, Anne Bergeron, and Mathieu Raffinot. Gene teams: a new formalization of gene clusters for comparative genomics. Computational Biology and Chemistry, 27(1):59–67, 2003.

[122] Oswald T. Avery, Colin M. MacLeod, and Maclyn McCarty. Studies of the chemical nature of the substance inducing transformation of pneumococcal types. Induction of transformation by a deoxyribonucleic acid fraction isolated from pneumococcus type III. Journal of Experimental Medicine, 79:137–158, 1944.

[123] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[124] Khalid Mahmood, Geoffrey I. Webb, Jiangning Song, et al. Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Research, 40(6):e44–e44, 2012.

[125] Henry B. Mann and Donald R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.

[126] Edward M. Marcotte, Matteo Pellegrini, Ho-Leung Ng, et al. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753, 1999.

[127] Bob Mau, Michael A. Newton, and Bret Larget. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics, 55(1):1–12, 1999.

[128] James O. McInerney, Davide Pisani, Eric Bapteste, and Mary J. O’Connell. The public goods hypothesis for the evolution of life on earth. Biology Direct, 6:41, 2011.

[129] Gregor Mendel. Versuche über Pflanzenhybriden. Verhandlungen des Naturforschenden Vereines in Brünn, 4:3–47, 1866.

[130] Gregor Mendel. Experiments in plant hybridisation. Cosimo Classics, New York, 2008. ISBN 9781605202570.

[131] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, et al. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[132] Julien Meunier, Frédéric Lemoine, Magali Soumillon, et al. Birth and expression evolution of mammalian microRNA genes. Genome Research, 23(1):34–45, 2013.

[133] Vincent Miele, Simon Penel, Vincent Daubin, et al. High-quality sequence clustering guided by network topology and multiple alignment likelihood. Bioinformatics, 28(8):1078–1085, 2012.

[134] Vincent Miele, Simon Penel, and Laurent Duret. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics, 12(1):116, 2011.

[135] Glenn W. Milligan and Martha C. Cooper. Methodology review: Clustering methods. Applied Psychological Measurement, 11(4):329–354, 1987.

[136] Boris Mirkin. Mathematical classification and clustering: From how to what and why. Springer, 1998.

[137] Eloi Montanez, Siegfried Ussar, Martina Schifferer, et al. Kindlin-2 controls bidirectional signaling of integrins. Genes Dev, 22(10):1325–1330, May 2008.

[138] Gabriel Moreno-Hagelsieb, Victor Treviño, Ernesto Pérez-Rueda, et al. Transcription unit conservation in the three domains of life: a perspective from Escherichia coli. Trends in Genetics, 17(4):175–177, 2001.

[139] Gregory J. Morgan. Emile Zuckerkandl, Linus Pauling, and the molecular evolutionary clock, 1959-1965. Journal of the History of Biology, 31(2):155–178, 1998.

[140] Thomas H. Morgan. The Theory of the Gene. BiblioLife, 2015. ISBN 9781297835674.

[141] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[142] Nathan L. Nehrt, Wyatt T. Clark, Predrag Radivojac, and Matthew W. Hahn. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Computational Biology, 7(6):e1002073–e1002073, 2011.

[143] Masatoshi Nei and Masafumi Nozawa. Roles of mutation and selection in speciation: from Hugo de Vries to the modern genomic era. Genome Biology and Evolution, 3:812–829, 2011.

[144] Marshall W. Nirenberg and J. Heinrich Matthaei. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings of the National Academy of Sciences, 47(10):1588–1602, 1961.

[145] Richard A. Notebaart, Martijn A. Huynen, Bas Teusink, et al. Correlation between sequence conservation and the genomic context after gene duplication. Nucleic Acids Research, 33(19):6164–6171, 2005.

[146] Johan A.A. Nylander, James C. Wilgenbusch, Dan L. Warren, and David L. Swofford. AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics, 24(4):581–583, 2008.

[147] Severo Ochoa. Enzymatic synthesis of Ribonucleic Acid (Nobel lecture). In Nobel Lectures, Physiology or Medicine 1942-1962. Elsevier, Amsterdam, 1959.

[148] The University of Utah Health Sciences. Learn.genetics, 2015. URL http://learn.genetics.utah.edu/.

[149] Robert J. O’Hara. Population thinking and tree thinking in systematics. Zoologica Scripta, 26(4):323–329, 1997.

[150] Susumu Ohno. Evolution by gene duplication. Springer Science & Business Media, 2013. ISBN 9783642866593.

[151] Tomoko Ohta. Slightly deleterious mutant substitutions in evolution. Nature, 246(5428):96–98, 1973.

[152] Eric M. Ostertag and Haig H. Kazazian Jr. Biology of mammalian L1 retrotransposons. Annual Review of Genetics, 35(1):501–538, 2001.

[153] Richard Owen and William W. Cooper. Lectures on the Comparative Anatomy and Physiology of the Invertebrate Animals: delivered at the Royal College of Surgeons. Longman, Brown, Green, and Longmans, London, 1843.

[154] Joe Parker, Georgia Tsagkogeorga, James A. Cotton, et al. Genome-wide signatures of convergent evolution in echolocating mammals. Nature, 502(7470):228–231, Oct 2013.

[155] Frances Pearl, Annabel Todd, Ian Sillitoe, et al. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Research, 33(suppl 1):D247–D251, 2005.

[156] P. Pipenbacher, Alexander Schliep, Sebastian Schneckener, et al. ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics, 18(suppl 2):S182–S191, 2002.

[157] Walter Pirovano and Jaap Heringa. Multiple sequence alignment. In Bioinformatics. Humana, 2008. ISBN 9781603271592.

[158] Martyn Plummer, Nicky Best, Kate Cowles, and Karen Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, 2006.

[159] Nicholas H. Putnam, Thomas Butts, David E.K. Ferrier, et al. The amphioxus genome and the evolution of the chordate karyotype. Nature, 453(7198):1064–1071, 2008.

[160] Daniel L. Rabosky. Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 9(2):e89543, 2014.

[161] Adrian E. Raftery and Steven M. Lewis. Practical Markov chain Monte Carlo - comment: one long run with diagnostics: implementation strategies for Markov chain Monte Carlo. Statistical Science, pages 493–497, 1992.

[162] Andrew Rambaut, Marc A. Suchard, Dong W. Xie, and Alexei J. Drummond.Figtree, 2015. URL http://tree.bio.ed.ac.uk/software/figtree.

[163] Andrew Rambaut, Marc A. Suchard, Dong W. Xie, and Alexei J. Drummond.Tracer v1.6, 2015. URL http://beast.bio.ed.ac.uk/Tracer.

[164] Shruti Rastogi and David A. Liberles. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evolutionary Biology, 5(1):28, 2005.

[165] T.B.K. Reddy, Alex D. Thomas, Dimitri Stamatis, et al. The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta) genome project classification. Nucleic Acids Research, 43(D1):1099–1106, 2014.

[166] Gerald R. Reeck, Christoph de Haën, David C. Teller, et al. “Homology” in proteins and nucleic acids: A terminology muddle and a way out of it. Cell, 50(5):667, 1987.

[167] Christian G. Roessler, Branwen M. Hall, William J. Anderson, et al. Transitive homology-guided structural studies lead to discovery of cro proteins with 40% sequence identity but different folds. Proceedings of the National Academy of Sciences, 105(7):2343–2348, 2008.

[168] Igor B. Rogozin, David Managadze, Svetlana A. Shabalina, and Eugene V. Koonin. Gene family level comparative analysis of gene expression in mammals validates the ortholog conjecture. Genome Biology and Evolution, 6(4):754–762, 2014.

[169] Fredrik Ronquist and Andrew R. Deans. Bayesian phylogenetics and its influence on insect systematics. Annual Review of Entomology, 55:189–206, 2010.

[170] Fredrik Ronquist, Maxim Teslenko, Paul van der Mark, et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61(3):539–542, 2012.

[171] Fredrik Ronquist, Paul van der Mark, and John P. Huelsenbeck. Bayesian phylogenetic analysis using MRBAYES. In The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge University Press, 2009.

[172] Niv Sabath, Andreas Wagner, and David Karlin. Evolution of viral proteins originated de novo by overprinting. Molecular Biology and Evolution, page mss179, 2012.

[173] Cecilia Saccone and Graziano Pesole. Handbook of Comparative Genomics: Principles and Methodology. John Wiley & Sons, 2005. ISBN 9780471326410.

[174] Kristoffer Sahlin. Estimating convergence of Markov chain Monte Carlo simulations. Stockholm University, Master Thesis, 2011.

[175] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987.

[176] Frederick H. Sanger. Chemistry of insulin; determination of the structure of insulin opens the way to greater understanding of life processes. Science, 129(3359):1340–1344, May 1959.

[177] Frederick H. Sanger and E.O.P. Thompson. The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochemical Journal, 53(3):353–366, Feb 1953.

[178] Frederick H. Sanger and Hans Tuppy. The amino-acid sequence in the phenylalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. Biochemical Journal, 49(4):463, 1951.

[179] Frederick H. Sanger and Hans Tuppy. The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. Biochemical Journal, 49(4):481, 1951.

[180] Anasua Sarkar, Hayssam Soueidan, and Macha Nikolski. Identification of conserved gene clusters in multiple genomes based on synteny and homology. BMC Bioinformatics, 12(Suppl 9):S18, 2011.

[181] Robin Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

[182] Joel Sjöstrand. Reconciling gene family evolution and species evolution, 2013.

[183] Joel Sjöstrand, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. DLRS: gene tree evolution in light of a species tree. Bioinformatics, 28(22):2994–2995, 2012.

[184] Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[185] Robert R. Sokal. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438, 1958.

[186] Nan Song, Jacob M. Joseph, George B. Davis, and Dannie Durand. Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Computational Biology, 4(4):e1000063, Apr 2008.

[187] Nan Song, Robert D. Sedgewick, and Dannie Durand. Domain architecture comparison for multidomain homology identification. Journal of Computational Biology, 14(4):496–516, 2007.

[188] Mike Steel. Some statistical aspects of the maximum parsimony method. In Molecular Systematics and Evolution: Theory and Practice. Springer, 2002.

[189] Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Pol. Sci., Cl. III, 4:801–804, 1957. ISSN 0001-4095.

[190] Ann M. Stock, Victoria L. Robinson, and Paul N. Goudreau. Two-component signal transduction. Annual Review of Biochemistry, 69(1):183–215, 2000.

[191] Arlin Stoltzfus. On the possibility of constructive neutral evolution. Journal of Molecular Evolution, 49(2):169–181, 1999.

[192] Korbinian Strimmer and Arndt von Haeseler. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Molecular Biology and Evolution, 13(7):964–969, 1996.

[193] James B. Sumner. The chemical nature of enzymes (Nobel lecture). In Nobel Lectures, Chemistry 1942-1962. Elsevier, Amsterdam, 1946.

[194] Lena Svensson, Kimberley Howarth, Alison McDowall, et al. Leukocyte adhesion deficiency-III is caused by mutations in KINDLIN3 affecting integrin activation. Nature Medicine, 15(3):306–312, 2009.

[195] Guy Tanentzapf and Nicholas H. Brown. An interaction between integrin and the talin FERM domain mediates integrin activation but not linkage to the cytoskeleton. Nature Cell Biology, 8(6):601–606, 2006.

[196] Diethard Tautz and Tomislav Domazet-Lošo. The evolutionary origin of orphan genes. Nature Reviews Genetics, 12(10):692–702, 2011.

[197] Simon Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.

[198] H. Jean Thiébaux and Francis W. Zwiers. The interpretation and estimation of effective sample size. Journal of Climate and Applied Meteorology, 23(5):800–811, 1984.

[199] Stijn van Dongen. Graph clustering by flow simulation, 2000.

[200] Stijn van Dongen. MCL – a cluster algorithm for graphs, 2015. URL http://micans.org/mcl/.

[201] Hubert B. Vickery. The origin of the word protein. The Yale Journal of Biology and Medicine, 22(5):387, 1950.

[202] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(4):513–523, 2003.

[203] Hongyan Wang, Daina Lim, and Christopher E. Rudd. Immunopathologies linked to integrin signalling. In Seminars in Immunopathology, volume 32, pages 173–182. Springer, 2010.

[204] Ilan Wapinski, Avi Pfeffer, Nir Friedman, and Aviv Regev. Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics, 23(13):i549–i558, 2007.

[205] Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[206] James D. Watson and Francis H. Crick. Molecular structure of nucleic acids. Nature, 171(4356):737–738, 1953.

[207] Caleb Webber and Chris P. Ponting. Genes and homology. Current Biology, 14(9):R332–3, May 2004.

[208] Psychology Wikia. Human genome to genes, 2015. URL http://psychology.wikia.com/wiki/File:Human_genome_to_genes.png.

[209] Yuri I. Wolf, Igor B. Rogozin, Alexey S. Kondrashov, and Eugene V. Koonin. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research, 11(3):356–372, 2001.

[210] Cathy H. Wu, Hongzhan Huang, Lai-Su L. Yeh, and Winona C. Barker. Protein family classification and functional annotation. Computational Biology and Chemistry, 27(1):37–47, 2003.

[211] Dong-Dong Wu and Ya-Ping Zhang. Evolution and function of de novo originated genes. Molecular Phylogenetics and Evolution, 67(2):541–545, 2013.

[212] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[213] Rui Xu and Donald C. Wunsch. Clustering algorithms in biomedical research: a review. IEEE Reviews in Biomedical Engineering, 3:120–154, 2010.

[214] Zefeng Yang and Jinling Huang. De novo origin of new genes with introns in Plasmodium vivax. FEBS Letters, 585(4):641–644, 2011.

[215] Ziheng Yang and Bruce Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Molecular Biology and Evolution, 14(7):717–724, 1997.

[216] Ziheng Yang and Bruce Rannala. Molecular phylogenetics: principles and practice. Nature Reviews Genetics, 13(5):303–314, 2012.

[217] Frank Yates. Contingency tables involving small numbers and the Chi-squared test. Supplement to the Journal of the Royal Statistical Society, pages 217–235, 1934.

[218] Kuo-Chen Yeh, Shu-Hsing Wu, John T. Murphy, and J. Clark Lagarias. A cyanobacterial phytochrome two-component light sensory system. Science, 277(5331):1505–1508, 1997.

[219] Golan Yona, Nathan Linial, and Michal Linial. ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Research, 28(1):49–55, 2000.

[220] Phillip D. Zamore and Benjamin Haley. Ribo-gnome: the big world of small RNAs. Science, 309(5740):1519–1524, 2005.

[221] Jun Zhan, Xiang Zhu, Yongqing Guo, et al. Opposite role of Kindlin-1 and Kindlin-2 in lung cancers. PLoS One, 7(11):e50313, 2012.

[222] Jianzhi Zhang. Evolution by gene duplication: an update. Trends in Ecology & Evolution, 18(6):292–298, 2003.

[223] Qi Zhou, Guojie Zhang, Yue Zhang, et al. On the origin of new genes in Drosophila. Genome Research, 18(9):1446–1455, 2008.

[224] Christian M. Zmasek. Forester, 2015. URL https://github.com/cmzmasek/forester.

[225] Emile Zuckerkandl. On the molecular evolutionary clock. Journal of Molecular Evolution, 26(1-2):34–46, 1987.

[226] Emile Zuckerkandl and Linus Pauling. Molecules as documents of evolutionary history. Journal of Theoretical Biology, 8(2):357–366, 1965.
